Autonomous Mobile Bot Part 2: Node Lifecycle Management

This is part of a series of posts where I document my progress in building an autonomous mobile bot from scratch. The ultimate goal of this project is to build a robot that can accomplish the following tasks:

Autonomously map an indoor space using SLAM and save the generated map
Load the generated map file and localize the robot inside the map
Move the robot around by sending move commands from a GUI that shows the map

This particular blog post will deal with how I modified the Navigation2 stack (https://github.com/ros-planning/navigation2) to manage the lifecycle of nodes.

Background:

In ROS2 there are two types of nodes: the ordinary Node and the LifecycleNode. The ordinary Node is the same as the node in ROS1. The LifecycleNode on the other hand, is a node with different states and the node can transition from one state to another.

Specifically, the four states of a LifecycleNode are:

Unconfigured
Inactive
Active
Finalized

Transitioning LifecycleNodes from one state to another during the runtime of the whole system is called managing the lifecycle of the nodes.

Lifecycle state machine diagram (source: https://design.ros2.org/articles/node_lifecycle.html)

The diagram above shows the complete picture of all the possible states (shown as blue boxes) and transitions (shown as yellow boxes).

Lifecycle management is useful in many different scenarios. For instance, when a node crashes due to an unexpected error, the LifecycleNode enters the ErrorProcessing transition, in which it can go to the Unconfigured state or the Finalized state depending on how the error was handled. If the error was handled nicely, the node goes back to the Unconfigured state from which it can be restarted. Thus, the node can be restarted even after an unexpected crash, making the system more robust.

The reason why I decided to use LifecycleNodes instead of the regular Node is because I wanted to shut down specific nodes when their functionality is no longer needed in order to free up resources. Specifically, I wanted to shut down the node responsible for SLAM when the mapping is completed, then switch to the AMCL node for localization using the map.

Both SLAM and AMCL are resource heavy nodes as they play a big part of allowing the robot to know where it is in the world. However, SLAM has the extra functionality to generate the map while it is localizing the robot onto the map. Hence, it doesn't make sense to use SLAM when you already have the map of the environment.

Logic flow of the system

Overall Design:

The Navigation2 stack is already lifecycle-complient as all of its nodes inherit from the LifecycleNode class instead of the Node class. Additionally, it also has a lifecycle manager class. However, there were still some functionalities missing to fully implement the above system.

Firstly, the SLAM node (which is actually part of the Slam-Toolbox package and not the Navigation2, but they are often used in conjunction) was not a LifecycleNode but a regular Node. Thus, I had to make it into a LifecycleNode. This meant that I had to define the on_configure(), on_activate(), on_deactivate(), and on_cleanup() functions as well as making the node inherit from the LifecycleNode class.

The main change I had to make in the SLAM node was to move the lines that starts the thread that publishes the map->odom transform into the on_activate() function. Then stop the threads in the on_deactivate() function.

nav2_util::CallbackReturn 
Slam::on_activate(const rclcpp_lifecycle::State & state)
{
  RCLCPP_INFO(get_logger(), "Activating slam...");

  sst_->on_activate();
  sstm_->on_activate();
  pose_pub_->on_activate();
  closure_assistant_->marker_publisher_->on_activate();
  closure_assistant_->scan_publisher_->on_activate();

  // start the thread that publishes map->odom transform and the visualizations
  activated_ = true;
  threads_.push_back(std::make_unique<boost::thread>(
      boost::bind(&Slam::publishTransformLoop,
      this, transform_publish_period)));
  threads_.push_back(std::make_unique<boost::thread>(
      boost::bind(&Slam::publishVisualizations, this)));

  // create bond connection
  createBond();

  return nav2_util::CallbackReturn::SUCCESS;
}

nav2_util::CallbackReturn 
Slam::on_deactivate(const rclcpp_lifecycle::State & state)
{
  RCLCPP_INFO(get_logger(), "Deactivating slam...");

  activated_ = false;
  for (int i = 0; i != threads_.size(); i++) {
    threads_[i]->join(); // join() blocks until the threads have terminated
  }
  threads_.clear();

  sst_->on_deactivate();
  sstm_->on_deactivate();
  pose_pub_->on_deactivate();
  closure_assistant_->marker_publisher_->on_deactivate();
  closure_assistant_->scan_publisher_->on_deactivate();

  // destroy bond connection
  destroyBond();

  return nav2_util::CallbackReturn::SUCCESS;
}

Additionally, I had to add the functionality to add and remove nodes from lifecycle manager. The lifecycle manager of Navigation2 keeps a list of node names as strings that it manages and if you want to shut down a particular node, you have to shut down all of the nodes inside of that list. This makes sense as the nodes of Navigation2 are dependent on each other. For example, the planner and the controller nodes needs the node to provide them with the map data. Without the map data, they are useless.

Thus, the design change that I decided on was to add the functionality to add or remove nodes from the list of node names to allow for singular shut down/start up.

The overall flow of execution would be:

Start the Navigation2 nodes along with the SLAM node
Once the map is generated, remove the SLAM node from lifecycle manager
Add the AMCL node to lifecycle manager

The first and most obvious requirement I saw was that the all the nodes had to be shut down before adding or removing a node. This is because lifecycle manager keeps a bond to all of the node it manages. A bond is a simple connection between two nodes and the bond breaks when one of the nodes crashes. Lifecycle manager has a function called checkBondConnections(), which checks to see if any of the bond between itself and its nodes is broken. If a broken bond is detected, all of its nodes are reset to their Unconfigured state then a complete restart of the system is attempted.

void
LifecycleManager::checkBondConnections()
{
  if (!system_active_ || !rclcpp::ok() || bond_map_.empty()) {
    return;
  }

  for (auto & node_name : managed_node_names_) {
    if (!rclcpp::ok()) {
      return;
    }

    if (bond_map_[node_name]->isBroken()) {
      message(
        std::string(
          "Have not received a heartbeat from " + node_name + "."));

      // if one is down, bring them all down
      RCLCPP_ERROR(
        get_logger(),
        "CRITICAL FAILURE: SERVER %s IS DOWN after not receiving a heartbeat for %i ms."
        " Shutting down related nodes.",
        node_name.c_str(), static_cast<int>(bond_timeout_.count()));
      reset(true);  // hard reset to transition all still active down
      // if a server crashed, it won't get cleared due to failed transition, clear manually
      bond_map_.clear();

      // Initialize the bond respawn timer to check if server comes back online
      // after a failure, within a maximum timeout period.
      if (attempt_respawn_reconnection_) {
        bond_respawn_timer_ = this->create_wall_timer(
          1s,
          std::bind(&LifecycleManager::checkBondRespawnConnection, this),
          callback_group_);
      }
      return;
    }
  }
}

This function gets called by a timer every 200ms.

bond_timer_ = this->create_wall_timer(
    200ms,
    std::bind(&LifecycleManager::checkBondConnections, this),
    callback_group_);

Thus, even if the bond is broken for a fraction of a second, the system resets. This is not a problem when adding a node since no bond is not broken, but it is when removing a node as it involves breaking the bond with the node to be removed. Additionally, the bonds is kept in an map called bond_map_ and modifying this variable while it is in getting called in a timer loop is not thread-safe (although I could use a mutex to lock the bond_map_ variable). A better way to go about this problem was to stop the entire system, which breaks all the bonds, then restart it, which initializes the bond_map_ variable with the new list of nodes.

The second requirement, which was not obvious at first, was that the order of the node names in the list of nodes had to stay consistent. This is because when starting or stopping the entire system, transition is called on each node sequentially. So, if node A requires the data published by node B, node A must be shut down before node B to prevent any errors.

In my particular case, the AMCL and SLAM nodes had to come before the planner node in the list of nodes. This is because the planner needs to map->odom transform published by the two nodes. Thus, it had to be started after the AMCL or SLAM nodes are already running and shut down before the two nodes are shut down.

This is evident in the implementation of lifecycle manager.

As shown in the code, the transitions configure and activate are called on each node inside the managed_node_names vector from left to right. But for other transitions, which are deactivate and clean up, the managed_node_names_ vector is traversed backwards from right to left using a reverse iterator. This shows the importance of the order of nodes in the list.

To fulfill this requirement, I decided to have two lists of nodes.

A list of all possible nodes that can be added or removed from lifecycle manager during the runtime of the program in correct order
A list of currently lifecycle-managed nodes

The first list provides the correct ordering of the nodes, which can be used by the second list to sort the nodes in the correct order after new nodes are added.

// append the nodes to the requested nodes to managed_node_names_
  managed_node_names_.insert(managed_node_names_.end(), request->node_names.begin(), request->node_names.end());

// and sort it based on the order of nodes in all_node_names_
std::sort(managed_node_names_.begin(), managed_node_names_.end(),
  [&](const std::string& str1, const std::string& str2)
  { return all_node_names_index_map_[str1] < all_node_names_index_map_[str2]; });

This is the algorithm used to add new nodes into lifecycle manager using the .insert() function, which appends the requested node list into the managed_node_names_ list. After the insertion, the nodes are not in the correct order. So, the sort() algorithm is used to sort the managed_node_names_ into the order of nodes in all_node_names_. This is done using an index map, which stores the indices of each node inside the all_node_names_.

all_node_names_ = get_parameter("all_node_names").as_string_array();
for (size_t i=0; i<all_node_names_.size(); ++i)
{
    all_node_names_index_map_[all_node_names_[i]] = i;
}

Search This Blog

Juhyung's Projects

Autonomous Mobile Bot Part 2: Node Lifecycle Management

Comments

Post a Comment

Popular posts from this blog

RGB-D Visual Odometry

Bird's eye view for wheelchair

Iterative Closest Point Algorithm using Octree (CUDA Accelerated)