Iterative Closest Point Algorithm using Octree (CUDA Accelerated)

Background

Scan matching algorithms are ubiquitous in the field of mobile robotics. Although there are countless variations of the algorithm, each with its own benefits and drawbacks, the overarching goal is the same.

Given two sets of points, the algorithm finds the optimal rigid body transformation (rotation and translation) that results in the least error.

They are especially useful in localization, since finding the rigid transformation between two consecutive laser scans from the same robot amounts to finding the change in the robot's position and orientation.
One of the most commonly used variations of the scan matching algorithm is the iterative closest point (ICP) algorithm.

ICP takes a two-step approach to scan matching:
  • Finding the corresponding pairs of points in the two point sets
  • Finding the rigid body transformation
Finding corresponding pairs means finding two points (one in each point set) that correspond to the same point in space.
It turns out that if the optimal correspondence is known, that is, if every point in one set is matched to its correct pair in the other set, finding the optimal rigid body transformation has a closed-form solution. However, this is extremely unlikely to occur in real-life scenarios, as we are trying to match two laser scans taken from two different points of view.
In most cases, the number of points in the two point sets will not be equal in the first place, making it impossible to find the optimal correspondence.

To overcome this limitation, the ICP algorithm estimates the correspondence by choosing the closest point pairs as the corresponding pairs. This will obviously fail to return the optimal rigid body transformation in one try, as the correspondence is not entirely correct. Thus, the algorithm iterates until the error falls under a predefined threshold, hence the name Iterative Closest Point. Essentially, the problem is turned into a least squares optimization problem.
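Written out explicitly (the pairing function c(i), which maps each point in P to its currently matched point in X, is my own shorthand, not notation from the original post), each iteration solves

\[
(R^{*}, t^{*}) \;=\; \arg\min_{R,\,t} \; \sum_{i} \bigl\| R\,p_i + t - x_{c(i)} \bigr\|^{2}
\]

for the rotation R and translation t.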

The summary of the entire algorithm is:
Given two point sets X and P, the goal is to find the transformation that aligns point set P to point set X:
  1. For each point in set P, find the closest point in set X and declare them as a corresponding pair
  2. Find the optimal transformation for the given correspondence using singular value decomposition (the closed-form solution is written out just below this list)
  3. Apply the transformation to point set P
  4. Calculate the error
  5. Iterate steps 1 to 4 until the error is under a predefined threshold
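For reference, the closed-form solution used in step 2 is the standard SVD-based (Arun/Kabsch) result; the notation below is mine, not from the original post. With the corresponding pairs (p_i, x_{c(i)}) found in step 1 and their centroids \bar{p} and \bar{x},

\[
H = \sum_{i} (p_i - \bar{p})(x_{c(i)} - \bar{x})^{\top}, \qquad
H = U \Sigma V^{\top}, \qquad
R = V U^{\top}, \qquad
t = \bar{x} - R\,\bar{p},
\]

flipping the sign of the last column of V when \det(VU^{\top}) < 0 so that R is a proper rotation rather than a reflection.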
This algorithm has been implemented many times, and the biggest performance bottleneck proved to be finding the closest points between the two point sets. The brute-force way to obtain this correspondence is to calculate the distance to every point in set X for each point in set P and select the pair with the smallest distance.

This approach has a complexity of O(NM), where N is the number of points in set X and M is the number of points in set P. It is not feasible for robotics applications, because the number of points in each laser scan can be extremely large and the algorithm must complete on the order of milliseconds to be usable in practical localization algorithms. In short, optimizations must be made.
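As a rough illustration, a brute-force search might look like the sketch below (hypothetical helper, not the actual code from this project); calling it once for every point in set P is what produces the O(NM) cost.

#include <limits>
#include <vector>

struct Point { float x, y, z; };

// Return the index of the point in X closest to p by scanning every candidate.
int bruteForceClosest(const std::vector<Point>& X, const Point& p) {
    int best = -1;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < X.size(); ++i) {
        const float dx = X[i].x - p.x;
        const float dy = X[i].y - p.y;
        const float dz = X[i].z - p.z;
        const float d = dx * dx + dy * dy + dz * dz;  // squared distance suffices for comparison
        if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    return best;
}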

Octree-based Parallel ICP Implementation

To optimize finding the closest point pairs, I employed two techniques: an octree and parallel programming.

An octree is a tree data structure that recursively partitions space into 8 smaller regions; it is essentially a multi-dimensional binary search tree. Each point in set X is stored in the leaf nodes of the tree, and searching for the closest point in set X for a given point in set P means traversing down the octree to a leaf node and then performing the brute-force distance comparison there.
(this is a quadtree, not an octree, but the idea is exactly the same)
Although you are still doing a brute-force distance comparison at the leaf node, the number of points to compare is much smaller, because you are only comparing against the points inside the region that the leaf node represents.
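A minimal sketch of that lookup, assuming a hypothetical node layout (and reusing the Point struct and includes from the earlier sketch): descend to the leaf whose cube contains the query, then brute-force only the points stored in that leaf. For simplicity, it assumes every child visited along the way exists.

struct OctreeNode {
    bool isLeaf;
    float cx, cy, cz;              // center of the cube this node covers
    OctreeNode* children[8];       // used when !isLeaf
    std::vector<Point> points;     // used when isLeaf: the points of set X inside this cube
};

// Walk down to the leaf containing q, then compare q only against that leaf's points.
const Point* closestInLeaf(const OctreeNode* node, const Point& q) {
    while (!node->isLeaf) {
        const int octant = (q.x > node->cx ? 1 : 0)
                         | (q.y > node->cy ? 2 : 0)
                         | (q.z > node->cz ? 4 : 0);
        node = node->children[octant];
    }
    const Point* best = nullptr;
    float bestDist = std::numeric_limits<float>::max();
    for (const Point& p : node->points) {
        const float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
        const float d = dx * dx + dy * dy + dz * dz;
        if (d < bestDist) { bestDist = d; best = &p; }
    }
    return best;
}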

The other optimization technique I used was parallel programming. The essence of parallel programming is to perform computations on multiple threads at once, decreasing the total completion time compared to sequential computation, where each computation is done one after the other. The GPU is the preferred hardware for this, as it is made up of far more processing units than the CPU, even though each unit is individually less powerful. For this project, I used Nvidia's CUDA toolkit to parallelize my code.

However, parallelization comes with a heavy cost: it makes the code much more complicated to write and debug. Part of the reason is the need for synchronization. Take, for example, summing an array of integers. Coding this sequentially is extremely easy: iterate through the elements one by one and add each value to the total. The parallel version, on the other hand, is much more complicated.
A method called parallel reduction must be used: the array is repeatedly split in half, and one half is added onto the other, with each addition performed by an individual thread to speed up the process. Between each step labeled in the diagram, all threads must be synchronized, meaning that if a thread finishes its addition faster than the others, it must wait until all threads have completed their additions before moving on to the next step. Omitting the synchronization step can lead to all kinds of issues common to multithreading, such as race conditions.
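A minimal single-block CUDA sketch of this reduction (assuming the block size is a power of two and at least n; not the actual code from this project). The __syncthreads() barrier is exactly the synchronization step described above.

__global__ void sumReduce(const int* in, int* out, int n) {
    extern __shared__ int sdata[];
    const unsigned tid = threadIdx.x;

    // Each thread copies one element (or 0 if past the end) into shared memory.
    sdata[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    // Repeatedly fold the upper half of the active range onto the lower half.
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();  // every thread must finish this step before the next halving begins
    }

    if (tid == 0) *out = sdata[0];  // thread 0 holds the final sum
}

// Example launch: sumReduce<<<1, 256, 256 * sizeof(int)>>>(d_in, d_out, n);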

However, summing an array of integers is an inherently sequential task, which is why it is much more difficult to make it parallel. The reason I showed this example is to prove the point that even the simplest algorithms can become difficult to parallelize if they are not inherently parallel. Thus, when optimizing an algorithm through parallel programming, it is important to first consider the inherent nature of the algorithm.

Naturally, the algorithms that benefit the most from parallelization are the ones that do not need to be synchronized, which is the case for searching through an octree. Finding the closest point in set X from the octree is an entirely independent process for each point in set P, meaning the search for one point does not depend on the search for another, so the searches never need to be synchronized. Exploiting this inherently parallel nature of the nearest point search, the process of finding the closest point for each point in set P can be dramatically sped up. Optimizing this step is a massive performance boost for the entire ICP algorithm, as it is the performance bottleneck of the algorithm.
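A sketch of how that parallel search could look in CUDA, under an assumed device-friendly layout (the octree flattened into an array of nodes with child indices, each leaf owning a contiguous range of the target-point array; none of these names are from the actual project code). One thread handles one point of set P, and no barrier is ever required.

struct GpuNode {
    int   child[8];      // index of each child node; only meaningful when !isLeaf
    int   firstPoint;    // leaf only: start of this leaf's range in the target-point array
    int   pointCount;    // leaf only: number of target points in this leaf
    float cx, cy, cz;    // center of the cube this node covers
    bool  isLeaf;
};

// Walk down to the leaf containing q, then brute-force that leaf's points.
__device__ int closestTargetIndex(const GpuNode* nodes, const float3* targetPts,
                                  int rootIdx, float3 q) {
    int idx = rootIdx;
    while (!nodes[idx].isLeaf) {
        const GpuNode& n = nodes[idx];
        const int octant = (q.x > n.cx ? 1 : 0) | (q.y > n.cy ? 2 : 0) | (q.z > n.cz ? 4 : 0);
        idx = n.child[octant];
    }
    int best = -1;
    float bestDist = 3.4e38f;  // effectively +infinity for float
    for (int k = 0; k < nodes[idx].pointCount; ++k) {
        const float3 p = targetPts[nodes[idx].firstPoint + k];
        const float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
        const float d = dx * dx + dy * dy + dz * dz;
        if (d < bestDist) { bestDist = d; best = nodes[idx].firstPoint + k; }
    }
    return best;
}

// One thread per source point; each search is fully independent, so no __syncthreads() is needed.
__global__ void findCorrespondences(const GpuNode* nodes, int rootIdx,
                                    const float3* targetPts,
                                    const float3* srcPts, int numSrc,
                                    int* matchIdx) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numSrc) return;  // guard threads beyond the end of the source set
    matchIdx[i] = closestTargetIndex(nodes, targetPts, rootIdx, srcPts[i]);
}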

Result

Three videos showing the algorithm in motion:
  • Red points = target point cloud
  • Green points = source point cloud (the transformation is applied to this point cloud)
  • Blue lines = nearest neighbor correspondence
The algorithm slowed down, with the correspondence lines shown:

The algorithm at full speed without the correspondence lines (actually, this is slowed down too, because the visualization software I used (RViz) lagged when run at full speed):

Actual full speed footage:
