RGB-D Visual Odometry

GitHub: https://github.com/Juhyung-L/RGB-D_visual_odometry

Background

In a previous blog, I wrote an explanation of my implementation of a visual odometry system I developed (https://juhyungsprojects.blogspot.com/2023/10/visual-odometry-using-sift-feature.html). The system used RGB image frames as its input to calculate the change in robot pose. A major shortcoming of that system was that the estimated change in the robot's position was scale-ambiguous, meaning only the direction of motion could be estimated, not the magnitude. This ambiguity is a physical limitation of trying to recover 3D information (change in x, y, z position) from a 2D input (image frames).

In this blog, I will explain another visual odometry system I developed, which uses an RGB-D camera instead of an RGB camera. An RGB-D camera is essentially an RGB camera with a depth sensor, which allows it to obtain a 3D point cloud of the scene in addition to the image. By utilizing the point cloud, the scale ambiguity can be overcome.

Feature Detection and Matching

Just like the RGB visual odometry, the RGB-D visual odometry puts each image frame through an image processing pipeline, and the first few steps of this pipeline are exactly the same as in the RGB visual odometry. First, feature points and their descriptors are computed for each incoming image. Second, the features from two consecutive image frames are matched to measure how the same features moved between frames. I will not go into the details of these steps, as I wrote a thorough explanation in the previous blog.

Relevant part of the code:
// convert to gray scale
cv::Mat gray_img;
cv::cvtColor(img, gray_img, cv::COLOR_BGR2GRAY);

// extract features
sift_->detectAndCompute(gray_img, cv::noArray(), frame_.kps_, frame_.desc_);

if (min_kps > frame_.kps_.size())
{
    std::cout << "Not enough key points detected in current frame. Skipping.\n";
    return;
}

// match features
std::vector<cv::Point2f> matched_kps, prev_matched_kps;
std::vector<std::vector<cv::DMatch>> matches;
std::vector<uchar> inlier_mask;
std::vector<cv::Point2f> filt_kps, prev_filt_kps;

if (prev_frame_.desc_.empty())
{
    return;
}

// ensure descriptors are of type CV_32F because knnMatch only takes CV_32F
if (frame_.desc_.type() != CV_32F) {
    frame_.desc_.convertTo(frame_.desc_, CV_32F);
}
if (prev_frame_.desc_.type() != CV_32F) {
    prev_frame_.desc_.convertTo(prev_frame_.desc_, CV_32F);
}

fbm_->knnMatch(frame_.desc_, prev_frame_.desc_, matches, 5);

if (min_matched_kps > matches.size())
{
    std::cout << "Not enough matched key points between current frame and previous frame. Skipping.\n";
    return;
}

// Lowe's ratio test to remove bad matches
for (int i=0; i<matches.size(); ++i)
{
    if (matches[i][0].distance < 0.8*matches[i][1].distance)
    {
        // distance is the measure of likelihood for correct match between a pair of key points in the search space
        // there are multiple matches per key point ordered in increasing distance (greater distance = less likely for match)
        // 1st closest match = match[i][0], 2nd closest match = match[i][1], ...
        // for a match to be considered "good" under Lowe's ratio test,
        // the distance to the first matched key point should be significantly less than the distance to the second matched key point
        matched_kps.emplace_back(frame_.kps_[matches[i][0].queryIdx].pt);
        prev_matched_kps.emplace_back(prev_frame_.kps_[matches[i][0].trainIdx].pt);
    }
}
// calculate the mask that filters out the bad matches
cv::findEssentialMat(prev_matched_kps, matched_kps, K_, cv::FM_RANSAC, 0.999, 1.0, 1000, inlier_mask);

// apply the mask
for (int i=0; i<inlier_mask.size(); ++i)
{
    if (inlier_mask[i])
    {
        filt_kps.emplace_back(matched_kps[i]);
        prev_filt_kps.emplace_back(prev_matched_kps[i]);
    }
}

Rigid-Body Transformation

After the feature correspondences are found, the corresponding 3D points are looked up in the point clouds.
// get the depth information for each key point
for (int i=0; i<filt_kps.size(); ++i)
{
    int index = static_cast<int>(filt_kps[i].y) * img_size_.width + static_cast<int>(filt_kps[i].x);
    int prev_index = static_cast<int>(prev_filt_kps[i].y) * img_size_.width + static_cast<int>(prev_filt_kps[i].x);
    if (pcl::isFinite(frame_.pc_->at(index)) && pcl::isFinite(prev_frame_.pc_->at(prev_index))) // both points need to be finite to be stored inside filtered point cloud
    {
        filt_pc.emplace_back(
            frame_.pc_->at(index).x, 
            frame_.pc_->at(index).y, 
            frame_.pc_->at(index).z);
        prev_filt_pc.emplace_back(
            prev_frame_.pc_->at(prev_index).x, 
            prev_frame_.pc_->at(prev_index).y, 
            prev_frame_.pc_->at(prev_index).z);
    }
}
The 2D feature points are stored in std::vector<cv::Point2f> filt_kps and prev_filt_kps (the filtered key points of the current and previous frame, respectively), and their corresponding 3D points are stored in std::vector<Eigen::Vector3f> filt_pc and prev_filt_pc.

With two point clouds (filt_pc and prev_filt_pc) whose correspondence is known (albeit not 100% accurately), the problem becomes finding the rigid-body transformation that aligns the two point clouds with the least alignment error. Luckily, this problem has a closed-form solution based on Singular Value Decomposition (SVD).

The mathematics behind finding the rigid body transformation:

Let's define the two point clouds as X and P (we are trying to find the rigid body transformation that will move point cloud P on top of point cloud X).
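
Given N corresponding point pairs (a point x_i from X and a point p_i from P), the goal is to find the rotation R and translation t that minimize the total squared alignment error:

$$ (R, t) = \arg\min_{R,\,t} \sum_{i=1}^{N} \left\lVert x_i - (R\,p_i + t) \right\rVert^2 $$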

Firstly, get the centers of mass of the two point clouds.
// calculate the centroid of both point clouds
Eigen::Vector3f tgt_mean = Eigen::Vector3f::Zero();
Eigen::Vector3f src_mean = Eigen::Vector3f::Zero();

// tgt_pc and src_pc have the same number of points
for (int i=0; i<tgt_pc.size(); ++i)
{
    tgt_mean.x() += tgt_pc[i].x();
    tgt_mean.y() += tgt_pc[i].y();
    tgt_mean.z() += tgt_pc[i].z();

    src_mean.x() += src_pc[i].x();
    src_mean.y() += src_pc[i].y();
    src_mean.z() += src_pc[i].z();
}
tgt_mean /= tgt_pc.size();
src_mean /= src_pc.size();
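In equation form, these centroids (tgt_mean and src_mean in the code) are:

$$ \mu_X = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \mu_P = \frac{1}{N} \sum_{i=1}^{N} p_i $$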
Secondly, calculate the W matrix.
// calculate W matrix
Eigen::Matrix3f W = Eigen::Matrix3f::Zero();
Eigen::Vector3f X, P;
for (int i=0; i<tgt_pc.size(); ++i)
{
    X.x() = tgt_pc[i].x() - tgt_mean.x();
    X.y() = tgt_pc[i].y() - tgt_mean.y();
    X.z() = tgt_pc[i].z() - tgt_mean.z();

    P.x() = src_pc[i].x() - src_mean.x();
    P.y() = src_pc[i].y() - src_mean.y();
    P.z() = src_pc[i].z() - src_mean.z();

    W += X * P.transpose();
}
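This W is the cross-covariance matrix of the two centered point clouds:

$$ W = \sum_{i=1}^{N} (x_i - \mu_X)(p_i - \mu_P)^T $$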

Lastly, take the SVD of the W matrix (with the singular values ordered so that σ1 >= σ2 >= σ3).
Then, the rotation matrix and translation vector can be computed like so:
Eigen::JacobiSVD<Eigen::Matrix3f> svd(W, Eigen::ComputeFullU | Eigen::ComputeFullV);
T.linear() = svd.matrixU() * svd.matrixV().transpose(); // calculate rotation matrix
T.translation() = tgt_mean - T.linear() * src_mean; // calculate translation vector
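
For reference, the same closed-form solution written out in equation form:

$$ W = U \Sigma V^T, \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3), \quad \sigma_1 \geq \sigma_2 \geq \sigma_3 $$
$$ R = U V^T, \qquad t = \mu_X - R\,\mu_P $$

(In the rare degenerate case where det(UV^T) = -1, the standard fix is to negate the last column of V so that R is a proper rotation rather than a reflection; the code above does not handle that case.)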

With this method, the change in the robot's pose can be computed with a scale-accurate translation.
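
As a side note, each per-frame transform has to be chained onto a running global pose to produce odometry. Below is a minimal sketch of that accumulation step (the names pose_ and updatePose are hypothetical, not taken from the repository, and it assumes T maps points from the current camera frame into the previous camera frame):

#include <Eigen/Geometry>

// robot pose in the odometry frame, starts at identity
Eigen::Affine3f pose_ = Eigen::Affine3f::Identity();

// T is the rigid-body transform estimated between the previous and current frame
void updatePose(const Eigen::Affine3f& T)
{
    // chain the relative transform onto the accumulated pose
    pose_ = pose_ * T;
}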

Results


Comparing the estimated odometry to the ground truth robot pose:

As the odometry evaluation shows, this visual odometry system still suffers from odometric drift: each new pose is computed relative to the previous pose, so the error of every estimate accumulates over time.
