ARM-VO: 8 FPS monocular visual odometry on Raspberry Pi 3

For my master's thesis, I designed and implemented a camera-based localization system for a six-wheeled robot. The onboard computer was a Raspberry Pi 3, and getting reasonable performance out of it took a lot of effort. This post explains how I reached 8-10 FPS localization on the KITTI dataset.

1. SLAM or Odometry?

Anyone involved in camera-based localization knows that visual SLAM (VSLAM) is often more expensive than visual odometry (VO). This is mainly because VSLAM handles both localization and mapping, while VO only aims to provide localization. The mapping task in VSLAM requires map maintenance and loop closing, which add more computation as the camera explores new regions. Although VSLAM is potentially more accurate, I chose VO because of my resource-constrained processor.

2. Monocular or Stereo?

Recent studies show that the community is more interested in single-camera solutions than in dual-camera ones. The reasons are:

A single-camera device is far more common than a dual-camera device. This opens up a wider application area for monocular algorithms.
Dual-camera systems are more expensive and require more calibration.
Stereo algorithms are often heavier than monocular ones because they process more data.

To be fair, stereo algorithms are not that bad. They are more accurate (as long as the objects are not too far away relative to the baseline) and, most importantly, they give a true-scale pose, which a monocular system cannot easily recover.

Anyway, I decided to go with a monocular algorithm. There are some solutions for scale recovery which I’ll mention later.

3. Direct or Indirect?

The most common approach to camera egomotion estimation is the indirect one: the image is represented by a set of keypoints, which are matched to the keypoints of the previous frame(s) by comparing descriptors. A direct method, by contrast, uses the pixel intensities directly in a photometric optimization. Direct methods handle low-textured environments better and are faster than indirect approaches that rely on complex detectors and descriptors (such as SIFT, KAZE, or SURF). Still, I chose the indirect strategy because:

The robot was intended to work outdoors, so a lack of keypoints was not a concern.
My camera was a consumer-grade CMOS rolling-shutter sensor, which is prone to severe distortions. Studies have shown that direct methods require global-shutter cameras and degrade easily with rolling-shutter sensors (especially those with a slow readout time).

4. Which detector?

There are plenty of keypoint detectors. Accuracy, distinctiveness, repeatability, cost, and robustness to noise are the factors to consider when choosing one. I used the FAST algorithm, mainly because it is faster than the other detectors.

As I will explain later, ARM-VO does not need many keypoints. At most 300 keypoints are selected, spread uniformly over the image.
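
The post does not describe how the uniform selection is done. One common way to get an even spread is grid bucketing, and the sketch below assumes that scheme. The function name `select_uniform`, the 40-pixel cell size, and the (x, y, score) tuple layout are illustrative choices, not ARM-VO's actual code.

```python
from collections import defaultdict


def select_uniform(keypoints, max_points=300, cell=40):
    """Pick at most max_points keypoints spread over the image.

    keypoints: list of (x, y, score) tuples, e.g. FAST corners with
    their response. cell: side length of a grid cell in pixels
    (assumed value, tune for the image size).
    """
    # Group keypoints by the grid cell they fall into.
    cells = defaultdict(list)
    for x, y, score in keypoints:
        cells[(int(x) // cell, int(y) // cell)].append((x, y, score))

    # Within each cell, put the strongest keypoints first.
    for bucket in cells.values():
        bucket.sort(key=lambda kp: kp[2], reverse=True)

    # Round-robin over the cells: first the best keypoint of every cell,
    # then the second best, and so on, until the budget is used up.
    selected = []
    rank = 0
    while len(selected) < max_points:
        took_any = False
        for key in sorted(cells):
            bucket = cells[key]
            if rank < len(bucket):
                selected.append(bucket[rank])
                took_any = True
                if len(selected) == max_points:
                    break
        if not took_any:
            break  # every cell is exhausted
        rank += 1
    return selected
```

With this scheme, a crowded region cannot use up the whole budget: a cell holding a single weak corner still contributes that corner before any cell contributes its second one.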

5. Matcher or Tracker?

Keypoint matching is the bottleneck of most visual odometry algorithms. The reason goes back to the way it works:

“the majority of detected keypoints in the current frame must be detected again in the next frame so that a nearest neighbor search in their descriptors could tell us which ones are corresponding.”

The need to detect the same keypoints in consecutive frames leaves us with two options:

using a highly repeatable detector that guarantees this condition, or
using a conventional fast detector and detecting many points, which raises the chance that keypoints are repeated.

Both options carry a high computational cost. The first requires blob detectors (such as SIFT, SURF or MSER), which are often slow. The second requires many descriptors to be computed and compared, which is also slow.

Apart from detection, computing descriptors is also a bottleneck, and this goes back to what they were designed for. Most descriptors were designed to handle large translation, rotation, and scale changes between images. We do not need that level of invariance in applications like visual odometry, where a stream of frames arrives with small inter-frame changes.

Fortunately, matching is not the only way to find the association between frames. Tracking is a good alternative that works with far fewer points, which greatly speeds up all subsequent stages of the algorithm. I used the KLT algorithm to track the detected keypoints.

6. Fundamental or Homography matrix?

The tracker is not perfect. It may produce false tracks, which lead to wrong motion estimates. To reject these outliers with RANSAC, we must give it a model that describes the inter-frame geometry. The fundamental matrix is the most general model, but it fails when there is little or no translation between frames. Coplanar points are also a degenerate case for the fast 8-point algorithm. These two situations are exactly what the homography is designed for. The Geometric Robust Information Content (GRIC) is a good criterion for choosing between the two models. ARM-VO skips incoming frames for which GRIC selects the homography as the better description of the geometry.
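
As a rough illustration of this model selection, the sketch below computes GRIC in the form usually attributed to Torr and compares the two models. It assumes the squared residuals of the tracked points under each model are already available (for example, squared distances to the epipolar lines for the fundamental matrix and squared transfer errors for the homography). The function names and `sigma`, the assumed standard deviation of the tracking noise in pixels, are illustrative and not taken from ARM-VO.

```python
import math

# (d, k) per model: d is the dimension of the model manifold,
# k is the number of model parameters.
MODEL_DIMS = {"F": (3, 7), "H": (2, 8)}


def gric(squared_residuals, sigma, model):
    """GRIC score of a model; the lower score wins.

    GRIC = sum_i min(e_i^2 / sigma^2, lam3 * (r - d))
           + lam1 * d * n + lam2 * k
    with r = 4 (a correspondence is a pair of 2D points),
    lam1 = ln(r), lam2 = ln(r * n), lam3 = 2.
    """
    r = 4
    d, k = MODEL_DIMS[model]
    n = len(squared_residuals)
    lam1 = math.log(r)
    lam2 = math.log(r * n)
    lam3 = 2.0
    # Robust data term: each residual is capped so that a single
    # outlier cannot dominate the score.
    data_term = sum(min(e2 / sigma ** 2, lam3 * (r - d))
                    for e2 in squared_residuals)
    return data_term + lam1 * d * n + lam2 * k


def keep_frame(residuals_f, residuals_h, sigma=1.0):
    """True if the fundamental matrix explains the frame better."""
    return gric(residuals_f, sigma, "F") < gric(residuals_h, sigma, "H")
```

Because the homography has the lower-dimensional manifold, it wins whenever it fits the points about as well as the fundamental matrix. That is the situation expected for near-pure rotation or a planar scene, and those are the frames skipped in the description above.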

7. Scale estimation

The scale ambiguity of a monocular system can only be resolved with external information. This information might be the observations from a gyroscope, the camera height, or any constraint that simplifies the problem. I used the LibViso2 approach, which estimates the scale factor from the known camera height and pitch angle.
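
The idea behind height-based scale recovery can be sketched as follows. This is a simplified illustration, not LibViso2's or ARM-VO's actual procedure. It assumes that some triangulated points are already known to lie on the ground, that the camera frame has x to the right, y down, and z forward, and that a positive `pitch` means the camera is tilted downward. The sign convention in the real libraries may differ, and the function name is hypothetical.

```python
import math
from statistics import median


def estimate_scale(ground_points, camera_height, pitch):
    """Scale factor that maps unscaled VO units to metres.

    ground_points: (x, y, z) points triangulated in the unscaled
    camera frame and assumed to lie on the ground plane.
    camera_height: true camera height above the ground, in metres.
    pitch: downward tilt of the camera in radians (assumed convention).
    """
    # Direction of "down" expressed in the tilted camera frame.
    n_y, n_z = math.cos(pitch), math.sin(pitch)
    # Distance of each ground point below the camera centre,
    # measured in the unscaled units of the reconstruction.
    heights = [y * n_y + z * n_z for (_x, y, z) in ground_points]
    heights = [h for h in heights if h > 0.0]
    if not heights:
        return None  # no usable ground points in this frame
    # The median tolerates a few points that are not on the ground.
    return camera_height / median(heights)
```

The returned factor would then multiply the unit-length translation estimated from the essential matrix. For example, if the ground points sit 0.5 units below the camera and the true camera height is 1.65 m, the scale is 3.3.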

8. Parallel-processing + Neon

While the above algorithm is both faster and more accurate than LibViso2 (read the discussion in the paper), it still ran at only 3-4 FPS on the Raspberry Pi 3. To accelerate the code, I had to tailor it to ARM CPUs. The Cortex-A53 offered three more cores and a VPU for vectorization. The 32-bit architecture was also a reason to avoid double-precision operations as much as possible. A combination of the TBB and OpenMP libraries, Neon C intrinsics, and edits to OpenCV's source code gave me an average of 8 FPS, which is several times faster than LibViso2, ORB-SLAM2, and almost all other forward-looking monocular algorithms.

9. Code and Paper

The algorithm and its implementation have more details than I could cover here. If you are interested, refer to the paper and the code.

21 replies

Dario says:
March 19, 2020 at 12:41 pm

great article!
- Zana Zakaryaie says:
  March 20, 2020 at 6:32 pm
  
  Hello Dario
  Thanks
sagar eknath dhatrak says:
June 21, 2020 at 8:41 am

great work sir.
I am facing some issues while running the code.
Can you please help me?
When I run it, it shows an error:

./ARM_VO /home/pi/03/image_0 /home/pi/ARM-VO/params/Seq03.yaml

Processing Frame 0
Can’t read 000000.png!
- Zana Zakaryaie says:
  June 21, 2020 at 7:12 pm
  
  Hi
  ARM-VO expects the image names to be in the KITTI odometry data format. To solve the issue, rename your images to this format. The first image must be 000000.png, the second 000001.png, and so on.
  If you have other issues with the code, please report them in the GitHub repository so that other people can see them.
  - David says:
    April 5, 2022 at 6:44 pm
    
    Have built now with tbb.
    
    ./ARM_VO 01/image_2 params/Seq00-02.yaml
    Processing image 000000.png: Can’t read the image
    
    ….
    
    ~/ARM-VO/01/image_2 $ ls 000000.png
    000000.png
    
    so png exists…
    
    Any help appreciated
  - David says:
    April 5, 2022 at 10:11 pm
    
    also had same problem…. need a / at the end of the data path
    - Zana Zakaryaie says:
      April 6, 2022 at 9:04 am
      
      Hi David
      I’m happy that you solved your problem. Please report code-related issues in ARM-VO’s Github repo
Kuan Wei Liu says:
June 23, 2020 at 1:02 pm

Great article!
I’m trying to modify your ARM-VO into ROS version.
However, ROS built-in opencv is not installed with TBB, should I install another opencv with TBB outside the ROS to obtain the greatest performance?
- Zana Zakaryaie says:
  June 25, 2020 at 2:01 am
  
  Hi Kuan
  I’m going to push the ROS version in the next 2-3 days. There will be some other small modifications too. If you can’t wait, print cv::getBuildInformation() and make sure that OpenCV uses pthreads or a similar parallel framework. If there is no parallel framework, uninstall the ROS-installed OpenCV and build your own OpenCV from source with TBB.
  - Kuan Wei Liu says:
    July 2, 2020 at 10:58 am
    
    Thanks for your reply!
    I have a question about the algorithm and will be grateful if you can answer me.
    Your paper says that the detector is used only when the ratio of successfully
    tracked points drops below a certain threshold (ex: 0.6), however, in the main.cpp I found that the detector.detect will be invoked everytime when Fcriteria < Hcriteria.
    I don't know if I have misunderstood something.
    - Zana Zakaryaie says:
      July 2, 2020 at 12:09 pm
      
      That’s right, Kuan. The code currently available on GitHub does not include all the details mentioned in the paper. Skipping keypoint detection in some frames, plus some other things (like cleaner code, a ROS node, and support for OpenCV 4 and 64-bit OS), will be added soon.
Wei says:
May 6, 2021 at 1:52 pm

Hello Zana, thanks for your fantastic work. I have a question regarding ARM-VO/params/Seq04-10.yaml: how can I get the values of “height” and “pitch_angle”? Are these parameters from the original KITTI datasets?
- Zana Zakaryaie says:
  May 6, 2021 at 2:19 pm
  
  Hi Wei
  Yes, these are the parameters of the KITTI odometry dataset. If you want to run ARM-VO in a custom setup, you should calibrate the camera with respect to the ground. You can do this by putting a chessboard on the ground, measuring its corner coordinates, and solving a PnP problem to find the pose of the camera with respect to the chessboard. From that pose you can calculate the pitch angle and the camera height.
  - Wei says:
    May 21, 2021 at 6:26 am
    
    Hey, Zana
    
    Thanks for your reply, I have another question regarding the function tracker::calcSharrDeriv, what’s the usage of this function? as lots of implementation are arm neon code based, it is not easy to read, is there any tips/hints for this part that I can learn from? any suggestions are appreciated.
    - Zana Zakaryaie says:
      May 22, 2021 at 12:27 pm
      
      Optical flow assumes that the intensity of a scene point remains unchanged when the camera moves. So, mathematically, I(x, y, t) = I(x+dx, y+dy, t+dt). In other words, a point keeps its intensity as it moves by (dx, dy) between two frames, and solving for that motion requires the image derivatives along x and y. Scharr is a well-known filter for measuring these x and y changes (derivatives), and it is numerically more stable than Sobel. It also works better in rotations. For more information regarding OpenCV’s implementation of optical flow, check this paper. For comparing Sobel and Scharr, see this page.
      - wei says:
        May 24, 2021 at 12:58 pm
        
        Thank you so much!
      - Zana Zakaryaie says:
        May 24, 2021 at 1:18 pm
        
        You’re welcome :)
Emad says:
June 7, 2021 at 10:20 pm

Amazing article, thank you.
I have a question regarding KITTI data.
Which one should we install as we have multiple options?
- Zana Zakaryaie says:
  June 8, 2021 at 8:56 am
  
  Thanks Emad. You should download KITTI’s odometry dataset from here. There are multiple options (greyscale, color, and Velodyne laser data) but because ARM-VO works with greyscale images, you can download the one that contains only the greyscale images (link).
  - EMAD AL OMARI says:
    June 8, 2021 at 6:44 pm
    
    Thank you Zana for the reply!
    If I want to use other data, should I consider any specification for the pixels and the name format?
    - Zana Zakaryaie says:
      June 9, 2021 at 8:49 am
      
      Honestly, I don’t remember the image names in color or Velodyne laser data. But no worries about the pixels. Even if you feed RGB images to ARM-VO, it will convert them to greyscale images.