ARM-VO: 8 FPS monocular visual odometry on Raspberry Pi 3
My master's thesis was to design and implement a camera-based localization system for a six-wheeled robot. The onboard computer was a Raspberry Pi 3, and getting reasonable performance out of it took a lot of effort. This post explains how I got 8-10 FPS localization on the KITTI dataset.
1. SLAM or Odometry?
For those who work on camera-based localization, it is clear that visual SLAM (VSLAM) is often more expensive than visual odometry (VO). This is mainly because VSLAM is concerned with both localization and mapping, while VO only aims to provide localization. The mapping task in VSLAM requires map maintenance and loop closing, which add more computation as the camera explores new regions. Although VSLAM is potentially more accurate, I selected VO because of my resource-constrained processor.
2. Monocular or Stereo?
Recent studies show that the community is more interested in single-camera solutions than in dual-camera ones. The reasons are:
A single camera device is much more ubiquitous than a dual camera device. This opens a wider application area for monocular algorithms.
Dual camera systems are more expensive and require more calibration.
Stereo algorithms are often heavier than monocular ones because more data is being processed.
To be fair, stereo algorithms are not that bad. They are more accurate (provided the baseline is large enough relative to the object distance) and, most importantly, they give a metrically scaled pose, which a monocular system cannot easily recover.
Anyway, I decided to go with a monocular algorithm. There are some solutions for scale recovery which I’ll mention later.
3. Direct or Indirect?
The most common approach to camera egomotion estimation is indirect. The image is represented by a set of keypoints, which are then matched with the keypoints in the previous frame(s) by comparing descriptors. This is opposed to a direct method, which uses pixel intensities directly in a photometric optimization. Direct methods can better handle low-textured environments and are faster than indirect approaches that use complex detectors and descriptors (such as SIFT, KAZE, SURF). But I decided to go with an indirect strategy because:
The robot was intended to work in outdoor environments, so a lack of keypoints was not a concern.
My camera was a consumer-grade CMOS rolling-shutter sensor, which was prone to severe distortions. Studies have shown that direct methods need global-shutter cameras and degrade easily with rolling-shutter sensors (especially ones with a slow readout time).
4. Which detector?
There are plenty of different keypoint detectors. Accuracy, distinctiveness, repeatability, cost, and robustness to noise are the factors to consider when choosing one. I used the FAST algorithm, mainly because it is much faster than other detectors.
As I will explain later, ARM-VO does not need many keypoints. At most 300 keypoints are selected, spread uniformly over the image.
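One common way to get such a uniform spread is grid bucketing: split the image into cells and keep only the strongest few keypoints in each cell. The sketch below assumes that strategy. The grid size, the `(x, y, response)` tuple layout and the name `select_uniform` are illustrative choices and are not taken from the ARM-VO source.

```python
def select_uniform(keypoints, width, height, max_points=300, cols=10, rows=6):
    """Keep at most max_points keypoints, spread over a cols x rows grid.

    keypoints: iterable of (x, y, response) tuples (hypothetical layout).
    """
    # Every cell may contribute the same number of keypoints.
    quota = max(1, max_points // (cols * rows))

    # Group keypoints by the grid cell they fall into.
    cells = {}
    for x, y, response in keypoints:
        cx = min(int(x * cols / width), cols - 1)
        cy = min(int(y * rows / height), rows - 1)
        cells.setdefault((cx, cy), []).append((x, y, response))

    # From each cell keep only the strongest responses.
    selected = []
    for bucket in cells.values():
        bucket.sort(key=lambda kp: kp[2], reverse=True)
        selected.extend(bucket[:quota])
    return selected
```

With these defaults each of the 60 cells contributes at most 5 points, so a densely textured corner of the image cannot use up the whole budget.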
5. Matcher or Tracker?
Keypoint matching is the bottleneck of most visual odometry algorithms. The reason goes back to the way it works:
“the majority of detected keypoints in the current frame must be detected again in the next frame so that a nearest neighbor search in their descriptors could tell us which ones are corresponding.”
The necessity to detect common keypoints in the frames leaves two possible solutions for us:
using a highly repeatable detector that guarantees such condition, or
using a conventional fast detector and detecting many points, which raises the chance that keypoints are repeated.
Both solutions impose a high computational burden. The first one requires blob detectors (such as SIFT, SURF or MSER), which are often slow, and the second one requires a lot of descriptors to be computed and compared, which is also slow.
Apart from detection, calculating descriptors is also a bottleneck, and this goes back to what they were designed for. Most descriptors were designed to handle large translation, rotation and scale changes between images. We do not need that level of invariance in applications like visual odometry, where a stream of frames arrives with little inter-frame change.
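To make the cost concrete, here is a minimal sketch of the nearest neighbor search described above, assuming binary descriptors stored as Python integers and compared by Hamming distance. The function names and the `max_distance` threshold are hypothetical. The point to notice is the double loop: the work grows with the product of the keypoint counts in the two frames.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_brute_force(prev_desc, curr_desc, max_distance=64):
    """For each previous descriptor, find its nearest neighbor in the current frame.

    Performs len(prev_desc) * len(curr_desc) distance evaluations.
    max_distance is an assumed rejection threshold, not a value from ARM-VO.
    """
    matches = []
    for i, d_prev in enumerate(prev_desc):
        best_j, best_dist = -1, max_distance + 1
        for j, d_curr in enumerate(curr_desc):
            dist = hamming(d_prev, d_curr)
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j >= 0:  # something was closer than the threshold
            matches.append((i, best_j, best_dist))
    return matches
```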
Fortunately, matching is not the only way to find the association between frames. Tracking is a good alternative that can work with far fewer points, which in turn speeds up all later stages of the algorithm. I used the KLT algorithm to track the detected keypoints.
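The core of KLT is a small iterative solve per keypoint, with no descriptors involved. The sketch below is a stripped-down, single-level, translation-only Lucas-Kanade step on plain Python lists. It is meant only to show the idea: a real tracker such as the pyramidal one in OpenCV adds image pyramids, bounds checks and convergence tests. All names here are illustrative.

```python
import math


def bilinear(img, x, y):
    """Sample a list-of-rows image at a sub-pixel location (no bounds checks)."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x0 + 1]
    bottom = (1 - fx) * img[y0 + 1][x0] + fx * img[y0 + 1][x0 + 1]
    return (1 - fy) * top + fy * bottom


def track_point(prev, curr, cx, cy, half_win=7, iterations=10):
    """Estimate the displacement (dx, dy) of the patch centered at integer (cx, cy)."""
    # Gradients and structure tensor of the template patch, computed once.
    samples = []
    gxx = gxy = gyy = 0.0
    for y in range(cy - half_win, cy + half_win + 1):
        for x in range(cx - half_win, cx + half_win + 1):
            ix = 0.5 * (prev[y][x + 1] - prev[y][x - 1])
            iy = 0.5 * (prev[y + 1][x] - prev[y - 1][x])
            samples.append((x, y, ix, iy, prev[y][x]))
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return None  # untextured patch: the displacement is not observable

    dx = dy = 0.0
    for _ in range(iterations):
        # Accumulate the intensity mismatch weighted by the template gradient.
        bx = by = 0.0
        for x, y, ix, iy, value in samples:
            diff = value - bilinear(curr, x + dx, y + dy)
            bx += diff * ix
            by += diff * iy
        # Solve the 2x2 system G * step = b and update the estimate.
        dx += (gyy * bx - gxy * by) / det
        dy += (gxx * by - gxy * bx) / det
    return dx, dy
```

Each tracked point costs a handful of passes over a small window, regardless of how many other points exist in the frame.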
6. Fundamental or Homography matrix?
The tracker is not perfect. It may produce false tracks, which lead to wrong motion estimates. To reject outliers with RANSAC, we must give it a model that describes the inter-frame geometry. The fundamental matrix is the most general model, but it fails when there is little or no translation between frames. Also, coplanar points lead to a degenerate case when using the fast 8-point algorithm. These two cases are exactly what the homography is designed for. The Geometric Robust Information Content (GRIC) is a good criterion for this kind of model selection. ARM-VO skips incoming frames for which GRIC selects the homography as the better model.
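As a rough illustration, GRIC can be computed from the squared residuals of each fitted model. The sketch below follows the commonly cited form of Torr's criterion, with data dimension r = 4 for point pairs, (d, k) = (3, 7) for the fundamental matrix and (2, 8) for the homography. The weights ln(r), ln(rn) and 2 are the usual choices and are an assumption here, since the exact constants used in ARM-VO are not stated in this post. The function names are hypothetical.

```python
import math


def gric(squared_residuals, sigma, d, k, r=4.0):
    """GRIC score of a model; lower is better.

    d: dimension of the model manifold, k: number of model parameters,
    r: dimension of the data, sigma: assumed standard deviation of the error.
    """
    n = len(squared_residuals)
    lambda1 = math.log(r)
    lambda2 = math.log(r * n)
    lambda3 = 2.0
    cap = lambda3 * (r - d)  # robust cap on each residual's contribution
    data_term = sum(min(e2 / sigma ** 2, cap) for e2 in squared_residuals)
    return data_term + lambda1 * d * n + lambda2 * k


def prefers_homography(res_f, res_h, sigma=1.0):
    """True when the homography explains the tracks better than the fundamental matrix."""
    gric_f = gric(res_f, sigma, d=3, k=7)
    gric_h = gric(res_h, sigma, d=2, k=8)
    return gric_h < gric_f
```

When both models fit the tracks about equally well, the smaller structure penalty of the homography wins, which signals a near-degenerate frame that can be skipped.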
7. Scale estimation
Resolving the scale ambiguity of a monocular system requires external information. This information might be the observations from a gyroscope, the camera height, or any constraint that simplifies the problem. I used the LibViso2 approach, which estimates the scale factor from the known camera height and pitch angle.
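The idea can be sketched as follows: triangulated points that lie on the road all sit at the same (unscaled) height below the camera, so the ratio between the real camera height and that reconstructed height gives the scale. The code below is a simplified illustration of this idea, not a transcription of LibViso2. The axis convention (x right, y down, z forward, positive pitch tilting the camera downwards), the mode-seeking bandwidth `rel_sigma` and the function name are all assumptions.

```python
import math


def estimate_scale(points, camera_height, pitch, rel_sigma=0.02):
    """Estimate the metric scale from up-to-scale 3D points in the camera frame."""
    # Unit vector pointing towards the ground, expressed in the camera frame.
    down = (0.0, math.cos(pitch), math.sin(pitch))
    heights = [down[1] * y + down[2] * z for _, y, z in points]
    heights = [h for h in heights if h > 0.0]  # keep points below the camera
    if not heights:
        return None

    # Bandwidth relative to the median height (assumed heuristic).
    median = sorted(heights)[len(heights) // 2]
    sigma = rel_sigma * median

    # Mode seeking: the height supported by the most points is taken as the ground.
    best_height, best_score = None, -1.0
    for candidate in heights:
        score = sum(
            math.exp(-((h - candidate) ** 2) / (2.0 * sigma ** 2)) for h in heights
        )
        if score > best_score:
            best_height, best_score = candidate, score

    return camera_height / best_height
```

Multiplying the unit-norm translation by this factor yields a metric translation, as long as enough of the tracked points actually lie on the ground.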
8. Parallel-processing + Neon
While the above algorithm is both faster and more accurate than LibViso2 (read the discussion in the paper), it still runs at only 3-4 FPS on the Raspberry Pi 3. To accelerate the code, I had to customize it for ARM CPUs. The Cortex-A53 offered three more cores and a vector unit (Neon) for vectorization. Also, the 32-bit architecture was a reason to avoid double-precision operations as much as possible. A combination of the TBB and OpenMP libraries, Neon C intrinsics and a modified OpenCV source gave me an average of 8 FPS, which is several times faster than LibViso2, ORB-SLAM2 and almost all other forward-looking monocular algorithms.
9. Code and Paper
There are more details about the algorithm and the implementation than I could cover here. If you are interested, refer to the paper and the code.