Our pioneering work on learned approaches to Ego Motion estimation (Visual Odometry) using CNNs.
Abstract— Visual Ego-Motion Estimation, or briefly Visual Odometry (VO), is one of the key building blocks of modern SLAM systems. In the last decade, impressive results have been demonstrated in the context of visual navigation, reaching very high localization performance. However, all ego-motion estimation systems require careful parameter tuning for the specific environment they have to work in. Furthermore, even in ideal scenarios, most state-of-the-art approaches fail to handle image anomalies and imperfections, which results in less robust estimates. VO systems that rely on geometric approaches extract sparse or dense features and match them to perform Frame-to-Frame (F2F) motion estimation. However, images contain much more information that can be used to further improve the F2F estimation. A very successful approach to learning new feature representations is the use of deep Convolutional Neural Networks (CNNs). Inspired by recent advances in deep networks and by previous work on learning methods applied to VO, we explore the use of CNNs to learn both the best visual features and the best estimator for the task of visual ego-motion estimation. With experiments on publicly available datasets we show that our approach is robust with respect to blur, luminance and contrast anomalies and outperforms most state-of-the-art approaches even in nominal conditions.
Description
Ego-motion estimation is a fundamental building block of any robotic system, needed for localization and route planning, and for the more complex task of mapping an unknown environment. Most visual odometry approaches are grounded on the estimate of the camera motion between pairs of consecutive frames. This frame to frame (F2F) estimate is in most cases computed with geometric methods, i.e. through the use of projective geometry relations between 3D points of the scene and their projection on the image plane, or by minimizing the gradient of the pixel intensities across consecutive images. The initial F2F estimate is then refined with different strategies, such as bundle adjustment on a sliding window of previous frames, or loop closure detection.
However, geometric methods lack robustness to a number of very common effects, such as:
- illumination changes
- dynamic objects
- different camera calibrations
- low-textured environments
- noise and blur
In previous work we showed how learning-based techniques can be used to learn VO from data. In that work the features were hand-crafted: optical flow (OF) was extracted and its histograms were computed over different parts of the image. In this work we show how a CNN architecture can be used to learn the input-space representation for the ego-motion estimation task. In particular, we show how robust the results are when large amounts of blur and contrast change are applied to the input images.
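For reference, the hand-crafted descriptor of that previous work can be approximated as follows. This is a minimal sketch using OpenCV's Farneback dense flow; the grid size and number of histogram bins are illustrative assumptions, not the exact parameters of the original method.

```python
import cv2
import numpy as np

def of_histogram_features(prev_gray, curr_gray, grid=(4, 4), bins=8):
    """Hand-crafted F2F descriptor sketch: dense optical flow is computed
    between two consecutive grayscale frames, the image is split into a
    grid of blocks, and per-block histograms of flow orientation
    (weighted by magnitude) are concatenated into one feature vector."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = prev_gray.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            m = mag[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            a = ang[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            hist, _ = np.histogram(a, bins=bins, range=(0, 2*np.pi), weights=m)
            feats.append(hist / (hist.sum() + 1e-8))   # normalize each block
    return np.concatenate(feats)
```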
Our contributions can be summarized in the following points:
- We explore feature selection for ego-motion estimation using different CNN architectures. The CNN architectures are used to extract new input features starting from dense optical flow. Three different architectures are proposed in the paper: two of them investigate the influence of global and local optical flow fields on the ego-motion estimate (i.e., considering both the full flow image and its different sub-blocks); the last one combines the advantages of the other two in a parallel CNN that exploits both global and local information.
- We show that the presented learned estimators are able to estimate motion, outperforming other state-of-the-art geometric and learned methods. In addition, the proposed methods are able to use global information to extract camera motion and scale information while dealing with noise in the input.
- Finally, we show the performance of the presented method in difficult scenarios, using images with very different contrast and blur parameters, to demonstrate the robustness of the new features extracted by the CNN.
Network Architecture
In tasks such as object recognition and people detection, the input images are smaller than the ones typically used in VO, so simply applying one of the already proposed architectures is not straightforward: down-sampling the image could discard information that is important for the motion estimate. For this reason, we tested three different architectures and compared their performances:
- CNN-1b VO: As a basic exploratory approach, we train a deep network on the entire OF image after down-sampling it 8 times with average pooling to reach a dimension of 155 × 48 (a sketch of this configuration is given after this list).
- CNN-4b VO: The first alternative configuration tries to exploit local information. We divide the OF image into four sub-images. Each quadrant is down-sampled 4 times and then passed through a series of CNN filters analogous to CNN-1b ones. The final layer is trained to use the output of the four CNN networks to give a global F2F estimate.
- P-CNN VO: The last architecture uses the CNN filters of both CNN-1b and CNN-4b feeding their output to a fully connected network. We do so to explore the performances of a network that merges the global information of CNN-1b with the local information of CNN-4b.
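As an illustration of the CNN-1b configuration, the following PyTorch sketch builds a small convolutional regressor on the 155 × 48 down-sampled flow image. It is a minimal sketch, not the Caffe network released with the code: the number of filters, kernel sizes and the 6-D output parametrization are assumptions.

```python
import torch
import torch.nn as nn

class CNN1bVO(nn.Module):
    """Sketch of a CNN-1b style global estimator: the whole optical-flow
    image (2 channels, u and v) is average-pooled down to 155 x 48, passed
    through two conv/pool stages, and a fully connected regressor predicts
    the frame-to-frame motion (here a 6-D relative pose, an assumption)."""
    def __init__(self, out_dim=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.AdaptiveAvgPool2d((48, 155)),                 # down-sample the full flow image
            nn.Conv2d(2, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                                 # 48 x 155 -> 24 x 77
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                 # 24 x 77 -> 12 x 38
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 38, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, flow):                                 # flow: (B, 2, H, W)
        return self.regressor(self.features(flow))
```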
The architecture of the CNN-4b network is shown in Figure 1. The first section of the network is composed of four branches, identical in complexity but trained separately, that perform the first two convolutional steps (CNN1 and CNN2). Note that each of the four quadrants of the image contains some motion information that can be used to compute a motion estimate, but with ambiguity between simple turns and forward motion. We then link the output of the first CNN-pooling pair with the second one. We do so because exploratory experiments on a down-sampled version of the OF images showed that VO estimators using only the CNN1 output, or only the cascade of the two CNNs, were both able to learn good estimators, but the VO estimator learned on the concatenation of the two outputs performed better. This result shows that CNN1 and CNN2 extract different information from the OF images. We presume that CNN1 extracts finer details, while CNN2 extracts coarser ones, and that this information is not completely overlapping. After this stage the four convolved feature maps are put back together to form an image that contains the global information and is thus able to resolve the motion ambiguities using symmetry information. The last layer is a fully connected network that uses the information of all four quadrants at both resolutions, as shown in the upper part of Figure 2.
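As a rough illustration of this design, the following PyTorch sketch shows one CNN-4b branch with the CNN1/CNN2 feature concatenation and the global fully connected fusion. Layer counts, kernel sizes and channel widths are illustrative assumptions, not the Caffe definition released with the code; the output is again assumed to be a 6-D relative pose vector.

```python
import torch
import torch.nn as nn

class QuadrantBranch(nn.Module):
    """One CNN-4b branch (sketch): a quadrant of the OF image goes through
    two conv/pool stages (CNN1, CNN2); the outputs of both stages are
    flattened and concatenated, keeping finer and coarser responses."""
    def __init__(self):
        super().__init__()
        self.cnn1 = nn.Sequential(nn.Conv2d(2, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))
        self.cnn2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, quad):                       # quad: (B, 2, h, w)
        f1 = self.cnn1(quad)
        f2 = self.cnn2(f1)
        return torch.cat([f1.flatten(1), f2.flatten(1)], dim=1)

class CNN4bVO(nn.Module):
    """Sketch of CNN-4b: four separately parameterized branches, one per
    image quadrant, whose concatenated features feed a global FC regressor."""
    def __init__(self, out_dim=6):
        super().__init__()
        self.branches = nn.ModuleList(QuadrantBranch() for _ in range(4))
        # LazyLinear infers its input size (which depends on the quadrant
        # resolution) on the first forward pass.
        self.fc = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, quads):                      # quads: list of 4 flow quadrants
        feats = [b(q) for b, q in zip(self.branches, quads)]
        return self.fc(torch.cat(feats, dim=1))
```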
The last architecture, P-CNN, is a composition of the other two networks as shown in Figure 2.
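A minimal sketch of the P-CNN fusion, reusing the CNN1bVO and QuadrantBranch modules from the sketches above (again with illustrative layer sizes): the global convolutional features of CNN-1b and the local features of the four CNN-4b branches are concatenated and regressed jointly by a single fully connected network.

```python
import torch
import torch.nn as nn

class PCNNVO(nn.Module):
    """Sketch of P-CNN: global (CNN-1b style) and local (CNN-4b style)
    convolutional features are concatenated and one FC network regresses
    the frame-to-frame motion.  Assumes the modules defined above."""
    def __init__(self, out_dim=6):
        super().__init__()
        self.global_net = CNN1bVO().features       # convolutional part only
        self.local_branches = nn.ModuleList(QuadrantBranch() for _ in range(4))
        self.fc = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, flow, quads):                # full flow image + its 4 quadrants
        g = self.global_net(flow).flatten(1)
        l = [b(q) for b, q in zip(self.local_branches, quads)]
        return self.fc(torch.cat([g] + l, dim=1))
```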
Results
We tested our architectures on the KITTI dataset. The details of the experiments are given in the paper; here we show some additional material that could not be included in the original paper.
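All tables below report the average translational error [%] and rotational error [deg/m] per sequence. As a rough illustration of how such relative errors can be computed from estimated and ground-truth trajectories, here is a simplified sketch; the official KITTI devkit averages over subsequences of several fixed lengths (100 m to 800 m), which this sketch does not reproduce.

```python
import numpy as np

def relative_errors(gt_poses, est_poses, step=100):
    """Simplified KITTI-style metric sketch: for pairs of frames `step`
    apart, compare the relative ground-truth motion with the relative
    estimated motion.  Returns average translational error (percent) and
    rotational error (deg per metre).  Poses are 4x4 homogeneous matrices."""
    t_errs, r_errs = [], []
    for i in range(0, len(gt_poses) - step):
        j = i + step
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[j]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[j]
        err = np.linalg.inv(rel_gt) @ rel_est
        # straight-line length of the GT segment (the devkit uses path length)
        dist = np.linalg.norm(rel_gt[:3, 3])
        if dist < 1e-6:
            continue
        t_errs.append(np.linalg.norm(err[:3, 3]) / dist * 100.0)
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        r_errs.append(np.degrees(angle) / dist)
    return np.mean(t_errs), np.mean(r_errs)
```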
Darkened Sequences
Darkened sequences simulate dusk conditions. Lowering the contrast makes it more difficult for feature extractors to find corners; still, with these values of contrast and gamma, crisp shadows remain recognizable. Compared to the standard sequences, the average PCNN error on the darkened sequences is 1.39% higher, while SVR-S is 6.53% higher and VISO2 is 4.99% higher. Looking at the single trajectories, it is possible to notice that the decrease in performance is larger on sequences 09 and 10. We suppose that this is due to their larger field depth and higher linear speeds, which make these sequences more challenging.
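The darkened and lightened sequences are obtained with simple contrast and gamma adjustments of the original KITTI frames. A minimal sketch of such a photometric transform is given below; the specific contrast range and gamma value are illustrative, not the exact ones used to generate each condition.

```python
import numpy as np

def adjust_contrast_gamma(img, low=0.1, high=0.5, gamma=2.0):
    """Photometric degradation sketch: rescale intensities into the
    [low, high] range (contrast compression) and apply gamma correction.
    gamma > 1 darkens the image (dusk/night), gamma < 1 lightens it."""
    x = img.astype(np.float32) / 255.0
    x = low + x * (high - low)       # compress contrast into [low, high]
    x = np.power(x, gamma)           # gamma correction
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```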
| Seq. | VISO2-M Trans [%] | VISO2-M Rot [deg/m] | SVR VO Sparse Trans [%] | SVR VO Sparse Rot [deg/m] | SVR VO Dense Trans [%] | SVR VO Dense Rot [deg/m] | PCNN Trans [%] | PCNN Rot [deg/m] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | 26.33 | 0.0389 | 41.81 | 0.1114 | 18.06 | 0.0490 | 8.45 | 0.0249 |
| 09 | 13.64 | 0.0357 | 19.88 | 0.0669 | 34.30 | 0.0550 | 11.03 | 0.0338 |
| 10 | 22.74 | 0.0352 | 29.28 | 0.0670 | 25.49 | 0.0646 | 20.03 | 0.0458 |
| Avg | 23.54 | 0.0387 | 35.04 | 0.1005 | 20.34 | 0.0545 | 10.28 | 0.0300 |
Darkened 2 Sequences
Darkened 2 sequences simulate night conditions. At these levels of contrast and gamma the shadows in the images are very dark and many small details are lost. Clearly these transforms are only an approximation of what happens with low-light vision, but they give an insight into what the estimation algorithms do when there is a comparable loss of detail. With these sequences we see a stark difference between PCNN and the other methods. SVR-S has the lowest performance, probably because of its very simple Lucas-Kanade sparse feature extraction. However, SVR-D and VISO2 have a translational error that is nearly double that of PCNN, and a rotational error that is between 30% and 40% higher.
| Seq. | VISO2-M Trans [%] | VISO2-M Rot [deg/m] | SVR VO Sparse Trans [%] | SVR VO Sparse Rot [deg/m] | SVR VO Dense Trans [%] | SVR VO Dense Rot [deg/m] | PCNN Trans [%] | PCNN Rot [deg/m] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | 37.82 | 0.0493 | 52.96 | 0.1479 | 30.18 | 0.0784 | 14.53 | 0.0366 |
| 09 | 30.18 | 0.0537 | 26.26 | 0.0842 | 23.66 | 0.0773 | 15.82 | 0.0458 |
| 10 | 25.97 | 0.1305 | 38.88 | 0.1024 | 24.36 | 0.0546 | 18.53 | 0.0464 |
| Avg | 35.28 | 0.0610 | 44.61 | 0.1340 | 28.10 | 0.0792 | 15.25 | 0.0413 |
Lightened Sequences
Lightened sequences simulate high-light conditions by means of a low gamma correction value. These images also have very low contrast (min 0.2, max 0.7), so they are particularly challenging. The largest performance issues are for VISO2 on sequence 10, where it fails to extract enough features in many frames, so the error is huge. As with the preceding examples, the behaviour of PCNN is better than that of SVR and VISO2.
| Seq. | VISO2-M Trans [%] | VISO2-M Rot [deg/m] | SVR VO Sparse Trans [%] | SVR VO Sparse Rot [deg/m] | SVR VO Dense Trans [%] | SVR VO Dense Rot [deg/m] | PCNN Trans [%] | PCNN Rot [deg/m] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | 32.24 | 0.0378 | 46.51 | 0.1072 | 19.90 | 0.0595 | 10.16 | 0.0294 |
| 09 | 18.71 | 0.0268 | 22.85 | 0.0752 | 24.36 | 0.0491 | 20.08 | 0.0391 |
| 10 | 91.36 | 0.0541 | 43.73 | 0.1294 | 22.47 | 0.0734 | 21.02 | 0.0460 |
| Avg | 36.83 | 0.0380 | 40.45 | 0.1059 | 21.31 | 0.0617 | 13.51 | 0.0343 |
Blurred Sequences - radius 3 pixels
These sequences are blurred with a small radius of 3 pixels. The most striking result is that this blur is slightly beneficial to PCNN for the translational error, while for the other methods it is not. In detail, the change in error with respect to the standard sequences is **-0.33%** for PCNN, while VISO2 and SVR-S show an increase of **+4.99%** and **+8.07%**. The average rotational errors change by **+11%**, **+106%** and **+2.92%** respectively for PCNN, SVR-S and VISO2, showing a larger under-performance of sparse SVR on these sequences.
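The blurred sequences are generated by uniformly blurring every frame. Below is a minimal sketch of such a degradation using a Gaussian blur from OpenCV; whether the original sequences were produced with a Gaussian kernel or a different one is an assumption of this sketch.

```python
import cv2

def blur_frame(img, radius=3):
    """Blur degradation sketch: apply a Gaussian blur whose strength is
    controlled by `radius` (3 or 10 pixels in these experiments).  The
    exact kernel used to generate the sequences is an assumption."""
    return cv2.GaussianBlur(img, (0, 0), sigmaX=radius)
```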
| Seq. | VISO2-M Trans [%] | VISO2-M Rot [deg/m] | SVR VO Sparse Trans [%] | SVR VO Sparse Rot [deg/m] | SVR VO Dense Trans [%] | SVR VO Dense Rot [deg/m] | PCNN Trans [%] | PCNN Rot [deg/m] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | 23.70 | 0.0431 | 25.23 | 0.0674 | 14.41 | 0.0386 | 7.41 | 0.0229 |
| 09 | 10.99 | 0.0317 | 12.43 | 0.0438 | 21.99 | 0.0332 | 6.74 | 0.0253 |
| 10 | 25.83 | 0.0454 | 20.09 | 0.0463 | 26.74 | 0.0590 | 19.35 | 0.0380 |
| Avg | 23.54 | 0.0387 | 21.88 | 0.0623 | 17.52 | 0.0414 | 8.63 | 0.0262 |
Blurred Sequences - radius 10 pixels
These sequences are blurred with a radius of 10 pixels. The results show only a slight increase in error for the dense methods, and again PCNN performs best, showing that the features it learns are robust to high levels of blur. However, SVR-S and SVR-D are very similar in performance, and they are better than in the 3-pixel blur case. This suggests that there is something in SVR that helps in reducing the effects of heavy blur. The errors of VISO2 on highly blurred images are more than doubled.
| Seq. | VISO2-M Trans [%] | VISO2-M Rot [deg/m] | SVR VO Sparse Trans [%] | SVR VO Sparse Rot [deg/m] | SVR VO Dense Trans [%] | SVR VO Dense Rot [deg/m] | PCNN Trans [%] | PCNN Rot [deg/m] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | 53.25 | 0.0694 | 20.00 | 0.0493 | 14.81 | 0.0395 | 7.41 | 0.0229 |
| 09 | 38.02 | 0.0593 | 13.87 | 0.0503 | 22.06 | 0.0372 | 11.80 | 0.0350 |
| 10 | 82.37 | 0.2021 | 19.15 | 0.0514 | 26.61 | 0.0621 | 19.87 | 0.0416 |
| Avg | 54.32 | 0.0856 | 18.66 | 0.0519 | 17.96 | 0.0433 | 9.82 | 0.0286 |
Download
The following link contains the MATLAB/Python code, the Caffe net definitions and the datasets in HDF5 format:
Code
Related Publications and Citations
Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation
@ARTICLE{Costante2016,
author={G. Costante and M. Mancini and P. Valigi and T. A. Ciarfuglia},
journal={IEEE Robotics and Automation Letters},
title={Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation},
year={2016},
volume={1},
number={1},
pages={18-25},
doi={10.1109/LRA.2015.2505717},
month={Jan}
}
Evaluation of non-geometric methods for visual odometry
@ARTICLE{Ciarfuglia2014,
author={Thomas A. Ciarfuglia and Gabriele Costante and Paolo Valigi and Elisa Ricci},
journal={Robotics and Autonomous Systems},
title={Evaluation of non-geometric methods for visual odometry},
year={2014},
volume={62},
number={12},
pages={1717-1730},
issn={0921-8890},
doi={10.1016/j.robot.2014.08.001}
}