Our pioneering work on learned approaches to Ego Motion estimation (Visual Odometry) using CNNs.

[Figure: CNN overview]

Abstract: Visual Ego-Motion Estimation, or briefly Visual Odometry (VO), is one of the key building blocks of modern SLAM systems. In the last decade, impressive results have been demonstrated in the context of visual navigation, reaching very high localization performance. However, all ego-motion estimation systems require careful parameter tuning procedures for the specific environment they have to work in. Furthermore, even in ideal scenarios, most state-of-the-art approaches fail to handle image anomalies and imperfections, which results in less robust estimates. VO systems that rely on geometrical approaches extract sparse or dense features and match them to perform Frame to Frame (F2F) motion estimation. However, images contain much more information that can be used to further improve the F2F estimation. A very successful approach to learning new feature representations is to use deep Convolutional Neural Networks. Inspired by recent advances in Deep Networks and by previous work on learning methods applied to VO, we explore the use of Convolutional Neural Networks to learn both the best visual features and the best estimator for the task of visual Ego-Motion Estimation. With experiments on publicly available datasets we show that our approach is robust with respect to blur, luminance and contrast anomalies and outperforms most state-of-the-art approaches even in nominal conditions.


Ego-motion estimation is a fundamental building block of any robotic system, needed for localization and route planning, and for the more complex task of mapping an unknown environment. Most visual odometry approaches are based on estimating the camera motion between pairs of consecutive frames. This frame-to-frame (F2F) estimate is in most cases computed with geometric methods, i.e. through the use of projective geometry relations between 3D points of the scene and their projections on the image plane, or by minimizing the gradient of the pixel intensities across consecutive images. The initial F2F estimate is then refined with different strategies, such as bundle adjustment on a sliding window of previous frames, or loop closure detection.
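As a small illustration of how F2F estimates are typically used, the sketch below chains relative motions into a camera trajectory by multiplying homogeneous transforms. The helper names (`f2f_transform`, `chain_poses`), the planar motion model and the axis convention (yaw about y, forward along z) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def f2f_transform(yaw_rad, forward_m):
    """Hypothetical planar F2F motion: rotation about the vertical (y) axis
    plus a translation along z, both expressed in the previous camera frame."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    T[:3, 3] = [0.0, 0.0, forward_m]
    return T

def chain_poses(f2f_estimates):
    """Accumulate relative transforms into absolute poses, starting at identity."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T in f2f_estimates:
        pose = pose @ T  # each estimate is relative to the previous camera frame
        trajectory.append(pose.copy())
    return trajectory

# Three straight steps, one step that also turns 90 degrees, one more straight step.
steps = [f2f_transform(0.0, 1.0)] * 3
steps += [f2f_transform(np.pi / 2, 1.0), f2f_transform(0.0, 1.0)]
trajectory = chain_poses(steps)
final_position = trajectory[-1][:3, 3]
```

Because every pose is a product of all earlier estimates, an error in a single F2F step propagates to the rest of the trajectory, which is why the refinement strategies mentioned above are applied.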
However, geometric methods lack robustness to a number of very common effects, such as:
  1. illumination changes
  2. dynamic objects
  3. different camera calibrations
  4. low-textured environments
  5. noise and blur
In previous work we showed how learning-based techniques could be used to learn VO from data. The features were hand-crafted, extracting Optical Flow and computing its histograms in different parts of the images. In this work we show how a CNN architecture can be used to learn the input space representation to be used for the Ego-Motion estimation task. In particular we show how robust the results are when applying large amounts of blur and contrast changes.
Our contributions can be summarized as follows:
  1. We explore feature selection for ego-motion estimation using different CNN architectures. The CNN architectures are used to extract new input features starting from dense optical flow. Three different architectures are proposed in the paper: two of them investigate the influence of global and local optical flow fields with respect to the ego-motion estimation (i.e., considering both the full flow image and its different sub-blocks); the last one combines the advantages of the others in a parallel CNN that exploits both global and local information.
  2. We show that the presented learned estimators are able to estimate motion, outperforming other state-of-the-art geometrical and learned methods. In addition, the proposed methods are able to use global information to extract camera motion and scale information, while dealing with noise in the input.
  3. Finally, we show the performance of the presented method in difficult scenarios, using images with very different contrast and blur parameters, to show the robustness of the new features extracted by the CNN.

 Network Architecture

In tasks such as object recognition and people detection the input images are smaller than the ones typically used in VO, so simply applying one of the already proposed architectures is not straightforward. Down-sampling the image could discard information that is important for motion estimation. For this reason, we tested three different architectures and compared their performance:
  1. CNN-1b VO: As a basic exploratory approach we train a deep network on the entire OF image after down-sampling it 8 times with average pooling to reach a dimension of 155 × 48.
  2. CNN-4b VO: The first alternative configuration tries to exploit local information. We divide the OF image into four sub-images. Each quadrant is down-sampled 4 times and then passed through a series of CNN filters analogous to the CNN-1b ones. The final layer is trained to use the output of the four CNN networks to give a global F2F estimate.
  3. P-CNN VO: The last architecture uses the CNN filters of both CNN-1b and CNN-4b, feeding their output to a fully connected network. We do so to explore the performance of a network that merges the global information of CNN-1b with the local information of CNN-4b.
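The input preparation described in the list above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the 1240 × 384 flow size (chosen so that a factor of 8 gives exactly 155 × 48), the cropping of incomplete blocks, and the reading of "down-sampled 8 times" as a pooling factor of 8 are all assumptions.

```python
import numpy as np

def average_pool(flow, factor):
    """Down-sample an (H, W, C) array by averaging non-overlapping
    factor x factor blocks. Rows/columns that do not fill a complete
    block are cropped (an assumption of this sketch)."""
    h, w, c = flow.shape
    h_out, w_out = h // factor, w // factor
    cropped = flow[:h_out * factor, :w_out * factor]
    blocks = cropped.reshape(h_out, factor, w_out, factor, c)
    return blocks.mean(axis=(1, 3))

def split_quadrants(flow):
    """Split an (H, W, C) array into four equally sized sub-images."""
    h, w = flow.shape[:2]
    hh, hw = h // 2, w // 2
    return [flow[:hh, :hw], flow[:hh, hw:], flow[hh:, :hw], flow[hh:, hw:]]

# Placeholder dense optical flow with two channels (u, v); real flow would
# come from an optical flow algorithm.
flow = np.random.default_rng(0).normal(size=(384, 1240, 2))

global_input = average_pool(flow, 8)                                # CNN-1b style input
local_inputs = [average_pool(q, 4) for q in split_quadrants(flow)]  # CNN-4b style inputs
```

With these assumed sizes, each quadrant pooled by 4 has the same spatial size as the full image pooled by 8, so the local branches see the same number of pixels as the global one but at twice the resolution.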

The architecture of the CNN-4b network is shown in Figure 1. The first section of the network is composed of four branches, identical in complexity but trained separately, that perform the first two convolutional steps (CNN1 and CNN2). Note that each of the four quadrants of the image contains some motion information to compute a motion estimate, with ambiguity between simple turns and forward motion. We then concatenate the output of the first CNN-pooling pair with that of the second one. We do so because exploratory experiments on a down-sampled version of the OF images showed that VO estimators using only the CNN1 output, or only the cascade of the two CNNs, were both able to learn good estimators, but the VO estimator learned on the concatenation of the two outputs performed better. This result shows that CNN1 and CNN2 extract different information from the OF images. We presume that CNN1 extracts finer details, while CNN2 extracts coarser ones, and that this information is not completely overlapping. After this stage the four sets of convolved features are put back together to form an image that contains the global information and is thus able to resolve the motion ambiguities using symmetry information. The last layer is a fully connected network that uses the information of all four quadrants at both resolutions, as shown in the upper part of Figure 2.
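The idea of feeding the estimator with both the CNN1 and the CNN2 outputs can be illustrated with a toy forward pass for one branch. This is only a sketch under stated assumptions: the filter counts, kernel sizes, ReLU activation, 2 × 2 max pooling and random weights are hypothetical and do not reproduce the trained network.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_relu(x, weights):
    """'Valid' cross-correlation of x (C, H, W) with weights (F, C, k, k), then ReLU."""
    k = weights.shape[-1]
    windows = sliding_window_view(x, (k, k), axis=(1, 2))  # (C, H-k+1, W-k+1, k, k)
    out = np.einsum("chwij,fcij->fhw", windows, weights)
    return np.maximum(out, 0.0)

def max_pool(x, size):
    """Non-overlapping max pooling on a (C, H, W) array, cropping incomplete blocks."""
    c, h, w = x.shape
    h_out, w_out = h // size, w // size
    blocks = x[:, :h_out * size, :w_out * size].reshape(c, h_out, size, w_out, size)
    return blocks.max(axis=(2, 4))

def branch_features(quadrant, w1, w2):
    """Concatenate the pooled CNN1 output with the pooled CNN2 output."""
    p1 = max_pool(conv_relu(quadrant, w1), 2)  # first CNN-pooling pair (finer scale)
    p2 = max_pool(conv_relu(p1, w2), 2)        # second pair, fed by the first (coarser scale)
    return np.concatenate([p1.ravel(), p2.ravel()])

rng = np.random.default_rng(0)
quadrant = rng.normal(size=(2, 48, 155))   # hypothetical down-sampled flow quadrant
w1 = rng.normal(size=(8, 2, 5, 5)) * 0.1   # hypothetical CNN1 filters
w2 = rng.normal(size=(16, 8, 3, 3)) * 0.1  # hypothetical CNN2 filters
features = branch_features(quadrant, w1, w2)
```

In a four-branch setup, one such feature vector per quadrant (each with its own weights) would be concatenated and passed to the fully connected layer, so the estimator sees all quadrants at both resolutions.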

The last architecture, P-CNN, is a composition of the other two networks as shown in Figure 2.


We tested our architectures on the KITTI dataset. The details of the experiments are given in the paper; here we show some additional material that could not be included in the original paper.
Darkened Sequences

Darkened sequences simulate dusk conditions. Lowering contrast makes it more difficult for feature extractors to find corners, although with these values of contrast and gamma crisp shadows are still recognizable. Compared to the standard sequences, the average error on the darkened sequences is 1.39% higher for PCNN, 6.53% higher for SVR-S and 4.99% higher for VISO2. Looking at the single trajectories, the decrease in performance is larger on sequences 09 and 10. We suppose that this is due to the higher field depth and linear speeds, which make these sequences more challenging.

      VISO2                   SVR-S                   SVR-D                   PCNN
Seq   Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08    26.33      0.0389       41.81      0.1114       18.06      0.0490       8.45       0.0249
09    13.64      0.0357       19.88      0.0669       34.30      0.0550       11.03      0.0338
10    22.74      0.0352       29.28      0.0670       25.49      0.0646       20.03      0.0458
Avg   23.54      0.0387       35.04      0.1005       20.34      0.0545       10.28      0.0300
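The Trans [%] and Rot [deg/m] columns are assumed here to follow the usual KITTI odometry convention: over a sub-sequence of a given length, the estimated relative motion is compared with the ground-truth one, and the residual translation and rotation are divided by the path length. The sketch below shows that computation for a single segment; the function and helper names are hypothetical, and the full benchmark additionally averages over many start frames and segment lengths.

```python
import numpy as np

def pose(yaw_deg, x, z):
    """Hypothetical helper: 4x4 pose with a yaw about the y axis and a position (x, 0, z)."""
    a = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]]
    T[:3, 3] = [x, 0.0, z]
    return T

def segment_errors(gt_start, gt_end, est_start, est_end, length_m):
    """Return (translational error in %, rotational error in deg/m) for one segment."""
    gt_delta = np.linalg.inv(gt_start) @ gt_end      # true relative motion
    est_delta = np.linalg.inv(est_start) @ est_end   # estimated relative motion
    err = np.linalg.inv(est_delta) @ gt_delta        # residual transform
    trans_err = np.linalg.norm(err[:3, 3])
    cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return 100.0 * trans_err / length_m, rot_err_deg / length_m

# Ground truth: 100 m straight ahead. Estimate: 90 m with a 2 degree heading error.
trans_pct, rot_deg_per_m = segment_errors(
    pose(0.0, 0.0, 0.0), pose(0.0, 0.0, 100.0),
    pose(0.0, 0.0, 0.0), pose(2.0, 0.0, 90.0),
    length_m=100.0,
)
```

In this toy case the estimate falls 10 m short over 100 m, giving a translational error of 10%, and the 2 degree heading error over 100 m gives 0.02 deg/m.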
[Figures: seq 08, seq 09 and seq 10 darkened baseline; tl, rl, ts and rs darkened baseline; 08, 09 and 10 darkened]

Darkened 2 Sequences

Darkened 2 sequences simulate night conditions. At these levels of contrast and gamma the shadows in the images are very dark and a lot of small details are lost. Clearly these transforms are only an approximation of what happens with low-light vision, but they give an insight into how the estimators behave when there is a comparable loss of detail. With these sequences we see a stark difference between PCNN and the other methods. SVR-S has the lowest performance, probably because of its very simple Lucas-Kanade sparse feature extraction. However, SVR-D and VISO2 have a translational error that is nearly double that of PCNN, and a rotational error that is 30-40% higher.

      VISO2                   SVR-S                   SVR-D                   PCNN
Seq   Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08    37.82      0.0493       52.96      0.1479       30.18      0.0784       14.53      0.0366
09    30.18      0.0537       26.26      0.0842       23.66      0.0773       15.82      0.0458
10    25.97      0.1305       38.88      0.1024       24.36      0.0546       18.53      0.0464
Avg   35.28      0.0610       44.61      0.1340       28.10      0.0792       15.25
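To make the comparison on the Darkened 2 averages concrete, the short calculation below expresses each method's average translational error as a multiple of PCNN's. The assignment of the four column pairs to VISO2, SVR-S, SVR-D and PCNN (in that order) is an assumption inferred from the surrounding text, since the method labels are not present in the table itself.

```python
# Average translational errors [%] from the "Avg" row of the Darkened 2 table.
# The column-to-method mapping is assumed, not stated in the table.
avg_trans_error = {"VISO2": 35.28, "SVR-S": 44.61, "SVR-D": 28.10, "PCNN": 15.25}

reference = avg_trans_error["PCNN"]
ratio_to_pcnn = {
    method: error / reference
    for method, error in avg_trans_error.items()
    if method != "PCNN"
}

for method, ratio in ratio_to_pcnn.items():
    print(f"{method}: {ratio:.2f}x the PCNN translational error")
```

Under that mapping, the ratios come out at roughly 2.3 for VISO2, 2.9 for SVR-S and 1.8 for SVR-D.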


[Figures: seq 08, seq 09 and seq 10 darkened 2 baseline; tl, rl, ts and rs darkened 2 baseline; 08, 09 and 10 darkened 2]



Lightened Sequences

Lightened sequences simulate bright-light conditions through a low gamma correction value. These images also have very low contrast (min 0.2, max 0.7), so they are particularly challenging. The most severe performance issues are for VISO2 on sequence 10, where it fails to extract enough features in many frames, so the error is huge. As with the preceding examples, PCNN behaves better than SVR and VISO2.
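A contrast and gamma transform of the kind described for the darkened and lightened sequences can be sketched as below. The exact transform used to generate the sequences is not given here, so the formula (gamma applied first, then mapping into an output range), the function name and the gamma value of 0.5 are assumptions; only the 0.2 to 0.7 range comes from the text.

```python
import numpy as np

def adjust_contrast_gamma(image, out_low, out_high, gamma):
    """Apply gamma to intensities in [0, 1], then squeeze them into [out_low, out_high].
    gamma < 1 brightens mid-tones, gamma > 1 darkens them."""
    image = np.clip(np.asarray(image, dtype=float), 0.0, 1.0)
    return out_low + (out_high - out_low) * image ** gamma

# Hypothetical intensities: black, dark gray, white.
pixels = np.array([0.0, 0.25, 1.0])
lightened = adjust_contrast_gamma(pixels, 0.2, 0.7, gamma=0.5)  # assumed gamma
no_gamma = adjust_contrast_gamma(pixels, 0.2, 0.7, gamma=1.0)   # contrast change only
```

With these assumed settings the dark gray pixel maps to 0.45 instead of 0.325, while the overall range shrinks to half of the original one, which is the kind of detail loss that makes corner extraction harder.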

      VISO2                   SVR-S                   SVR-D                   PCNN
Seq   Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08    32.24      0.0378       46.51      0.1072       19.90      0.0595       10.16      0.0294
09    18.71      0.0268       22.85      0.0752       24.36      0.0491       20.08      0.0391
10    91.36      0.0541       43.73      0.1294       22.47      0.0734       21.02      0.0460
Avg   36.83      0.0380       40.45      0.1059       21.31      0.0617       13.51      0.0343
[Figures: seq 08, seq 09 and seq 10 lightened baseline; tl, rl, ts and rs lightened baseline; 08, 09 and 10 lightened]

Blurred Sequences - radius 3 pixels

These sequences are blurred with a small radius of 3 pixels. The most striking result is that this blur is slightly beneficial to PCNN in terms of translational error, while it is not for the other methods. In detail, PCNN's error changes by -0.33% with respect to the standard sequences, while VISO2 and SVR-S show increases of +4.99% and +8.07%. The average rotational errors change by +11%, +106% and 2.92% for PCNN, SVR-S and VISO2 respectively, showing that sparse SVR degrades the most on these sequences.
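For reference, a blur with a given pixel radius can be sketched as a separable mean filter, as below. The type of blur actually applied to the sequences is not specified here, so the box kernel, the edge replication and the function name are assumptions made only to illustrate what a "radius of 3 pixels" does to an image.

```python
import numpy as np

def box_blur(image, radius):
    """Separable mean filter on a 2D array with a (2*radius+1) window
    and edge replication at the borders."""
    size = 2 * radius + 1
    kernel = np.ones(size) / size
    padded = np.pad(image, radius, mode="edge")
    # Filter along rows, then along columns; 'valid' undoes the padding.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

# A single bright pixel shows how far the blur spreads its energy.
impulse = np.zeros((9, 9))
impulse[4, 4] = 1.0
blurred = box_blur(impulse, radius=3)
```

With radius 3 the single bright pixel is spread evenly over a 7 × 7 neighbourhood, so sharp corners are strongly attenuated while the large-scale structure of the image is preserved.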

      VISO2                   SVR-S                   SVR-D                   PCNN
Seq   Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08    23.70      0.0431       25.23      0.0674       14.41      0.0386       7.41       0.0229
09    10.99      0.0317       12.43      0.0438       21.99      0.0332       6.74       0.0253
10    25.83      0.0454       20.09      0.0463       26.74      0.0590       19.35      0.0380
Avg   23.54      0.0387       21.88      0.0623       17.52      0.0414       8.63       0.0262
[Figures: seq 08, seq 09 and seq 10 blurred s3 baseline; tl, rl, ts and rs blurred s3 baseline; 08, 09 and 10 blurred s3]

Blurred Sequences - radius 10 pixels

These sequences are blurred with a radius of 10 pixels. The results show only a slight increase in error for the dense methods, and again PCNN performs better, showing that the features it learns are robust to high levels of blur. However, SVR-S and SVR-D perform very similarly, and better than in the blur s3 case. This suggests that there is something in SVR that helps in reducing the effects of high blur. The errors of VISO2 on highly blurred images are more than doubled.

      VISO2                   SVR-S                   SVR-D                   PCNN
Seq   Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]  Trans [%]  Rot [deg/m]
08    53.25      0.0694       20.00      0.0493       14.81      0.0395       7.41       0.0229
09    38.02      0.0593       13.87      0.0503       22.06      0.0372       11.80      0.0350
10    82.37      0.2021       19.15      0.0514       26.61      0.0621       19.87      0.0416
Avg   54.32      0.0856       18.66      0.0519       17.96      0.0433       9.82       0.0286
[Figures: seq 08, seq 09 and seq 10 blurred s10 baseline; tl, rl, ts and rs blurred s10 baseline; 08, 09 and 10 blurred s10]


The following link contains the MATLAB/Python code, the Caffe net definitions and the datasets in HDF5 format:


Related Publications and Citations

Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation

author={G. Costante and M. Mancini and P. Valigi and T. A. Ciarfuglia},
journal={IEEE Robotics and Automation Letters},
title={Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation},


Evaluation of non-geometric methods for visual odometry

author = "Thomas A. Ciarfuglia and Gabriele Costante and Paolo Valigi and Elisa Ricci",
title = "Evaluation of non-geometric methods for visual odometry",
journal = "Robotics and Autonomous Systems",
volume = "62",
number = "12",
pages = "1717 - 1730",
year = "2014",
note = "",
issn = "0921-8890",
doi = "http://dx.doi.org/10.1016/j.robot.2014.08.001",