Learn an embedded representation of optical flow to improve Visual Odometry performance.

[Figure: VO embedding overview]

Abstract— This work proposes a novel deep network architecture to solve the camera Ego-Motion estimation problem. A motion estimation network generally learns features similar to Optical Flow (OF) fields starting from sequences of images. This OF can be described by a lower dimensional latent space. Previous research has shown how to find linear approximations of this space. We propose to use an Auto-Encoder network to find a non-linear representation of the OF manifold. In addition, we propose to learn the latent space jointly with the estimation task, so that the learned OF features become a more robust description of the OF input. We call this novel architecture Latent Space Visual Odometry (LS-VO). The experiments show that LS-VO achieves a considerable increase in performance with respect to the baselines, while the number of parameters of the estimation network only slightly increases.


Brief Introduction

In this work we explore a novel network architecture (LS-VO) that enhances VO estimation by learning a multi-task objective. We jointly learn the Ego-Motion estimation and an embedded space representation of the Optical Flow. This joint learning helps the network generalize its knowledge and makes it more robust to domain covariate shifts, e.g. changes in appearance, illumination conditions, noise, etc.
To test the properties of this architecture, we perform experiments on the KITTI and Malaga 2013 datasets, two datasets with a similar setup but different appearance (the Malaga images have a greater depth of field and different illumination with respect to KITTI). Some samples from the two datasets are shown below.

The experiments we perform are increasingly challenging:

  1. Standard experiments: train on some sequences, test on others.
  2. Change of car speed: we simulate a change in speed by downsampling the datasets, in particular by factors of 2 and 3 (dns2 and dns3 respectively). We train on the original images (dns1) plus dns3 and test on dns2, so the network sees no dns2 examples in the training set (see the sketch after this list).
  3. Same as the previous experiment, but we also add Gaussian blur to the test set only.
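
As a concrete sketch of the downsampling of experiment 2 (function and variable names are ours): temporal downsampling simply keeps every k-th frame, and the ground-truth poses must be subsampled the same way.

```python
def downsample_sequence(frames, factor):
    """Keep every `factor`-th frame to simulate a faster vehicle.

    dns1 is the original stream, dns2 keeps every 2nd frame and dns3 every
    3rd. Subsample the ground-truth poses identically so that the relative
    motion labels stay consistent with the new frame pairs.
    """
    return frames[::factor]

# Train on dns1 + dns3; test on dns2, whose OF dynamics are unseen:
# train = downsample_sequence(seq, 1) + downsample_sequence(seq, 3)
# test  = downsample_sequence(test_seq, 2)
```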

 

Description of the problem

Learning-based Visual Odometry (L-VO) has attracted increasing attention from the robotics community in the last few years, thanks to its desirable properties of robustness to image noise and independence from camera calibration, mostly due to the representational power of Convolutional Neural Networks (CNNs), which can complement current geometric solutions.
While current results are very promising, making these solutions easily applicable to different environments still presents challenges. One of them is that most of the approaches explored so far have not shown strong domain independence and suffer from high dataset bias, i.e. performance degrades considerably when testing on sequences whose motion dynamics and scene depth differ significantly from the training data. In the context of L-VO this bias appears as different Optical Flow (OF) field distributions in training and test data, due to differences in scene depth and in the general motion of the camera sensor.

One possible explanation for the poor performance of learned methods on unseen contexts is that most current learning architectures try to extract both visual features and the motion estimate as a single training problem, coupling the appearance and scene depth with the actual camera motion information contained in the OF input. At the same time, previous research has shown that OF fields have a bilinear dependence on motion and inverse scene depth. We suggest that this coupling is the main reason for the low generalization shown by learned algorithms so far.
Past research has shown that the high dimensional OF field, when the scene depth can be considered locally constant, can be projected onto a much lower dimensional linear space. However, when these conditions do not hold, the OF field subspace still exists but is highly non-linear.
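
This structure follows from the classical motion-field equation: for a calibrated camera, the flow at image point $(x, y)$ decomposes as

$$\dot{\mathbf{p}}(x, y) \;=\; \frac{1}{Z(x, y)}\, A(x, y)\, \mathbf{t} \;+\; B(x, y)\, \boldsymbol{\omega},$$

where $\mathbf{t}$ and $\boldsymbol{\omega}$ are the translational and angular camera velocities, $Z$ is the scene depth, and $A$, $B$ are matrices that depend only on the image coordinates and the camera intrinsics. The flow is thus bilinear in motion and inverse depth: with $Z$ locally constant it lies in the linear span of the columns of $A$ and $B$, while a varying depth bends this subspace into a non-linear manifold.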

In this work we propose to exploit this knowledge, estimating the latent OF representation with an Auto-Encoder (AE) neural network used as a non-linear subspace approximator. AE networks are able to extract latent variable representations of high dimensional inputs. Since our aim is to make the Ego-Motion estimation more robust to OF fields with highly variable distributions, we do not simply use this subspace to directly produce the motion prediction. Instead, we propose a novel architecture that jointly trains the subspace estimation and the Ego-Motion estimation, so that the two tasks are mutually reinforcing and, at the same time, better generalize the OF field representation.
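
In practice, the joint training optimizes a two-term objective; the weighting below is illustrative, and the exact formulation used in the paper may differ:

$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{motion}} \;+\; \lambda\, \mathcal{L}_{\mathrm{rec}},$$

where $\mathcal{L}_{\mathrm{motion}}$ penalizes the error of the regressed camera motion and $\mathcal{L}_{\mathrm{rec}}$ the AE reconstruction error of the input OF field.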

In the following, the network architecture and the experimental results are described. For more technical details, please check our LS-VO paper.


Network Architecture

[Figure: multi-task architecture learning the VO embedding]
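
The architecture has two branches that share a common encoder: an Auto-Encoder branch that reconstructs the input OF field from a latent code, and an estimation branch that regresses the frame-to-frame camera motion. A minimal Keras sketch of the idea follows; layer sizes, names and loss weights are ours for illustration, not the paper's exact configuration.

```python
from tensorflow.keras import layers, Model

def build_ls_vo_sketch(input_shape=(94, 300, 2), latent_dim=128):
    of_in = layers.Input(shape=input_shape, name="optical_flow")

    # Shared encoder: compress the OF field (94x300x2 -> 12x38x128 -> code).
    x = layers.Conv2D(32, 7, strides=2, padding="same", activation="relu")(of_in)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    latent = layers.Dense(latent_dim, name="latent_code")(layers.Flatten()(x))

    # Decoder branch: reconstruct the OF field from the latent code.
    d = layers.Dense(12 * 38 * 64, activation="relu")(latent)
    d = layers.Reshape((12, 38, 64))(d)
    for filters in (64, 32, 16):
        d = layers.Conv2DTranspose(filters, 3, strides=2, padding="same",
                                   activation="relu")(d)
    d = layers.Cropping2D(((1, 1), (2, 2)))(d)  # 96x304 -> 94x300
    flow_rec = layers.Conv2D(2, 3, padding="same", name="flow_rec")(d)

    # Estimation branch: regress the 6-DoF frame-to-frame motion.
    m = layers.Dense(128, activation="relu")(latent)
    motion = layers.Dense(6, name="motion")(m)  # 3 translation + 3 rotation

    model = Model(of_in, [motion, flow_rec])
    model.compile(optimizer="adam",
                  loss={"motion": "mse", "flow_rec": "mse"},
                  loss_weights={"motion": 1.0, "flow_rec": 0.1})
    return model
```

Training the two outputs jointly forces the latent code to remain a faithful description of the whole OF field, instead of overfitting to whatever features happen to predict motion on the training domain.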


Experimental Setup

Datasets

The datasets used are KITTI and Malaga 2013. Some examples of the images they contain are shown below. Both datasets consist of images taken from a moving car in the countryside and suburbs of Karlsruhe and Malaga. Clearly, the camera optics are different, and the appearance of the settings differs as well. For the Malaga dataset there is no RTK-GPS ground truth, so we use as ground truth the odometry estimate of stereo ORB-SLAM2, which has an accuracy an order of magnitude higher than most frame-to-frame monocular methods.

[Figures: sample frames from KITTI and from Malaga (CAMERA1, left camera)]

The input to the network is the optical flow computed from these images. However, since the images are quite large, they are downscaled to 300x94 pixels for KITTI and 224x170 pixels for Malaga. The Optical Flow can be computed with geometrical methods (see the Brox method) or with other deep network architectures (see FlowNet). We test LS-VO both with the FlowNet input (see the figure above) and with Brox-computed OF, and indicate the two variants as LS-VO Flownet and LS-VO BF.
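
A minimal sketch of this preprocessing step (ours, not the paper's pipeline: OpenCV's Farneback flow is used here as an accessible stand-in for Brox/FlowNet):

```python
import cv2
import numpy as np

def flow_input(prev_path, curr_path, size=(300, 94)):
    """Dense OF between two consecutive frames, resized to the network input.

    `size` is (width, height): (300, 94) for KITTI, (224, 170) for Malaga.
    """
    g1 = cv2.cvtColor(cv2.imread(prev_path), cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(cv2.imread(curr_path), cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, pyr_scale=0.5, levels=3,
                                        winsize=15, iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    # Rescale the flow vectors along with the image resolution.
    sx, sy = size[0] / flow.shape[1], size[1] / flow.shape[0]
    flow = cv2.resize(flow, size)
    flow[..., 0] *= sx
    flow[..., 1] *= sy
    return flow.astype(np.float32)  # e.g. shape (94, 300, 2) for KITTI
```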
These are examples of the OF fields used:

[Figure: example optical flow inputs]

Baselines

To test the increased robustness of this architecture, we compare against a single branch network based on [??], which we call Single Task (ST). The details of the network can be found in the paper. The weights for the Tensorflow implementation can be found at the end of this report.
In addition to the learned methods, we use two geometrical baselines to give context to our work. The two state-of-the-art methods are VISO2 Mono (VISO2-M) and ORBSLAM2 Mono (ORBSLAM2-M). Neither of these methods is a frame-to-frame estimator, since both perform some kind of optimization in addition to F2F VO. VISO2-M estimates the scale using the known height of the camera above the ground plane, while ORBSLAM2 performs keypoint tracking, local bundle adjustment and global bundle adjustment with loop closing. We leave VISO2-M as is, but we remove the loop-closing feature from ORBSLAM2, since it is a mapping feature, to make the comparison more meaningful. In addition, we tune the ORBSLAM2 parameters to the best configuration we could find for the given experiments. The modified ORBSLAM2 version can be found in the code section below.

Experiments

All the experiments use the same splits. For KITTI the training sequences are 00 to 07, and testing is done on sequences 08 to 10. For Malaga the training sequences are 01, 04, 06, 07, 08, 10 and 11, and the test sequences are 02, 03 and 09. The data alterations of experiments 2 and 3 are explained in the following. The performance metrics and plots are produced with the KITTI devkit, and the plots are reprocessed with Python plotting utilities.
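
For reference, the splits just described, written out as a small Python configuration (ours, for convenience):

```python
# Train/test splits used in all the experiments.
SPLITS = {
    "KITTI":  {"train": ["00", "01", "02", "03", "04", "05", "06", "07"],
               "test":  ["08", "09", "10"]},
    "Malaga": {"train": ["01", "04", "06", "07", "08", "10", "11"],
               "test":  ["02", "03", "09"]},
}
```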

1-  No downsampling

These are the standard experiments. The average performance of all methods is shown in the following.
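
The reported numbers are the usual KITTI devkit metrics, computed over subsequences of length $\ell \in \{100, 200, \ldots, 800\}$ m and then averaged: translational error as a percentage of distance traveled, and rotational error in degrees per meter,

$$E_t(\ell) = \frac{\lVert \hat{\mathbf{t}} - \mathbf{t} \rVert}{\ell} \cdot 100\,\%, \qquad E_r(\ell) = \frac{\angle\big(\hat{R} R^{\top}\big)}{\ell} \ \left[\mathrm{deg/m}\right].$$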

[Table: KITTI dns1 average results]

The geometrical methods perform very well on angular rate (deg/m), but have less of an edge on translation, due to the scale uncertainty of monocular VO. On the learning side, it is evident that the LS-VO architectures are better than the ST ones and achieve performance similar to the geometrical methods. Consider that the input to the learned methods is a low resolution image.

The detailed plots are the following:

KITTI DNS1

[Plots: translation/rotation error vs. speed (ts, rs) and vs. path length (tl, rl)]

 


MALAGA DNS1

[Plots: translation/rotation error vs. speed (ts, rs) and vs. path length (tl, rl)]

 

As in the case of the KITTI images, the LS-VO architectures perform better than the ST ones, and much better than the geometrical methods, except on rotational rate. Again, this is expected, since the geometrical methods have problems estimating scale. In addition, on this particular dataset good keypoint extraction was more difficult than on KITTI, because of the contrast and the long scene depth. Even without any modification to the images, this already shows that the performance of the learned methods is more consistent than that of the geometrical ones.

 
2-  With downsampling

In this set of experiments the images are downsampled by factors of 2 (5Hz) and 3 (3.33Hz). This implies an increase in the dynamics of the optical flow (i.e. a domain shift). To make the test challenging, we train on the dns1 and dns3 versions of the training sequences and test on the dns2 version of the test sequences. This makes the test sequences different not only in appearance, but also in OF dynamics. The results for both datasets are summarized in the following table and plots:

KITTI DNS2

[Plots: translation/rotation error vs. speed and path length for the dns2 test]

MALAGA DNS2

[Plots: translation/rotation error vs. speed and path length for the dns2 test]

 

It is evident that the LS-VO architecture copes better with the change in OF dynamics. Note that the axes of the lower plots reach higher speeds, because the reduced frame rate is perceived as an increase in vehicle speed by the KITTI evaluation software.

 

Among the geometrical methods, only VISO2-M was able to produce results: ORBSLAM2 failed to find keypoints, due to the increased distance between frames. We show its results in the final table, after describing the last experiment.

 
3-  With downsampling and blur

We take the same setup of the previous experiments and blur the test sets only, so during the training phase the networks never see a blurred sample. Examples of blurred images:

[Figure: blurred sample frames from KITTI and Malaga]
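
The alteration itself is simple; as a sketch (kernel size and sigma here are illustrative, the exact blur parameters may differ):

```python
import cv2

def blur_test_image(img, ksize=(7, 7), sigma=1.5):
    """Gaussian-blur a test frame; training images are left untouched."""
    return cv2.GaussianBlur(img, ksize, sigma)
```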

 

It is clear that these images pose a huge challenge to geometrical methods, and to the learned methods as well, since the resulting optical flow is very different from the flow seen during the training phase.

The results of the experiments are the following (for clarity, we also add the tables summarizing all the experiments here):

KITTI DNS2 + BLUR

[Table: summary of all KITTI experiments]

[Plots: translation/rotation error vs. speed and path length for the blurred dns2 test]

 

MALAGA DNS2 + BLUR

[Table: summary of all Malaga experiments]

[Plots: error vs. speed and path length, and estimated trajectories for test sequences 02, 03 and 09]

These experiments make it even more evident that the LS-VO architecture is more robust to domain changes than the ST variant. This means that learning an embedded representation of the OF space helps in generalizing the model. In addition, we see that the geometrical methods have huge difficulties with the domain shifts, while the learned methods are always able to propose a coherent solution.

 

Conclusions

This work presented LS-VO, a novel network architecture for estimating the Ego-Motion of a monocular camera. The architecture is composed of two branches that jointly learn a latent space representation of the input OF field and the camera motion estimate. The joint training allows the learned OF features to take into account the underlying structure of a lower dimensional OF manifold.

The proposed architecture has been tested on the KITTI and Malaga datasets, with challenging alterations, in order to test the robustness to domain variability in both appearance and OF dynamic range. Among the data-driven architectures, the LS-VO network outperformed the single branch network on most benchmarks and performed at the same level on the others. Compared to the geometrical methods, the learned methods show outstanding robustness to non-ideal conditions and reasonable performance, given that they work only on frame-to-frame estimation and on smaller input images.

The new architecture is lean, easy to train, and shows good generalization performance. The results provided here are promising and encourage further exploration of OF field latent space learning for the purpose of estimating camera Ego-Motion.

 

Downloads

The code and the dataset we used are available at https://github.com/isarlab-department-engineering/LSVO

 

Citation

The article preprint can be cited as:

@ARTICLE{costante-ciarfuglia-lsvo,
  author={G. Costante and T. A. Ciarfuglia},
  journal={IEEE Robotics and Automation Letters},
  title={LS-VO: Learning Dense Optical Subspace for Robust Visual Odometry Estimation},
  year={2018},
  volume={3},
  number={3},
  pages={1735-1742},
  keywords={Cameras;Computer architecture;Estimation;Optical flow;Robustness;Visual odometry;Computer vision for transportation;deep learning in robotics and automation;visual learning;visual-based navigation},
  doi={10.1109/LRA.2018.2803211},
  month={July},
}