Our work on domain-independent approaches for monocular depth estimation.
Abstract— Modern autonomous mobile robots require a strong understanding of their surroundings in order to operate safely in cluttered and dynamic environments. Monocular depth estimation offers a geometry-independent paradigm for detecting free, navigable space with minimal space and power consumption. These are highly desirable features, especially for micro aerial vehicles. To guarantee robust operation in real-world scenarios, the estimator must generalize well across diverse environments. Most existing depth estimators do not consider generalization and only benchmark their performance on publicly available datasets after specific fine-tuning. Generalization can be achieved by training on several heterogeneous datasets, but their collection and labeling is costly. In this work, we propose a Deep Neural Network for scene depth estimation that is trained on synthetic datasets, which allow inexpensive generation of ground-truth data. We show that this approach generalizes well across different scenarios. In addition, we show how adding Long Short-Term Memory (LSTM) layers to the network helps to alleviate, in sequential image streams, some of the intrinsic limitations of monocular vision, such as global scale estimation, with low computational overhead. We demonstrate that the network generalizes well to different real-world environments without any fine-tuning, achieving performance comparable to state-of-the-art methods on the KITTI dataset.
As autonomous vehicles become more common in applications outside the research laboratory, the requirements for safe and optimal operation of such systems become more challenging. In particular, the ability to detect and avoid static or moving obstacles is crucial for field operations of the vast majority of ground and low-altitude flight vehicles. Depth information can be used to estimate proximity to obstacles and, more generally, to obtain an understanding of the surrounding 3D space. Vision-based systems, both monocular and stereo, can provide dense depth maps and are well suited for deployment on small vehicles. A primary shortcoming, though, is that the detection range and accuracy of stereo cameras are limited by the camera set-up and baseline. Unlike stereo systems, monocular systems do not make specific assumptions about camera motion or set-up. On the downside, however, these systems rely on the quality and variety of the training set and ground truth provided, and often cannot adapt to unseen environments. In this work, by using a deeper architecture and an extended synthetic dataset that allows generalization from synthetic data to real, unseen sequences, we take an important step towards domain independence for vision-based depth estimation. With robotic operations in mind, we reduce the computational complexity of the network by removing its dependence on optical flow, even though optical flow often acts as an environment-invariant feature. To balance this loss of information, we exploit the sequential nature of the input stream by using Long Short-Term Memory (LSTM) layers, a specific form of Recurrent Neural Network (RNN).
We propose as a baseline method a fully convolutional architecture, structured in an encoder-decoder fashion. This is a very popular architectural choice for several pixel-wise prediction tasks, such as optical flow estimation or semantic segmentation. In our proposed network, the encoder section corresponds to the popular VGG network, pruned of its fully connected layers. We initialize the encoder weights with the VGG model pre-trained for image classification. Models trained on huge image classification datasets have proven to act as excellent general-purpose feature extractors: convolutional layers closer to the input of the net extract low-level features, while layers closer to the output extract high-level, more task-dependent descriptors. During training, the weights of the first 8 convolutional layers of the VGG net are kept fixed; the remaining layers are fine-tuned. The decoder section of the network is composed of 2 deconvolutional layers and a final convolutional layer that outputs the predicted depth at the original input resolution. These layers are trained from scratch, using random weight initialization.
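The encoder-decoder structure described above can be sketched as follows. This is a minimal PyTorch sketch under our own assumptions (the paper does not specify a framework, and the decoder kernel sizes here are illustrative, chosen only so the two deconvolutions undo the encoder's 32x downsampling): a VGG16-style convolutional trunk as encoder with its first 8 convolutional layers frozen, followed by two deconvolutional layers and a final convolution producing a one-channel depth map at the input resolution.

```python
import torch
import torch.nn as nn

def vgg16_trunk():
    """VGG16 convolutional trunk (fully connected layers pruned).
    In the actual setup this would be initialized from ImageNet-pretrained
    weights; here it is left randomly initialized for brevity."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

class DepthFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = vgg16_trunk()
        # Freeze the first 8 convolutional layers; the rest are fine-tuned.
        frozen = 0
        for layer in self.encoder:
            if isinstance(layer, nn.Conv2d) and frozen < 8:
                frozen += 1
                for p in layer.parameters():
                    p.requires_grad = False
        # Decoder, trained from scratch: two deconvolutions (strides 4 and 8)
        # restore the 32x-downsampled feature map to input resolution, and a
        # final convolution outputs a single depth channel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, kernel_size=4, stride=4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, kernel_size=8, stride=8),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

For a 224x224 input, the encoder yields a 7x7x512 feature map, and the two deconvolutions bring the prediction back to 224x224.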
Any monocular, single-image depth estimation method suffers from the infeasibility of correctly estimating the global scale of the scene. Learning-based methods try to infer global scale from the learned proportions between objects depicted in the training dataset. This paradigm inevitably fails when previously unseen environments are evaluated or when the camera focal length is modified. We can try to correct these failures by exploiting the sequential nature of the image stream captured by a vision module mounted on a deployed robot. Recurrent neural networks (RNNs) are typically used in estimation tasks where long-term temporal dependencies between inputs matter: text and speech analysis, action recognition in video streams, person re-identification. Their output is a function of both the current input fed into the network and the past output, so that memory is carried forward through time as the sequence progresses. Long Short-Term Memory (LSTM) networks are a special kind of recurrent neural network introduced by Hochreiter & Schmidhuber in 1997 to overcome some of the main issues of RNNs, such as vanishing gradients during training, which made them very challenging to use in practical applications. Memory in LSTMs is maintained in a gated cell where information can be read, written, or deleted. During training, the cell autonomously learns how to treat incoming and stored information. We insert two LSTM layers between the encoder and decoder sections of the previously introduced FCN. Our motivation is to refine the features extracted by the encoder according to the information stored in the LSTM cells, so that the decoder section can return a more coherent depth estimation. The proposed LSTM network is depicted in Image 3. Dropout is applied before, after, and between the two LSTM layers to improve regularization during training.
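The recurrent bottleneck can be sketched like this. Again a PyTorch sketch under stated assumptions: the layer sizes, dropout rate, and the flatten/project scheme are illustrative choices of ours, not the paper's exact configuration. Each frame's encoder feature map is flattened into a vector, refined by two LSTM layers (with dropout before, between, and after them, as described above), then projected back to the feature-map shape the decoder expects.

```python
import torch
import torch.nn as nn

class LSTMBottleneck(nn.Module):
    """Two LSTM layers between encoder and decoder of the FCN.
    All dimensions below are illustrative assumptions."""
    def __init__(self, channels=512, height=4, width=8, hidden=1024):
        super().__init__()
        self.shape = (channels, height, width)
        feat = channels * height * width
        self.drop_in = nn.Dropout(0.5)          # dropout before the LSTMs
        # num_layers=2 with dropout=0.5 also applies dropout *between*
        # the two stacked LSTM layers.
        self.lstm = nn.LSTM(feat, hidden, num_layers=2, dropout=0.5,
                            batch_first=True)
        self.drop_out = nn.Dropout(0.5)         # dropout after the LSTMs
        self.proj = nn.Linear(hidden, feat)     # back to feature-map size

    def forward(self, x, state=None):
        # x: (batch, time, C, H, W) sequence of encoder feature maps.
        b, t = x.shape[:2]
        seq = self.drop_in(x.flatten(2))        # (batch, time, C*H*W)
        out, state = self.lstm(seq, state)      # state carries scene memory
        out = self.proj(self.drop_out(out))
        return out.view(b, t, *self.shape), state
```

At deployment time the returned `state` is fed back in with each new frame, so the memory accumulated over the image stream persists across calls.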
We test the generalization capability of our proposed networks on the KITTI dataset and on two datasets we gathered ourselves: one in a dense forest in the surroundings of Zurich, Switzerland, and one in the countryside near Perugia, Italy.
This README contains instructions and details about our code.
Zurich Forest Benchmark
The Optical Flow images and labels in HDF5 format can be found here:
Related Publications and Citations
Towards Domain Independence for Learning-Based Monocular Depth Estimation
M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, D. Scaramuzza
IEEE Robotics and Automation Letters (2017)
Fast Robust Monocular Depth Estimation for Obstacle Detection with Fully Convolutional Networks
M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia
Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on