Our work on robust Visual Place Recognition.

Abstract - Visual self-localization in unknown environments is a crucial capability for an autonomous robot. The place recognition module is a fundamental component of SLAM systems, as incorrect loop closures may result in severe errors in trajectory estimation. Real-life scenarios often pose critical challenges for autonomous vision-based localization, such as robustness to viewpoint and appearance changes and to perceptual aliasing. In the context of vision-based place recognition, we aim to address these issues. For the perceptual aliasing problem, we propose a set of visual word weights, learned offline according to a discriminative criterion (the large margin paradigm), for appearance-based methods that employ the bag-of-words approach to recognize locations. To address the issues due to appearance and viewpoint changes, we propose strategies that model the scene by preserving its geometric and semantic structure and, at the same time, improve appearance invariance through a robust visual representation. Our methods rely on high-level visual landmarks consisting of appearance-invariant descriptors extracted by a pre-trained Convolutional Neural Network (CNN) from image patches. In addition, during exploration, the landmarks are organized by incrementally building a graph that, at query time, is exploited to retrieve candidate matching locations, improving robustness to viewpoint changes.


In the last decade, vision-based navigation systems have achieved impressive results, considerably extending the application area of many robotic platforms. However, it is well known that, during long term operations, the localization performance may drop due to the drift of the estimation procedures, which can lead to a critical failure of most state-of-the-art systems. As a consequence, place recognition capabilities are crucial functions for loop closure detection and to increase the robustness of the overall estimation process.

Most existing place recognition strategies have been developed on image sequences characterized by small viewpoint and lighting variations and, within these scenarios, the results obtained are very promising. However, these simplified conditions do not hold in real-life autonomous exploration contexts, where the visual scene is typically affected by a number of challenging problems. For instance, seasonal or weather changes and natural or artificial daily illumination variations may severely affect the global appearance of the scene; further, dynamic elements (e.g., pedestrians and vehicles) or new static objects may cause appearance changes, since they can occlude or alter portions of the scene. In addition, traversing the same environment with different orientations changes the scene viewpoint, which may significantly alter the relative position of objects in the scene. Place recognition algorithms that exploit low-level visual features are typically very sensitive to perceptual aliasing and strong image variations and, therefore, do not provide good place recognition performance.

Many state-of-the-art visual place recognition systems are currently based on the bag-of-words (BoW) paradigm. Typically, in the BoW representation, pairs of images are compared using the term frequency-inverse document frequency (TF-IDF) weighting scheme, i.e., words of the dictionary contribute to the similarity score in a way inversely proportional to their occurrence. In practice, when comparing two histograms, very common words are down-weighted since they are considered not informative with respect to specific locations. While the TF-IDF scheme provides good place recognition performance at low computational cost, it relies on a basic intuition: the occurrence of a word in the vocabulary reflects its importance. However, this assumption is often violated in practice: for example, for a robot moving in an office scenario, local features associated with walls and doors can be very common yet still important for recognizing a specific location (e.g., a corridor). In these situations, a TF-IDF approach fails. To overcome this drawback, in this work we propose a different strategy to compute the word weights.
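The TF-IDF weighting described above can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not our actual implementation: words occurring in many images receive a low IDF weight and thus contribute little to the similarity score.

```python
import numpy as np

def tfidf_weights(word_counts):
    """Compute IDF weights from per-image visual-word counts.

    word_counts: (n_images, n_words) array of raw word occurrences.
    Returns the IDF weight of each word: common words get low weight.
    """
    n_images = word_counts.shape[0]
    # number of images in which each word appears at least once
    doc_freq = np.count_nonzero(word_counts > 0, axis=0)
    return np.log(n_images / np.maximum(doc_freq, 1))

def bow_similarity(counts_a, counts_b, idf):
    """Cosine similarity between two TF-IDF weighted BoW histograms."""
    a = counts_a / max(counts_a.sum(), 1) * idf  # tf * idf
    b = counts_b / max(counts_b.sum(), 1) * idf
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Note how a word that appears in every image (e.g., a wall or door feature) gets an IDF weight of exactly zero, which is precisely the failure mode discussed above when such a word is in fact discriminative.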

Other recent works have shown that high-level visual features, i.e., semantic cues, provide a more robust representation of the scene, since they also encode information about object categories and their mutual relations. In fact, semantic features provide a better characterization of the scene, which may facilitate the place recognition process by an autonomous robot. However, detecting different objects may not be enough to unequivocally identify a specific place (e.g., cars and buildings may not be discriminative in an urban environment). In these scenarios, the capability to discriminate between different spatial configurations and different views of the objects is crucial. Motivated by these considerations, we design vision-based place recognition systems that rely on a graph of semantic visual objects built incrementally during navigation.

System Architecture


To overcome the typical TF-IDF drawback, we propose a different strategy to compute the word weights. Given a training set of images, we aim to learn the word weights such that images depicting the same location are constrained to be more similar to each other than to images of different places. Inspired by recent works on distance learning, we formalize this task as a convex optimization problem and propose an efficient algorithm to solve it. We further show that the learned weights can be successfully used for detecting loop closures in the context of visual SLAM.
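As a rough illustration of the large-margin idea only (not the exact convex formulation or solver used in our work), one can learn non-negative word weights by projected subgradient descent on a triplet hinge loss: for each triplet, the weighted similarity of an image to a same-place image must exceed its similarity to a different-place image by a margin.

```python
import numpy as np

def learn_word_weights(hists, triplets, lr=0.01, epochs=50, margin=0.1):
    """Learn per-word weights w >= 0 such that, for each triplet
    (anchor, positive, negative), the weighted similarity to the
    positive (same place) exceeds that to the negative by a margin.
    Plain projected subgradient descent on the hinge loss.
    """
    n_words = hists.shape[1]
    w = np.ones(n_words)
    for _ in range(epochs):
        for a, p, n in triplets:
            # weighted similarity: sum_k w_k * h_a[k] * h_x[k]
            s_pos = w @ (hists[a] * hists[p])
            s_neg = w @ (hists[a] * hists[n])
            if s_pos - s_neg < margin:  # hinge constraint violated
                grad = hists[a] * hists[n] - hists[a] * hists[p]
                w = np.maximum(w - lr * grad, 0.0)  # project onto w >= 0
    return w
```

The effect matches the intuition in the text: a word that consistently pushes the anchor toward the wrong place is driven to a low (possibly zero) weight, regardless of how frequent or rare it is in the vocabulary.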

In addition, we work on the viewpoint and appearance robustness of the place recognition module. To improve robustness with respect to appearance changes, a graph is built in such a way that its nodes collect similar image patches, represented by high-level descriptors extracted from an inner convolutional layer of a publicly available CNN trained for object recognition.
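A minimal sketch of how such nodes could be grown incrementally, assuming patch descriptors are already L2-normalized CNN activations. The threshold `tau` and the centroid update are illustrative choices, not our exact procedure:

```python
import numpy as np

def assign_patches_to_nodes(descriptors, nodes, tau=0.8):
    """Incrementally group patch descriptors into landmark nodes.

    descriptors: iterable of L2-normalized CNN patch descriptors
                 (e.g., taken from an inner conv layer of a pre-trained net).
    nodes: list of node centroids (L2-normalized); updated in place.
    A patch joins the most similar node if cosine similarity >= tau,
    otherwise it starts a new node. Returns the node index per patch.
    """
    assignments = []
    for d in descriptors:
        if nodes:
            sims = [float(d @ c) for c in nodes]
            best = int(np.argmax(sims))
            if sims[best] >= tau:
                # running update of the node centroid, renormalized
                c = nodes[best] + d
                nodes[best] = c / np.linalg.norm(c)
                assignments.append(best)
                continue
        nodes.append(d / np.linalg.norm(d))
        assignments.append(len(nodes) - 1)
    return assignments
```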

Furthermore, to handle viewpoint changes and to ease the place recognition task, the edges of the graph are used to encode covisibility information, i.e., edges are created to connect objects that have been observed together from the same point of view. The result is a covisibility graph that takes into account mutual object arrangements. Place recognition is then addressed as a graph-matching problem, exploiting both the CNN features and the graph structure.
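The covisibility bookkeeping can be illustrated as follows. This is a simplified sketch where landmark ids per frame stand in for the detected objects; edge weights count how often two landmarks are seen together:

```python
from collections import defaultdict
from itertools import combinations

def build_covisibility_graph(observations):
    """Build a covisibility graph from per-frame landmark observations.

    observations: list of sets, each containing the landmark ids seen
    in one frame. Two landmarks are linked when they are observed
    together; edge weights count how often that happens.
    Returns adjacency: {landmark: {neighbor: covisibility count}}.
    """
    adj = defaultdict(lambda: defaultdict(int))
    for frame in observations:
        for u, v in combinations(sorted(frame), 2):
            adj[u][v] += 1
            adj[v][u] += 1
    return adj
```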

In addition, the graph is exploited to build virtual locations with a new strategy that relies only on algebraic properties of the graph. Virtual locations represent synthetic views of the scene that are not present in the image database. As a consequence, the algorithm has the potential to recognize places even in the presence of strong viewpoint changes.  
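For illustration only, a virtual location can be thought of as the union of the landmarks of overlapping real locations; `min_shared` is a hypothetical parameter, and our actual construction relies on the graph structure rather than on this naive pairwise merge:

```python
def virtual_locations(locations, min_shared=2):
    """Synthesize virtual locations from overlapping real ones.

    locations: list of sets of landmark ids, one per database image.
    Two locations that share at least `min_shared` landmarks are merged
    into a virtual location: a synthetic view containing the union of
    their landmarks, which may match a query seen from a new viewpoint.
    """
    virtuals = []
    for i in range(len(locations)):
        for j in range(i + 1, len(locations)):
            shared = locations[i] & locations[j]
            if len(shared) >= min_shared:
                virtuals.append(locations[i] | locations[j])
    return virtuals
```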


Here we show some of the results obtained with our approaches. Further details can be found in the related publications below.

Learned weights

We applied the proposed discriminative BoW strategy to detect loop closures and compared its performance with that of the standard BoW approach with the TF-IDF weighting scheme. The learned weights greatly increase precision with only a slight decrease in recall on all the considered test datasets. It is noteworthy that our approach guarantees better performance, and in particular higher precision at similar recall, even when using a smaller vocabulary.  

[Figures: im1tfidf, im2tfidf, im3lws] The first two images show the most important features according to the TF-IDF weighting scheme; the third shows the most important features found by our learning algorithm.
Covisibility graph of semi-semantic features

Experimental validation carried out on four public datasets has shown that, in terms of precision and recall, our approach provides performance comparable to (or better than) state-of-the-art place recognition techniques that do not rely on any graph representation of the environment. In addition, the construction of virtual locations is useful in specific but critical situations, such as turning at 90° corners or traversing a stretch of road in opposite directions. In these scenarios, virtual location construction improves the precision of the loop closure detection system.  

[Figures] GPS positions of the created virtual locations in the New College, City Centre, Malaga 6L, and IDOL (dum_sunny+dum_cloudy) datasets.



Coming soon.

Related Publications and Citations

Robust visual semi-semantic loop closure detection by a covisibility graph and CNN features
Cascianelli, S., Costante, G., Bellocchio, E., Valigi, P., Fravolini, M. L., and Ciarfuglia, T. A.
Robotics and Autonomous Systems

A Robust Semi-Semantic Approach For Visual Localization In Urban Environment
Cascianelli, S., Costante, G., Bellocchio, E., Valigi, P., Fravolini, M. L., and Ciarfuglia, T. A.
2016 IEEE International Smart Cities Conference (ISC2)

Transferring knowledge across robots: A risk sensitive approach
Costante, G., Ciarfuglia, T. A., Valigi, P., and Ricci, E.
Robotics and Autonomous Systems

A Transfer Learning Approach for Multi-Cue Semantic Place Recognition
Costante, G., Ciarfuglia, T. A., Valigi, P., and Ricci, E.
2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

A discriminative approach for appearance based loop closing
Ciarfuglia, T. A., Costante, G., Valigi, P., and Ricci, E.
2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)