The Intelligent Systems, Automation and Robotics Laboratory

Artificial Perception: from current state of the art in research and industry to the next frontiers

Workshop at the 4th Italian Conference for Robotics and Intelligent Machines (I-RIM)

Istituto Superiore Antincendi (Room 21) – Rome, Italy

October 7, 2022 (9.30 – 13.00)



Workshop Organizers


Gabriele Costante (1)

Matteo Matteucci (2)

Ettore Stella (3)


(1) Department of Engineering, University of Perugia

(2) Department of Electronics, Information and Bioengineering, Politecnico di Milano




The capability to extract information from raw data collected by contactless sensors is one of the cornerstones of developing AI-based solutions. This workshop aims to gather knowledge and expertise from the worlds of research and industry on technologies for interpreting sensor data and extracting information relevant to the application context of interest. The data sources can be heterogeneous, and the methodologies may include model-based, data-driven and deep learning-based strategies, also considering multi-sensor setups.

List of Speakers (Title and Abstract)

Luca di Giammarino (Sapienza Università di Roma, Italy)


MD-SLAM: Multi-cue Direct SLAM

Simultaneous Localization and Mapping (SLAM) systems are fundamental building blocks for any autonomous robot navigating in unknown environments. The SLAM implementation heavily depends on the sensor modality employed on the mobile platform. For this reason, assumptions on the scene’s structure are often made to maximize estimation accuracy. This talk presents a novel direct 3D SLAM pipeline that works independently for RGB-D and LiDAR sensors. Building upon prior work on multi-cue photometric frame-to-frame alignment, our proposed approach provides an easy-to-extend and generic SLAM system. Our pipeline requires only minor adaptations within the projection model to handle different sensor modalities. We couple a position tracking system with an appearance-based relocalization mechanism that handles large loop closures. Loop closures are validated by the same direct registration algorithm used for odometry estimation. We present comparative experiments with state-of-the-art approaches on publicly available benchmarks using RGB-D cameras and 3D LiDARs. Our system performs well on heterogeneous datasets compared to other sensor-specific methods while making no assumptions about the environment. Finally, we release an open-source C++ implementation of our system.
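The released implementation is in C++; purely as an informal illustration, the Python sketch below (hypothetical names, toy camera parameters) shows the core idea of direct alignment described above: a photometric residual in which only the projection function changes between sensor modalities (pinhole for RGB-D, spherical for a rotating 3D LiDAR).

```python
import numpy as np

def pinhole_project(p, fx=300.0, fy=300.0, cx=160.0, cy=120.0):
    """Project a 3D point with a pinhole model (RGB-D camera)."""
    x, y, z = p
    return np.array([fx * x / z + cx, fy * y / z + cy])

def spherical_project(p, width=360, height=180):
    """Project a 3D point with a spherical model (rotating 3D LiDAR)."""
    x, y, z = p
    azimuth = np.arctan2(y, x)                    # in [-pi, pi]
    elevation = np.arcsin(z / np.linalg.norm(p))  # in [-pi/2, pi/2]
    u = (azimuth + np.pi) / (2 * np.pi) * width
    v = (elevation + np.pi / 2) / np.pi * height
    return np.array([u, v])

def photometric_residual(intensity_ref, intensity_cur, points, T, project):
    """Sum of squared intensity differences after warping reference
    points into the current frame with pose T; `project` is the only
    sensor-specific ingredient."""
    residual = 0.0
    R, t = T[:3, :3], T[:3, 3]
    for p, i_ref in zip(points, intensity_ref):
        u, v = project(R @ p + t)
        residual += (intensity_cur(u, v) - i_ref) ** 2
    return residual
```

Minimizing this residual over the pose T (e.g. with Gauss-Newton) yields frame-to-frame odometry; switching from RGB-D to LiDAR only swaps the `project` argument.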

Alberto Del Bimbo (University of Florence, Italy)


Learning social behaviors for trajectory prediction with Memory Augmented Networks

Effective modeling of human interactions is of utmost importance when forecasting their future motion trajectories. Typically, in social contexts, people obey unwritten rules such as collision avoidance or group following, and their behaviors determine the motion trajectories of the surrounding agents. In this talk, we address trajectory prediction of people in social contexts by exploiting a memory-augmented neural network model. We present an end-to-end trainable neural network augmented with an external memory module, in which the relevant items of information for each agent are stored and usefully retrieved. Following this approach, the network is able to effectively learn the set of social rules that determine the behaviors of pedestrians, and therefore predict their motion trajectories. We report state-of-the-art results on the ETH/UCY and SDD datasets and show that this approach also provides explainable predictions by design.
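This is not the authors' actual architecture; the toy sketch below (hypothetical names) only illustrates the central memory operation such models rely on: retrieving the stored future encodings whose associated past encodings best match the encoding of the observed trajectory.

```python
import numpy as np

def cosine_similarity(query, keys):
    """Cosine similarity between one query vector and each memory key."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return k @ q

def read_memory(past_encoding, memory_keys, memory_values, top_k=3):
    """Retrieve the `top_k` stored future encodings whose past
    encodings (keys) best match the observed-trajectory encoding."""
    scores = cosine_similarity(past_encoding, memory_keys)
    best = np.argsort(scores)[::-1][:top_k]
    return memory_values[best], scores[best]
```

In the full model, a decoder would combine each retrieved future encoding with the current observation to produce one multimodal trajectory hypothesis per memory hit.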

Elisa Ricci (Department of Information Engineering and Computer Science, University of Trento, Italy - Fondazione Bruno Kessler (FBK), Trento, Italy)


Multi-modal human behaviour analysis for social robotics

Automated analysis of social interactions is critical for a number of applications such as surveillance, robotics and social signal processing. In particular, the automatic analysis of conversational groups is of fundamental importance for developing technologies for Human-Robot Interaction and requires addressing several tasks simultaneously (e.g., people tracking, voice recognition, head and body pose estimation, F-formation detection, emotion estimation) while processing multimodal data gathered from different sensors. In this talk I will present some of our recent works in this area, focusing on the methods and technologies developed during the EU project SPRING.

Alessandro Betti (University of Côte d'Azur, France) - Stefano Melacci (University of Siena, Italy)


Lifelong Learning from Video Streams

The remarkable progress in computer vision on object recognition achieved by deep convolutional neural networks in the last few years is strongly connected with the availability of huge labeled datasets paired with powerful computational resources. Clearly, the corresponding supervised communication protocol between machines and visual environments is far from natural. Current deep learning approaches based on supervised images mostly neglect the crucial role of temporal coherence: when computer scientists began to cultivate the idea of interpreting natural video, they removed time, the connecting wire between frames, in order to simplify the problem. As soon as we decide to frame visual learning processes in their own natural video environment, we realize that perceptual visual skills cannot emerge from massive supervision on different object categories. Foveated animals move their eyes, which means that even still images are perceived as patterns that change over time. Since information is intertwined with motion, we propose to explore the consequences of stressing the assumption that the focus on motion is in fact nearly “all that we need”. When trusting this viewpoint, one realizes that time plays a crucial role, and the underlying computational model must refer to single pixels at a certain time.
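As an informal illustration of the role of temporal coherence (not the speakers' model), one can imagine a loss that penalizes changes in a per-pixel representation between consecutive frames wherever no motion occurs, so that the representation of still regions stays stable over time:

```python
import numpy as np

def temporal_coherence_loss(feat_t, feat_tp1, moving_mask):
    """Mean squared change of a per-pixel (scalar) feature map between
    consecutive frames, computed only over pixels that are NOT moving:
    still regions should keep a stable representation over time."""
    still = ~moving_mask
    if not still.any():
        return 0.0
    return float(((feat_tp1 - feat_t) ** 2)[still].mean())
```

A motion-centric learner would combine such a stability term with terms that encourage the representation to follow pixels along their motion.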

Davide Scaramuzza (University of Zurich, Switzerland)

Learning Autonomous Vision-based Flight

I will summarize our latest research in learning deep sensorimotor policies for agile vision-based quadrotor flight. Learning sensorimotor policies represents a holistic approach that is more resilient to noisy sensory observations and imperfect world models. However, training robust policies requires a large amount of data. I will show that simulation data is enough to train policies that transfer to the real world without fine-tuning. We achieve one-shot sim-to-real transfer through the appropriate abstraction of sensory observations and control commands. I will show that these learned policies enable autonomous quadrotors to fly faster and more robustly than before, using only onboard cameras and computation. Applications include acrobatics, high-speed navigation in the wild, and autonomous drone racing.
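The abstraction of sensory observations mentioned above can be caricatured as follows (a toy sketch, not the group's actual pipeline): the raw image is reduced to a few sparse geometric features that look alike in simulation and reality, and only those reach the learned policy.

```python
import numpy as np

def abstract_observation(image, num_features=4):
    """Toy abstraction layer: reduce a raw grayscale image to the
    pixel coordinates of its brightest responses, discarding the
    appearance details (textures, lighting) that differ most between
    simulation and the real world."""
    flat = np.argsort(image, axis=None)[::-1][:num_features]
    rows, cols = np.unravel_index(flat, image.shape)
    return np.stack([rows, cols], axis=1).astype(float)
```

Because the policy never sees raw pixels, a policy trained on simulated images whose bright responses land in the right places can, in principle, transfer to real images without fine-tuning.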

Riccardo Imana (IMAGE S s.p.a., Italy)

Monitoring the territory and the environment with Drones and Hyperspectral cameras

Monitoring the territory and the environment requires advanced solutions and complex processes. The combination of UAS (both quadcopters and fixed-wing drones) and hyperspectral cameras makes it possible to classify species in forests and agricultural fields, obtain rich information on the health of vegetation, and locate and monitor parasites. Regarding environmental monitoring, hyperspectral technology offers important help in localizing heavy metals or pollutants in soil or water. Last but not least, it can help prevent fires and safeguard and/or protect particular species. The presentation will show application examples and how this combination of technologies represents the future for the analysis of large and complex areas. (Demo on site.)
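As a concrete, generic example of the kind of vegetation-health information a hyperspectral cube provides (this is a standard index, not IMAGE S software), the NDVI can be computed directly from the bands closest to the red (~670 nm) and near-infrared (~800 nm) wavelengths:

```python
import numpy as np

def band_index(wavelengths, target_nm):
    """Index of the spectral band closest to a target wavelength (nm)."""
    return int(np.argmin(np.abs(np.asarray(wavelengths) - target_nm)))

def ndvi(cube, wavelengths):
    """Normalized Difference Vegetation Index per pixel, from a
    hyperspectral cube of shape (rows, cols, bands)."""
    red = cube[:, :, band_index(wavelengths, 670.0)]
    nir = cube[:, :, band_index(wavelengths, 800.0)]
    return (nir - red) / (nir + red + 1e-9)  # epsilon avoids 0/0
```

Healthy vegetation reflects strongly in the near-infrared and absorbs in the red, so NDVI values close to 1 indicate vigorous plants; species classification and parasite detection exploit many more bands than these two.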

Andrea Prati (Dipartimento di Ingegneria e Architettura, Università degli Studi di Parma, Italy)


Deep Learning and Generative Models for images: are they ready for industrial applications?

Thanks to advancements in hardware, software and research, current deep learning (DL) architectures are obtaining unprecedented results in many applications, especially when applied to images and videos. Generative models (GMs) are at an earlier stage, yet they can already produce impressive and realistic (fake) results. This talk will briefly present these techniques and discuss how ready they are to be deployed in real industrial applications.