LiSA – Listen, See and Act: fusing audio-video cues to perceive visible and invisible events and develop perception-to-action solutions for autonomous vehicles
Keywords: Multi-modal systems – driverless vehicles – audio/visual perception – perception-to-action – deep learning – deep reinforcement learning
Project Summary: The development of a more circular economy is a key stepping stone towards reducing the environmental impact of production and distribution systems, guaranteeing the supply chains of strategic materials and favouring economic growth. In this context, transportation services play a crucial role, and enhancing them to favour sustainability, inclusiveness and safety is fundamental. To this aim, transportation systems are undergoing a deep revolution, and new mobility paradigms are currently being devised, with several successful solutions already demonstrated. Most of the advancements in transportation systems have been made possible by Artificial Intelligence (AI) methodologies, which have facilitated the development of autonomous vehicles (AVs). However, despite the impressive results demonstrated, many open challenges still prevent AVs from being fully integrated into mobility services. Most of the limitations affecting these systems concern the capability to effectively perceive the entities and events of the environment and to compute timely navigation and reaction commands. Besides hindering the robustness of the overall navigation routines and, therefore, reducing the actual possibility of deploying AVs, these limitations have a significant impact on how people perceive their reliability. Motivated by these considerations, this proposal aims to empower AVs with novel perception-to-action capabilities that rely on multiple, heterogeneous data sources. In particular, combining visual and audio information allows the computation of a more robust, efficient and descriptive representation of the vehicle’s surroundings. The key intuition behind the proposed solution draws inspiration from human behaviour: while driving, humans rely heavily on auditory perception to predict and localise events that cannot be directly observed (e.g., the siren of an ambulance approaching an intersection).
Hearing, indeed, provides omnidirectional perception, overcoming the limitations imposed by occlusions and thus enhancing the driver’s awareness of the scene. Being able to predict the occurrence of visual events long before they can actually be seen increases the driver’s “time allowance” to react properly to those events. Such allowance is even more significant in the case of AVs, operating in either a fully or semi-autonomous regime. Yet, the use of sound remains greatly under-investigated in the intelligent transportation systems community, and driverless vehicles are still largely deaf.
This project aims to tackle this problem by:
- Detecting and localising acoustic events in urban scenarios.
- Generating joint representations of audio-visual events to enable the development of multi-modal systems that model the spatiotemporal relationships of audio and visual inputs.
- Developing perception-to-action methodologies to map audio-visual cues to vehicle control commands and improve autonomous navigation capabilities.
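As a minimal illustration of the joint audio-visual representation objective above, the sketch below fuses an audio embedding and a visual embedding by concatenation followed by a linear projection (a simple late-fusion scheme). All dimensions, weights and function names are illustrative assumptions for exposition, not the project's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_audio_visual(audio_emb, visual_emb, w, b):
    """Late fusion: concatenate per-modality embeddings, then project.

    audio_emb  : (d_a,) audio feature vector (e.g. from a spectrogram encoder)
    visual_emb : (d_v,) visual feature vector (e.g. from a CNN backbone)
    w, b       : projection weights (d_a + d_v, d_out) and bias (d_out,)
    Returns a joint embedding of shape (d_out,).
    """
    joint = np.concatenate([audio_emb, visual_emb])
    return joint @ w + b

# Illustrative dimensions (hypothetical, not taken from the proposal).
d_a, d_v, d_out = 128, 512, 64
w = rng.standard_normal((d_a + d_v, d_out)) * 0.01
b = np.zeros(d_out)

audio_emb = rng.standard_normal(d_a)    # stand-in for an acoustic-event embedding
visual_emb = rng.standard_normal(d_v)   # stand-in for a visual-scene embedding

joint_emb = fuse_audio_visual(audio_emb, visual_emb, w, b)
print(joint_emb.shape)  # (64,)
```

In a full system, the joint embedding would feed a downstream policy (e.g. trained with deep reinforcement learning) that maps perception to control commands; more expressive fusion schemes, such as cross-modal attention, could replace the concatenation step.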