Our attempt to enable effective Human-Robot Interaction (HRI) by combining Computer Vision and Natural Language Processing.

We belong to the Robotics research community and work mainly on Machine Learning and Computer Vision to enable robots to understand and interact with their environment. In recent years, we have developed a strong interest in themes connected to Human-Robot Interaction.
 

“We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? [...] It can [also] be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.”
 
Following the inspiring quote from Alan M. Turing above, we believe that this conjecture about machine understanding is worth exploring now that the computational capabilities of GPUs make it feasible to develop algorithms for this problem.

This topic poses questions that are as interesting as they are challenging. The first concerns the form of intelligent behaviour to investigate, i.e., on what basis one can assess that a robot understands what is happening in its environment. To us, a reasonable test is the ability to produce a natural language description of generic visual sequences. The description can be seen as a manifestation of what the agent learned from the visual and textual data it processed during training, and of what it learned to be important to describe. In addition, a natural language description is a good basis for natural language question answering about the events that the agent saw. It therefore also offers a friendly interface for non-expert users, who could then interact effectively with their home robot in the near future.

 

Download

Our collected ISARLab-VD dataset

The complete results corpus for the paper submitted to RA-L, "Towards effective Human-Robot Interaction via Full-GRU Natural Language Video Description"

The code for the paper submitted to RA-L, "Towards effective Human-Robot Interaction via Full-GRU Natural Language Video Description": coming soon