Open access
Author
Date
2021
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
From indoor robotics to automated cars, the number of robots in our day-to-day life is growing tremendously. Products such as smart speakers, wearable devices, home robots, and self-driving cars are already here, and many more smart assistants are expected in the next few years. These robotic systems interact with humans and their surrounding environments to perform their designated tasks. Research on robotic perception, vision-and-language navigation, speech recognition, and related topics drives these applications, and there has been significant progress in the past decade.
The focus of this thesis is to develop models that tackle some of these challenges and enable better robot perception and navigation systems. A perception system must handle a multitude of tasks, such as understanding human cues and visually perceiving the environment. To this end, we propose an approach to the object referring (OR) task that uses spoken language, human gaze, and natural-language text. We train and evaluate our method on the Cityscapes dataset, augmented with human gaze and speech captured in an indoor setup. We observe that performance on the language-guided OR task improves with the addition of the human-side gaze and speech modalities and of the visual scene modalities of RGB, depth, and motion.
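As a rough, minimal sketch of how such multimodal fusion can be set up, assuming precomputed word, region, and gaze features; all module names, layer sizes, and the region-scoring formulation here are illustrative assumptions, not the thesis's actual architecture:

```python
# Hypothetical sketch: score candidate object regions against a referring
# expression, conditioned on per-region gaze features. Not the thesis model.
import torch
import torch.nn as nn

class ObjectReferringScorer(nn.Module):
    def __init__(self, text_dim=256, visual_dim=512, gaze_dim=64, hidden=256):
        super().__init__()
        self.text_enc = nn.GRU(input_size=300, hidden_size=text_dim, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + visual_dim + gaze_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one matching score per candidate region
        )

    def forward(self, word_embs, region_feats, gaze_feats):
        # word_embs: (B, T, 300) word embeddings of the referring expression
        # region_feats: (B, N, visual_dim) features of N candidate regions
        #   (e.g., RGB, depth, and motion features concatenated upstream)
        # gaze_feats: (B, N, gaze_dim) gaze-heatmap feature pooled per region
        _, h = self.text_enc(word_embs)             # h: (1, B, text_dim)
        text = h[-1].unsqueeze(1).expand(-1, region_feats.size(1), -1)
        fused = torch.cat([text, region_feats, gaze_feats], dim=-1)
        return self.fuse(fused).squeeze(-1)         # (B, N) region scores

scorer = ObjectReferringScorer()
scores = scorer(torch.randn(2, 12, 300), torch.randn(2, 8, 512), torch.randn(2, 8, 64))
pred = scores.argmax(dim=-1)  # index of the referred region per example
```

One convenience of fusing by concatenation is that a modality can be ablated simply by zeroing its feature block, which mirrors the kind of modality-addition study the abstract describes.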
Next, the thesis turns to the challenge of robot navigation. The vast majority of research targets indoor or simulated outdoor navigation. Here, we define the problem of language-based robot navigation in real outdoor environments, where an agent with a first-person view must understand and execute natural language instructions. We create a large-scale dataset of verbal navigation instructions based on Google Street View. Experiments on this dataset show that the proposed approach improves language-guided automatic wayfinding.
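A minimal sketch of an instruction-conditioned wayfinding policy over a street-view graph might look as follows; the four-action space, layer sizes, and unrolling scheme are assumptions for illustration, not the dataset's actual interface:

```python
# Hypothetical sketch: encode the instruction once, then pick an action at
# each street-view step from the current panorama feature. Not the thesis model.
import torch
import torch.nn as nn

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

class WayfindingPolicy(nn.Module):
    def __init__(self, instr_dim=256, pano_dim=512, hidden=256):
        super().__init__()
        self.instr_enc = nn.LSTM(input_size=300, hidden_size=instr_dim, batch_first=True)
        self.step = nn.LSTMCell(pano_dim + instr_dim, hidden)
        self.head = nn.Linear(hidden, len(ACTIONS))

    def forward(self, instr_embs, pano_feats):
        # instr_embs: (B, T, 300) embedded navigation instruction
        # pano_feats: (B, S, pano_dim) panorama features along S steps
        _, (h_i, _) = self.instr_enc(instr_embs)
        instr = h_i[-1]                               # (B, instr_dim)
        B = instr.size(0)
        h = instr.new_zeros(B, self.step.hidden_size)
        c = instr.new_zeros(B, self.step.hidden_size)
        logits = []
        for t in range(pano_feats.size(1)):           # unroll over steps
            h, c = self.step(torch.cat([pano_feats[:, t], instr], dim=-1), (h, c))
            logits.append(self.head(h))
        return torch.stack(logits, dim=1)             # (B, S, |ACTIONS|)

policy = WayfindingPolicy()
action_logits = policy(torch.randn(2, 20, 300), torch.randn(2, 5, 512))  # (2, 5, 4)
```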
Finally, what happens to the visual perception system of a robot under poor lighting conditions or camera malfunction? The robot can then hear the environment to perceive it, as humans do. There is limited work in the literature on sound perception in outdoor environments. We develop an approach for dense semantic object labelling based on binaural sounds from the environment. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight binaural microphones and a 360° camera. We also propose two auxiliary tasks: a) a novel task of spatial sound super-resolution, and b) dense depth prediction of the scene. We then formulate the three tasks in one end-to-end multi-task network, and the evaluation on our dataset shows that all three tasks are mutually beneficial.
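A minimal sketch of such a shared-encoder, three-head multi-task setup; the spectrogram shapes, layer sizes, head designs, and loss weights are illustrative assumptions, not the thesis's exact network:

```python
# Hypothetical sketch: one shared encoder over a binaural (left/right)
# spectrogram with heads for dense semantic labelling, spatial sound
# super-resolution (predicting held-out microphone channels), and depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinauralMultiTaskNet(nn.Module):
    def __init__(self, n_classes=10, n_extra_mics=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        def up(out_ch):  # decoder head back to input resolution
            return nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            )
        self.seg_head = up(n_classes)      # dense semantic labels
        self.s3r_head = up(n_extra_mics)   # spectrograms of the other mics
        self.depth_head = up(1)            # dense depth map

    def forward(self, spec):               # spec: (B, 2, F, T)
        z = self.encoder(spec)
        return self.seg_head(z), self.s3r_head(z), self.depth_head(z)

net = BinauralMultiTaskNet()
seg, s3r, depth = net(torch.randn(2, 2, 64, 64))
# joint end-to-end training: weighted sum of the three task losses
loss = (F.cross_entropy(seg, torch.randint(0, 10, (2, 64, 64)))
        + 0.5 * F.l1_loss(s3r, torch.randn(2, 6, 64, 64))
        + 0.5 * F.l1_loss(depth, torch.randn(2, 1, 64, 64)))
```

Sharing one encoder is what lets the auxiliary tasks act as extra supervision for the semantic head, which is one plausible reading of why the three tasks help each other.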
Permanent link
https://doi.org/10.3929/ethz-b-000509287
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Computer Vision; Natural Language Processing; Sound perception; Vision and Language Navigation; Gaze Estimation
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc