Open access
Author
Date
2021
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
From indoor robotics to automated cars, the number of robots in our day-to-day life is growing tremendously. Products such as smart speakers, wearable devices, home robots, and self-driving cars are already here, and many more smart assistants are expected in the next few years. These robotic systems interact with humans and their surrounding environments to perform their designated tasks. Research on robotic perception, vision-and-language navigation, speech recognition, and related topics drives these applications, and there has been significant progress in the past decade.
The focus of this thesis is to develop models that tackle some of these challenges and enable better robot perception and navigation systems. A perception system must handle a multitude of tasks, such as understanding human cues and visually perceiving the environment. To this end, we propose an approach to the object referring (OR) task that uses spoken language, human gaze, and natural-language text. We train and evaluate our method on the Cityscapes dataset, augmented with human gaze and speech captured in an indoor setup. We observe that performance on the language-guided OR task improves with the addition of the human-side gaze and speech modalities and of the visual scene modalities of RGB, depth, and motion.
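As a rough, minimal sketch of how such multimodal fusion can be set up, assuming precomputed word, region, and gaze features; all module names, layer sizes, and the region-scoring formulation here are illustrative assumptions, not the thesis's actual architecture:

```python
# Hypothetical sketch: score candidate object regions against a referring
# expression, conditioned on per-region gaze features. Not the thesis model.
import torch
import torch.nn as nn

class ObjectReferringScorer(nn.Module):
    def __init__(self, text_dim=256, visual_dim=512, gaze_dim=64, hidden=256):
        super().__init__()
        self.text_enc = nn.GRU(input_size=300, hidden_size=text_dim, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + visual_dim + gaze_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one matching score per candidate region
        )

    def forward(self, word_embs, region_feats, gaze_feats):
        # word_embs: (B, T, 300) word embeddings of the referring expression
        # region_feats: (B, N, visual_dim) features of N candidate regions
        #   (e.g., RGB, depth, and motion features concatenated upstream)
        # gaze_feats: (B, N, gaze_dim) gaze-heatmap feature pooled per region
        _, h = self.text_enc(word_embs)             # h: (1, B, text_dim)
        text = h[-1].unsqueeze(1).expand(-1, region_feats.size(1), -1)
        fused = torch.cat([text, region_feats, gaze_feats], dim=-1)
        return self.fuse(fused).squeeze(-1)         # (B, N) region scores

scorer = ObjectReferringScorer()
scores = scorer(torch.randn(2, 12, 300), torch.randn(2, 8, 512), torch.randn(2, 8, 64))
pred = scores.argmax(dim=-1)  # index of the referred region per example
```

One convenience of fusing by concatenation is that a modality can be ablated simply by zeroing its feature block, which mirrors the kind of modality-addition study the abstract describes.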
Next, the thesis turns to the challenge of robot navigation. The vast majority of research targets indoor or simulated outdoor navigation. Here, we define the problem of language-based robot navigation in real outdoor environments, where an agent with a first-person view must understand and execute natural language instructions. We create a large-scale dataset of verbal navigation instructions based on Google Street View. Experiments on this dataset show that the proposed approach improves language-guided automatic wayfinding.
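A minimal sketch of an instruction-conditioned wayfinding policy over a street-view graph might look as follows; the four-action space, layer sizes, and unrolling scheme are assumptions for illustration, not the dataset's actual interface:

```python
# Hypothetical sketch: encode the instruction once, then pick an action at
# each street-view step from the current panorama feature. Not the thesis model.
import torch
import torch.nn as nn

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

class WayfindingPolicy(nn.Module):
    def __init__(self, instr_dim=256, pano_dim=512, hidden=256):
        super().__init__()
        self.instr_enc = nn.LSTM(input_size=300, hidden_size=instr_dim, batch_first=True)
        self.step = nn.LSTMCell(pano_dim + instr_dim, hidden)
        self.head = nn.Linear(hidden, len(ACTIONS))

    def forward(self, instr_embs, pano_feats):
        # instr_embs: (B, T, 300) embedded navigation instruction
        # pano_feats: (B, S, pano_dim) panorama features along S steps
        _, (h_i, _) = self.instr_enc(instr_embs)
        instr = h_i[-1]                               # (B, instr_dim)
        B = instr.size(0)
        h = instr.new_zeros(B, self.step.hidden_size)
        c = instr.new_zeros(B, self.step.hidden_size)
        logits = []
        for t in range(pano_feats.size(1)):           # unroll over steps
            h, c = self.step(torch.cat([pano_feats[:, t], instr], dim=-1), (h, c))
            logits.append(self.head(h))
        return torch.stack(logits, dim=1)             # (B, S, |ACTIONS|)

policy = WayfindingPolicy()
action_logits = policy(torch.randn(2, 20, 300), torch.randn(2, 5, 512))  # (2, 5, 4)
```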
Finally, what happens to the visual perception system of a robot under poor lighting conditions or camera malfunction? The robot can then hear the environment to perceive it, as humans do. There is limited work in the literature on sound perception in outdoor environments. We develop an approach for dense semantic object labelling based on binaural sounds from the environment. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight binaural microphones and a 360° camera. We also propose two auxiliary tasks: a) a novel task of spatial sound super-resolution, and b) dense depth prediction of the scene. We then formulate the three tasks in one end-to-end multi-task network, and the evaluation on our dataset shows that all three tasks are mutually beneficial.
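A minimal sketch of such a shared-encoder, three-head multi-task setup; the spectrogram shapes, layer sizes, head designs, and loss weights are illustrative assumptions, not the thesis's exact network:

```python
# Hypothetical sketch: one shared encoder over a binaural (left/right)
# spectrogram with heads for dense semantic labelling, spatial sound
# super-resolution (predicting held-out microphone channels), and depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinauralMultiTaskNet(nn.Module):
    def __init__(self, n_classes=10, n_extra_mics=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        def up(out_ch):  # decoder head back to input resolution
            return nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            )
        self.seg_head = up(n_classes)      # dense semantic labels
        self.s3r_head = up(n_extra_mics)   # spectrograms of the other mics
        self.depth_head = up(1)            # dense depth map

    def forward(self, spec):               # spec: (B, 2, F, T)
        z = self.encoder(spec)
        return self.seg_head(z), self.s3r_head(z), self.depth_head(z)

net = BinauralMultiTaskNet()
seg, s3r, depth = net(torch.randn(2, 2, 64, 64))
# joint end-to-end training: weighted sum of the three task losses
loss = (F.cross_entropy(seg, torch.randint(0, 10, (2, 64, 64)))
        + 0.5 * F.l1_loss(s3r, torch.randn(2, 6, 64, 64))
        + 0.5 * F.l1_loss(depth, torch.randn(2, 1, 64, 64)))
```

Sharing one encoder is what lets the auxiliary tasks act as extra supervision for the semantic head, which is one plausible reading of why the three tasks help each other.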
Permanent link
https://doi.org/10.3929/ethz-b-000509287
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Computer Vision; Natural Language Processing; Sound perception; Vision and Language Navigation; Gaze Estimation
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc