Abstract:
We propose a novel audio-visual simultaneous localization and mapping (SLAM) framework that exploits the human pose and acoustic speech of human sound sources, allowing a robot equipped with a microphone array and a monocular camera to track, map, and interact with human partners in an indoor environment. Since human interaction is characterized by features perceived not only in the visual modality but also in the acoustic modality, SLAM systems must utilize information from both. Using a state-of-the-art beamforming technique, we separate the sound components corresponding to speech and noise, and estimate the Directions-of-Arrival (DoA) of active sound sources as representations of the observed features in the acoustic modality. From human pose estimated by a monocular camera, we obtain the relative positions of humans as representations of the observed features in the visual modality. With these techniques, we aim to overcome the restrictions imposed by intermittent speech, noisy and reverberant periods, triangulation of sound-source range, and the limited visual field of view, and we subsequently perform early fusion on these representations. The resulting system allows complementary action between the audio and visual sensor modalities in the simultaneous mapping of multiple human sound sources and the localization of the observer's position.
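To illustrate the kind of early fusion described above, the following is a minimal sketch (not the authors' implementation): a bearing-only acoustic DoA observation and a visual relative-position observation from monocular pose estimation are combined to update a single human landmark in the map. The function names, noise-free inputs, and the simple weighted-average fusion are all illustrative assumptions; the paper's actual pipeline uses beamforming-based DoA estimation and its own fusion scheme.

```python
# Illustrative sketch only: fuse an acoustic DoA (bearing-only) observation
# with a visual relative position (from monocular human-pose estimation) to
# update one human landmark. Weights and geometry handling are assumptions.
import numpy as np

def doa_to_unit(doa_rad):
    """Unit direction vector in the observer frame for an acoustic DoA (azimuth)."""
    return np.array([np.cos(doa_rad), np.sin(doa_rad)])

def visual_relative_position(range_m, bearing_rad):
    """Relative human position in the observer frame from pose estimation."""
    return range_m * np.array([np.cos(bearing_rad), np.sin(bearing_rad)])

def fuse_observation(landmark_xy, observer_xy, observer_yaw,
                     doa_rad=None, visual_rb=None,
                     w_acoustic=0.3, w_visual=0.7):
    """Early fusion: use whichever modalities are available in this frame.

    The acoustic channel constrains only bearing, so it is projected onto the
    current range estimate; the visual channel provides a full 2-D position.
    """
    est = landmark_xy.copy()
    rel_est = landmark_xy - observer_xy
    if doa_rad is not None:
        # Bearing-only update: keep the estimated range, correct the direction.
        rng = np.linalg.norm(rel_est)
        acoustic_xy = observer_xy + rng * doa_to_unit(observer_yaw + doa_rad)
        est = (1 - w_acoustic) * est + w_acoustic * acoustic_xy
    if visual_rb is not None:
        r, b = visual_rb
        c, s = np.cos(observer_yaw), np.sin(observer_yaw)
        R = np.array([[c, -s], [s, c]])  # observer-to-world rotation
        visual_xy = observer_xy + R @ visual_relative_position(r, b)
        est = (1 - w_visual) * est + w_visual * visual_xy
    return est

# Example: a speaking human near (2, 1), observer at the origin facing +x.
landmark = np.array([1.8, 1.2])  # current map estimate of the human
updated = fuse_observation(landmark,
                           observer_xy=np.array([0.0, 0.0]),
                           observer_yaw=0.0,
                           doa_rad=np.arctan2(1.0, 2.0),
                           visual_rb=(np.hypot(2.0, 1.0), np.arctan2(1.0, 2.0)))
print(updated)  # estimate moves toward the true position (2, 1)
```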
Published in: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Date of Conference: 14-18 October 2019
Date Added to IEEE Xplore: 13 January 2020