Elsevier

Signal Processing

Volume 129, December 2016, Pages 137-149

Robust indoor speaker recognition in a network of audio and video sensors

https://doi.org/10.1016/j.sigpro.2016.04.014
Open access under a Creative Commons license

Highlights

  • A joint audio–video system is proposed to recognise a speaker amid acoustic and visual clutter.

  • Audio–video signals are combined at low and higher levels of data abstraction (a minimal fusion sketch follows this list).

  • Audio–video signals are combined to detect, track and recognise the speaker.

  • Cooperative audio–video data lead to a robust tracking system in a realistic scenario.

  • The required hardware is less complex and the scenario less constrained than in other approaches.
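
The sketch below illustrates one simple way such a low-level audio–video combination can be realised: two independent 2-D position estimates of the speaker, one from acoustic localisation and one from the video detector, are merged by inverse-variance weighting. The function name, sensor variances and coordinates are hypothetical and are not taken from the paper.

    import numpy as np

    def fuse_estimates(pos_audio, var_audio, pos_video, var_video):
        """Combine two 2-D position estimates by inverse-variance weighting."""
        w_audio = 1.0 / var_audio   # give more weight to the less noisy modality
        w_video = 1.0 / var_video
        fused = (w_audio * pos_audio + w_video * pos_video) / (w_audio + w_video)
        fused_var = 1.0 / (w_audio + w_video)
        return fused, fused_var

    audio_xy = np.array([2.1, 3.4])   # hypothetical acoustic localisation output (metres)
    video_xy = np.array([2.3, 3.1])   # hypothetical video detection mapped to the floor plane
    xy, var = fuse_estimates(audio_xy, 0.25, video_xy, 0.05)
    print("fused position:", xy, "variance:", var)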

Abstract

Situational awareness is achieved naturally by the human senses of sight and hearing working in combination. Automatic scene understanding aims to replicate this ability using microphones and cameras in cooperation. In this paper, audio and video signals are fused and integrated at different levels of semantic abstraction. We detect and track a speaker who is relatively unconstrained, i.e., free to move indoors within an area larger than in comparable reported work, which is usually limited to round-table meetings. The system is relatively simple, consisting of just four microphone pairs and a single camera. Results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. System evaluation is performed on both single- and multi-modality tracking. The performance improvement given by the audio–video integration and fusion is quantified in terms of tracking precision and accuracy, as well as speaker diarisation error rate and precision–recall (recognition). Improvements over the closest works are evaluated: 56% in sound source localisation computational cost over an audio-only system, 8% in speaker diarisation error rate over an audio-only speaker recognition unit, and 36% on the precision–recall metric over an audio–video dominant speaker recognition method.
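
As an illustration of the recognition metrics referred to above, the following sketch computes frame-level precision and recall for a given speaker, and a simplified diarisation error rate, from reference and hypothesis label sequences. The labels and scoring rules are illustrative assumptions, not the evaluation protocol used in the paper.

    def precision_recall(reference, hypothesis, speaker):
        """Frame-level precision and recall for one speaker identity."""
        tp = sum(r == speaker and h == speaker for r, h in zip(reference, hypothesis))
        fp = sum(r != speaker and h == speaker for r, h in zip(reference, hypothesis))
        fn = sum(r == speaker and h != speaker for r, h in zip(reference, hypothesis))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    def diarisation_error_rate(reference, hypothesis, silence="sil"):
        """Simplified DER: frames whose label is wrong, missed or falsely
        detected, divided by the total reference speech frames."""
        errors = sum((r != h) and not (r == silence and h == silence)
                     for r, h in zip(reference, hypothesis))
        speech = sum(r != silence for r in reference)
        return errors / speech if speech else 0.0

    ref = ["A", "A", "sil", "B", "B", "B"]   # hypothetical reference labels
    hyp = ["A", "sil", "sil", "B", "A", "B"] # hypothetical system output
    print(precision_recall(ref, hyp, "A"))   # (0.5, 0.5)
    print(diarisation_error_rate(ref, hyp))  # 0.4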

Keywords

Surveillance
Speaker diarisation
Security biometric
Audio–video speaker tracking
Multimodal fusion


This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) Grant number EP/J015180/1 and the MOD University Defence Research Collaboration in Signal Processing.