ISCA Archive Interspeech 2018

Improving Response Time of Active Speaker Detection Using Visual Prosody Information Prior to Articulation

Fasih Haider, Saturnino Luz, Carl Vogel, Nick Campbell

Natural multi-party interaction commonly involves turning one's gaze towards the speaker who has the floor. Implementing virtual agents or robots that are able to engage in natural conversations with humans therefore requires enabling machines to exhibit this form of communicative behaviour. This task is called active speaker detection. In this paper, we propose a method for active speaker detection that uses visual prosody information (lip and head movements) before and after speech articulation to decrease the machine's response time, and we demonstrate the discriminating power of visual prosody before and after speech articulation for this task. The results show that visual prosody information from the one second before articulation is helpful in detecting the active speaker. Lip movements provide better results than head movements, and fusing the two improves accuracy. We also used visual prosody information from the first second of the speech utterance and found that it provides more accurate results than the second before articulation. We conclude that fusing lip movements from both regions (the first second of speech and the second before articulation) improves the accuracy of active speaker detection.
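To make the feature-fusion idea concrete, here is a minimal sketch of window-based fusion for active speaker detection. The abstract does not specify the features or classifier used, so the feature dimensions, the placeholder data, and the SVM learner below are all illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch: fuse visual prosody features from two one-second windows
# (the second before articulation and the first second of speech)
# and classify segments as active speaker vs. non-speaker.
# All shapes, values, and the classifier choice are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_segments = 200  # hypothetical number of one-second video segments

# Hypothetical per-segment movement statistics (placeholder data):
lip_pre = rng.normal(size=(n_segments, 10))    # lips, before articulation
lip_first = rng.normal(size=(n_segments, 10))  # lips, first second of speech

# Placeholder labels: 1 = active speaker, 0 = non-speaker.
y = rng.integers(0, 2, size=n_segments)

# Feature-level fusion: concatenate lip features from both windows,
# the combination the abstract reports as most accurate.
X = np.hstack([lip_pre, lip_first])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```

With real features, the same concatenation step would apply to head-movement statistics as well, which the paper reports as a weaker but complementary cue.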


doi: 10.21437/Interspeech.2018-2310

Cite as: Haider, F., Luz, S., Vogel, C., Campbell, N. (2018) Improving Response Time of Active Speaker Detection Using Visual Prosody Information Prior to Articulation. Proc. Interspeech 2018, 1736-1740, doi: 10.21437/Interspeech.2018-2310

@inproceedings{haider18b_interspeech,
  author={Fasih Haider and Saturnino Luz and Carl Vogel and Nick Campbell},
  title={{Improving Response Time of Active Speaker Detection Using Visual Prosody Information Prior to Articulation}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1736--1740},
  doi={10.21437/Interspeech.2018-2310}
}