ISCA Archive Interspeech 2022

Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter

Jinzheng Zhao, Peipei Wu, Xubo Liu, Shidrokh Goudarzi, Haohe Liu, YONG XU, Wenwu Wang

Audio and visual signals can be used jointly to provide complementary information for multi-speaker tracking. Face detectors and color histograms can provide visual measurements, while Direction of Arrival (DOA) lines and global coherence field (GCF) maps can provide audio measurements. GCF, a traditional sound source localization method, has been widely used to provide audio measurements in audio-visual speaker tracking by estimating the positions of speakers. However, GCF cannot directly handle multi-speaker scenarios because spurious peaks emerge on the GCF map, making it difficult to find the non-dominant speakers. To overcome this limitation, we propose a phase-aware VoiceFilter and a separation-before-localization method, which enables the audio mixture to be separated into individual speech sources while retaining their phases. This allows us to calculate the GCF map for multiple speakers, thereby estimating their positions accurately and concurrently. Based on this method, we design an adaptive audio measurement likelihood for audio-visual multi-speaker tracking using a Poisson multi-Bernoulli mixture (PMBM) filter. The experiments demonstrate that our proposed tracker achieves state-of-the-art results on the AV16.3 dataset.
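
To make the GCF step concrete, the following is a minimal sketch of a GCF map for a single, already separated source. It is not the authors' implementation. It assumes the common definition of GCF as the average, over microphone pairs, of the GCC-PHAT value at the time difference of arrival (TDOA) that each candidate position implies. The function names `gcc_phat` and `gcf_map`, the toy geometry, and the choice `fs = c = 1` (so that distances equal delays in samples) are illustrative assumptions. The phase-aware VoiceFilter separation stage is not reproduced here.

```python
from itertools import combinations

import numpy as np


def gcc_phat(x1, x2, n_fft):
    """GCC-PHAT of two signals; index k holds lag k (negative lags wrap around)."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep only the phase
    return np.fft.irfft(cross, n=n_fft)


def gcf_map(signals, mic_pos, grid, fs, c=343.0):
    """Average GCC-PHAT over all microphone pairs at the TDOA of each grid point.

    signals: (n_mics, n_samples), mic_pos: (n_mics, dim), grid: (n_points, dim).
    Assumes every candidate TDOA is shorter than n_samples.
    """
    n_mics, n_samples = signals.shape
    n_fft = 2 * n_samples  # zero-pad so positive and negative lags do not alias
    # Distance from every grid point to every microphone: shape (n_points, n_mics)
    dists = np.linalg.norm(grid[:, None, :] - mic_pos[None, :, :], axis=2)
    pairs = list(combinations(range(n_mics), 2))
    acc = np.zeros(len(grid))
    for i, j in pairs:
        r = gcc_phat(signals[i], signals[j], n_fft)
        # Delay of mic i relative to mic j, in samples, for each grid point
        lags = np.rint((dists[:, i] - dists[:, j]) / c * fs).astype(int)
        acc += r[lags % n_fft]
    return acc / len(pairs)


# Toy example (assumed geometry): a white-noise source at the origin and four
# microphones whose distances to it are whole numbers (5, 13, 17, 29).
rng = np.random.default_rng(0)
mic_pos = np.array([[3.0, 4.0], [-5.0, 12.0], [-8.0, -15.0], [20.0, -21.0]])
source = np.array([0.0, 0.0])
delays = np.rint(np.linalg.norm(mic_pos - source, axis=1)).astype(int)

n = 2048
max_d = int(delays.max())
s = rng.standard_normal(n + max_d)
# Microphone i observes the source delayed by delays[i] samples.
signals = np.stack([s[max_d - d: max_d - d + n] for d in delays])

axis = np.arange(-20, 21, 5).astype(float)
gx, gy = np.meshgrid(axis, axis)
grid = np.column_stack([gx.ravel(), gy.ravel()])

gcf = gcf_map(signals, mic_pos, grid, fs=1.0, c=1.0)
src_idx = int(np.flatnonzero((grid == source).all(axis=1))[0])
```

With one source, the map attains its maximum at the grid point of the true position. With an unseparated mixture of several speakers, the per-pair correlation peaks of different speakers combine on the same map, which is the spurious-peak problem the abstract describes. Computing one map per separated source, as in this sketch, avoids that superposition.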


doi: 10.21437/Interspeech.2022-10190

Cite as: Zhao, J., Wu, P., Liu, X., Goudarzi, S., Liu, H., XU, Y., Wang, W. (2022) Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter. Proc. Interspeech 2022, 3704-3708, doi: 10.21437/Interspeech.2022-10190

@inproceedings{zhao22j_interspeech,
  author={Jinzheng Zhao and Peipei Wu and Xubo Liu and Shidrokh Goudarzi and Haohe Liu and YONG XU and Wenwu Wang},
  title={{Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3704--3708},
  doi={10.21437/Interspeech.2022-10190}
}