Unsupervised speaker segmentation and tracking in real-time audio content analysis

Multimedia Systems

Abstract.

This paper addresses real-time speaker segmentation and speaker tracking in audio content analysis when neither the number of speakers nor their identities are known in advance. Speaker segmentation detects the speaker change boundaries in a speech stream. It is performed by a two-step algorithm: potential change detection followed by refinement. Speaker tracking then builds on the segmentation results by identifying the speaker of each segment. Our approach proposes incremental speaker model updating and segmental clustering, which make unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to combine multiple audio features into a more reliable result, and different noise levels are used to compensate for background mismatch. Experiments show that the proposed algorithm recalls 89% of speaker change boundaries with 15% false alarms, and that 76% of speakers can be identified without supervision with 20% false alarms. Compared with previous work, the algorithm also has low computational complexity and runs in 15% of real time with a very limited analysis delay.
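The abstract does not give the form of the Bayesian fusion, so the following is only a minimal sketch of one common way to fuse several features into a single change decision, not the authors' method. It assumes the features are conditionally independent (naive Bayes) and that each feature's distance between adjacent speech windows is Gaussian under both the "speaker change" and "same speaker" hypotheses. The function names, the Gaussian parameters, and the prior are all hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian; working in logs avoids underflow."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def fuse_change_posterior(distances, change_params, same_params, prior_change=0.5):
    """Return P(speaker change | all feature distances).

    distances     -- one distance per audio feature (hypothetical inputs)
    change_params -- (mean, std) per feature under the "change" hypothesis
    same_params   -- (mean, std) per feature under the "same speaker" hypothesis
    Assumes conditional independence of features given the hypothesis.
    """
    if not 0.0 < prior_change < 1.0:
        raise ValueError("prior_change must lie strictly between 0 and 1")
    log_change = math.log(prior_change)
    log_same = math.log(1.0 - prior_change)
    for d, (mc, sc), (ms, ss) in zip(distances, change_params, same_params):
        log_change += gaussian_logpdf(d, mc, sc)
        log_same += gaussian_logpdf(d, ms, ss)
    # Normalize in the log domain for numerical stability.
    top = max(log_change, log_same)
    p_change = math.exp(log_change - top)
    p_same = math.exp(log_same - top)
    return p_change / (p_change + p_same)


# Two hypothetical features, both observing a large distance.
posterior = fuse_change_posterior(
    distances=[2.0, 2.0],
    change_params=[(2.0, 1.0), (2.0, 1.0)],
    same_params=[(0.0, 1.0), (0.0, 1.0)],
)
print(f"P(change) = {posterior:.3f}")  # prints "P(change) = 0.982"
```

With a single feature whose distance falls midway between the two hypothetical means, the posterior stays at the prior (0.5). Agreement across features is what pushes the decision toward one hypothesis, which is the practical motivation for fusing several features rather than thresholding one.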

Author information

Corresponding author

Correspondence to Lie Lu.

Additional information

Published online: 12 January 2005

Part of the work presented in this paper was published at the 10th ACM International Conference on Multimedia, 1-6 December 2002.

About this article

Cite this article

Lu, L., Zhang, HJ. Unsupervised speaker segmentation and tracking in real-time audio content analysis. Multimedia Systems 10, 332–343 (2005). https://doi.org/10.1007/s00530-004-0160-5

  • DOI: https://doi.org/10.1007/s00530-004-0160-5
