Abstract.
This paper addresses real-time speaker segmentation and speaker tracking in audio content analysis when neither the number of speakers nor their identities are known in advance. Speaker segmentation detects the speaker change boundaries in a speech stream. It is performed by a two-step algorithm consisting of potential change detection followed by refinement. Speaker tracking is then performed on the segmentation results by identifying the speaker of each segment. In our approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to fuse multiple audio features for a more reliable result, and different noise levels are utilized to compensate for background mismatch. Experiments show that the proposed algorithm recalls 89% of speaker change boundaries with 15% false alarms, and that 76% of speakers can be identified without supervision with 20% false alarms. Compared with previous work, the algorithm also has low computational complexity and runs in 15% of real time with a very limited analysis delay.
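To make the two-step idea concrete, the following is a minimal sketch of potential change detection followed by refinement. It is not the authors' implementation: the abstract does not specify the features, the distance measure, or the thresholds, so this sketch assumes a one-dimensional synthetic feature, a symmetric Kullback-Leibler divergence between Gaussians fitted to adjacent windows, and a fixed threshold. The names `sym_kl`, `detect_changes`, `win`, and `threshold` are hypothetical.

```python
import random
from statistics import fmean, pvariance


def sym_kl(a, b):
    """Symmetric KL divergence between 1-D Gaussians fitted to a and b.

    Assumption: the abstract does not name a distance; this is a stand-in.
    """
    ma, mb = fmean(a), fmean(b)
    va, vb = pvariance(a) + 1e-6, pvariance(b) + 1e-6  # avoid division by zero
    return 0.5 * (va / vb + vb / va - 2) + 0.5 * (ma - mb) ** 2 * (1 / va + 1 / vb)


def detect_changes(x, win=50, threshold=2.0):
    """Return frame indices of detected change boundaries in the sequence x."""
    n = len(x)
    # Distance between the windows immediately before and after each frame t.
    dist = {t: sym_kl(x[t - win:t], x[t:t + win]) for t in range(win, n - win + 1)}

    # Step 1 (potential change detection): keep frames whose distance exceeds
    # the threshold and is the largest within +/- win frames.
    candidates = [
        t for t, d in dist.items()
        if d > threshold
        and d == max(dist.get(u, 0.0) for u in range(t - win, t + win + 1))
    ]

    # Step 2 (refinement, assumed form): re-test each candidate using the whole
    # segments on either side, which gives more data than the short windows.
    accepted = []
    for i, t in enumerate(candidates):
        start = accepted[-1] if accepted else 0
        end = candidates[i + 1] if i + 1 < len(candidates) else n
        if sym_kl(x[start:t], x[t:end]) > threshold:
            accepted.append(t)
    return accepted


# Synthetic stream: two "speakers" with different feature means, switching at frame 300.
random.seed(0)
stream = [random.gauss(0.0, 1.0) for _ in range(300)] + \
         [random.gauss(4.0, 1.0) for _ in range(300)]
changes = detect_changes(stream)
print(changes)
```

In a real system the scalar samples would be replaced by multidimensional feature vectors per frame, and the refinement step would use whatever speaker models the system maintains. The sketch only illustrates why a cheap local test followed by a segment-level test can reduce false alarms.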
Additional information
Published online: 12 January 2005
Part of the work presented in this paper was published in the 10th ACM International Conference on Multimedia, 1-6 December 2002
Cite this article
Lu, L., Zhang, HJ. Unsupervised speaker segmentation and tracking in real-time audio content analysis. Multimedia Systems 10, 332–343 (2005). https://doi.org/10.1007/s00530-004-0160-5
DOI: https://doi.org/10.1007/s00530-004-0160-5