Unsupervised speaker segmentation and tracking in real-time audio content analysis

Lu, Lie; Zhang, Hong-Jiang

doi:10.1007/s00530-004-0160-5

Published: April 2005

Multimedia Systems, Volume 10, pages 332–343 (2005)

Lie Lu¹ & Hong-Jiang Zhang¹

Abstract.

This paper addresses real-time speaker segmentation and speaker tracking in audio content analysis when neither the number of speakers nor their identities are known in advance. Speaker segmentation detects the speaker change boundaries in a speech stream. It is performed by a two-step algorithm consisting of potential change detection followed by refinement. Speaker tracking then builds on the segmentation results by identifying the speaker of each segment. In our approach, incremental speaker model updating and segmental clustering are proposed, which make unsupervised speaker segmentation and tracking feasible in real-time processing. A Bayesian fusion method is also proposed to fuse multiple audio features for a more reliable result, and different noise levels are used to compensate for background mismatch. Experiments show that the proposed algorithm recalls 89% of speaker change boundaries with 15% false alarms, and that 76% of speakers can be identified without supervision with 20% false alarms. Compared with previous work, the algorithm also has low computational complexity and runs in 15% of real time with a very limited analysis delay.
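
As a rough illustration of the first step (potential change detection), the sketch below slides two adjacent windows over a stream of feature vectors and flags the points where the two windows differ most. The abstract does not state which features, distance measure, window length, or threshold the paper uses. The diagonal-Gaussian model, the symmetric Kullback-Leibler divergence, and the names gaussian_stats, symmetric_kl, candidate_changes, and synthetic_stream are all assumptions made for this example. The refinement step, Bayesian fusion, and speaker tracking are not shown.

```python
import random


def gaussian_stats(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # A small variance floor avoids division by zero on constant input.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def symmetric_kl(a, b):
    """Symmetric KL divergence between two diagonal Gaussians (assumed metric)."""
    (m1, v1), (m2, v2) = a, b
    total = 0.0
    for mu1, s1, mu2, s2 in zip(m1, v1, m2, v2):
        diff2 = (mu1 - mu2) ** 2
        total += 0.5 * (s1 / s2 + s2 / s1 - 2.0)
        total += 0.5 * diff2 * (1.0 / s1 + 1.0 / s2)
    return total


def candidate_changes(frames, win=50, threshold=10.0):
    """Frame indices where the left/right window divergence peaks.

    win and threshold are illustrative values, not taken from the paper.
    """
    positions = list(range(win, len(frames) - win + 1))
    scores = [
        symmetric_kl(gaussian_stats(frames[t - win:t]),
                     gaussian_stats(frames[t:t + win]))
        for t in positions
    ]
    candidates = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - win), min(len(scores), i + win + 1)
        # Keep a point only if it exceeds the threshold and is the largest
        # score within one window length on either side.
        if s > threshold and s == max(scores[lo:hi]):
            candidates.append(positions[i])
    return candidates


def synthetic_stream(seed=0):
    """Two synthetic 'speakers' of 200 frames each, changing at frame 200."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    b = [[rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)] for _ in range(200)]
    return a + b


if __name__ == "__main__":
    print(candidate_changes(synthetic_stream()))
```

On the synthetic stream the only candidate should fall close to frame 200, where the simulated speaker changes. In a real system the candidates would then go through a refinement step to discard false alarms, as the abstract describes.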

Author information

Authors and Affiliations

Microsoft Research Asia, 5F Beijing Sigma Center, No. 49 Zhichun Road, Hai Dian District, 100080, Beijing, China
Lie Lu & Hong-Jiang Zhang

Corresponding author

Correspondence to Lie Lu.

Additional information

Published online: 12 January 2005

Part of the work presented in this paper appeared at the 10th ACM International Conference on Multimedia, 1–6 December 2002.

About this article

Cite this article

Lu, L., Zhang, HJ. Unsupervised speaker segmentation and tracking in real-time audio content analysis. Multimedia Systems 10, 332–343 (2005). https://doi.org/10.1007/s00530-004-0160-5

Issue Date: April 2005
DOI: https://doi.org/10.1007/s00530-004-0160-5
