Fusion of Acoustic and Prosodic Features for Speaker Clustering

Žibert, Janez; Mihelič, France

doi:10.1007/978-3-642-04208-9_31

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5729)

Abstract

This work focuses on speaker clustering methods used in speaker diarization systems. The purpose of speaker clustering is to group together segments that belong to the same speaker. It is usually applied in the last stage of the speaker-diarization process. We concentrate on developing proper representations of speaker segments for clustering and explore different similarity measures for joining speaker segments together. We implement two competing systems. The first is a standard approach that uses bottom-up agglomerative clustering with the Bayesian Information Criterion (BIC) as the merging criterion. The second is a fusion speaker clustering system in which the speaker segments are modeled by both acoustic and prosodic representations. The idea is to additionally model the speaker's prosodic characteristics and add them to the basic acoustic information estimated from the speaker segments. We construct 10 basic prosodic features derived from the energy of the audio signal, the estimated pitch contours, and the recognized voiced and unvoiced regions of speech. In this way we add higher-level information to the representations of the speaker segments, which leads to improved clustering of the segments when speakers have similar acoustic characteristics or when acoustic conditions are poor.
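
The baseline described in the abstract, bottom-up agglomerative clustering with BIC as the merging criterion, can be illustrated with a minimal sketch. The abstract does not give the model details, so everything specific here is an assumption made for illustration: each cluster is modeled by a single Gaussian with diagonal covariance, the penalty weight `lam` is set to 1.0, and the names `delta_bic`, `cluster`, and `make_segment` are hypothetical. The prosodic fusion system is not shown, because the abstract does not specify how the two representations are combined.

```python
import math
import random


def log_det_diag(frames):
    """Log-determinant of the ML diagonal covariance of a list of feature frames."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-6))  # variance floor avoids log(0)
    return total


def delta_bic(a, b, lam=1.0):
    """Delta-BIC between modeling a and b separately or as one cluster.

    Negative values favor a single shared model, i.e. merging.
    Assumption: one diagonal-covariance Gaussian per cluster,
    so each model has 2 * dim free parameters.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    dim = len(a[0])
    penalty = 0.5 * lam * (2 * dim) * math.log(n)
    gain = 0.5 * (n * log_det_diag(a + b)
                  - n1 * log_det_diag(a)
                  - n2 * log_det_diag(b))
    return gain - penalty


def cluster(segments, lam=1.0):
    """Greedy bottom-up clustering: merge the closest pair while Delta-BIC < 0."""
    clusters = [(list(seg), [i]) for i, seg in enumerate(segments)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i][0], clusters[j][0], lam)
                if best is None or score < best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score >= 0:  # no pair is better explained by one model: stop
            break
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(ids) for _, ids in clusters)


def make_segment(mean, n_frames=200, dim=2):
    """Hypothetical synthetic segment: Gaussian frames around a speaker mean."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)]
            for _ in range(n_frames)]


# Toy data: segments 0 and 2 share one synthetic "speaker", 1 and 3 another.
random.seed(0)
segments = [make_segment(0.0), make_segment(5.0),
            make_segment(0.0), make_segment(5.0)]
result = cluster(segments)
print(result)
```

In this toy setup the two synthetic speakers are well separated, so the stopping rule leaves one cluster per speaker. Real segments with similar acoustic characteristics would be much harder to separate with acoustic features alone, which is the case the prosodic features in the abstract are meant to address.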

Author information

Authors and Affiliations

Primorska Institute of Natural Sciences and Technology, University of Primorska, Muzejski trg 2, Koper, SI, 6000, Slovenia
Janez Žibert
Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, Ljubljana, SI, 1000, Slovenia
France Mihelič

Editor information

Editors and Affiliations

University of West Bohemia at Pilsen, Czech Republic
Václav Matoušek
Department of Computer Science, University of West Bohemia in Pilsen, Univerzitni 8, 30614, Plzen, Czech Republic
Pavel Mautner

Cite this paper

Žibert, J., Mihelič, F. (2009). Fusion of Acoustic and Prosodic Features for Speaker Clustering. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2009. Lecture Notes in Computer Science, vol 5729. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04208-9_31

DOI: https://doi.org/10.1007/978-3-642-04208-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04207-2
Online ISBN: 978-3-642-04208-9
eBook Packages: Computer Science, Computer Science (R0)
