Abstract
This paper describes the Speech Activity Detection (SAD) and Speaker Diarization (SPKR) systems that were developed by the Athens Information Technology in the scope of the NIST RT-06S evaluations. The SAD system performs classification of recorded frames into speech and non-speech, using Linear Discriminant Analysis (LDA), while the SPKR one initially segments recordings into speech intervals based on the Bayesian Information Criterion (BIC), and then applies a two-step clustering strategy to group segments from the same speaker together. Following a discussion of the intrinsics of the two systems, we report and comment on our results on the RT-06S corpus [20].
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Weiser, M.: The Computer for the 21st Century. Scientific American 265(3), 66–75 (1991)
Waibel, A., Steusloff, H., Stiefelhagen, R., et al.: CHIL: Computers in the Human Interaction Loop. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal (April 2004)
Pnevmatikakis, A., Talantzis, F., Soldatos, J., Polymenakos, L.: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces. In: Maglogiannis, I., Karpouzis, K., Bramer, M. (eds.) Artificial Intelligence Applications and Innovations (AIAI 2006), pp. 290–301. Springer, Heidelberg (2006)
Katsarakis, N., Souretis, G., Talantzis, F., Pnevmatikakis, A., Polymenakos, L.: 3D Audiovisual Person Tracking Using Kalman Filtering and Information Theory. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122, pp. 45–54. Springer, Heidelberg (2007)
Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: A Decision Fusion System across Time and Classifiers for Audio-visual Person Identification. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122. Springer, Heidelberg (2007)
Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: Enhancing the Performance of a GMM-based Speaker Identification System in a Multi-Microphone Setup. In: INTERSPEECH 2006, Pittsburgh (accepted, September 2006)
Rabiner, L.R., Sambur, M.R.: An algorithm for determining the endpoints of isolated utterances. The Bell System Technical Journal 54, 297 (1975)
Li, K., Swamy, N.S., Ahmad, M.O.: An Improved Voice Activity Detection Using Higher Order Statistics. IEEE Transactions on Speech and Audio Processing 13(5) (September 2005)
Stegmann, J., Schroeder, G.: Robust Voice Activity Detection Based on the Wavelet Transform. In: Proc. IEEE Workshop on Speech Coding For Telecommunications, Pocono Manor, Pennsylvania, USA, pp. 99–100 (September 1997)
Reynolds, D.A., Rose, R.C., Smith, M.J.T.: PC-Based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System. In: International Conference on Signal Processing Applications and Technology, Hyatt Regency, Cambridge, Massachusetts, pp. 967–973 (November 1992)
Martin, A., Charlet, C., Mauary, L.: Robust Speech/Non- Speech Detection Using LDA Applied to MFCC. IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City (2001)
Duda, R., Hart, R., Stork, D.: Pattern Classification. Wiley-Interscience, New York (2001)
Rabiner, L., Schafer, R.: Digital Processing of Speech Signals. Prentice Hall Series in Signal Processing (September 1978)
Wu, T.-Y., Lu, L., Chen, K., Zhang, H.-J.: Universal Background Models for Real-Time Speaker Change Detection. In: MMM 2003, pp. 135–149 (2003)
Moraru, D., Meignier, S., Fredouille, C., Besacier, L., Bonastre, J.-F.: The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation. In: Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP 2004), Montreal, Canada (2004)
Gauvain, J.L., Lamel, L., Adda, G.: Partitioning and transcription of broadcast news data. In: International Conference on Speech and Language Processing, Sydney, Australia, vol. 4, pp. 1335–1338 (December 1998)
Tritschler, A., Gopinath, R.: Improved speaker segmentation and segments clustering using the Bayesian Information Criterion. In: Proc. of Eurospeech, pp. 679–682 (1999)
Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995)
Fiscus, J.: Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation Plan (v2) (2006), http://www.nist.gov/speech/tests/rt/rt2006/spring/docs/rt06s-meeting-eval-plan-V2.pdf
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rentzeperis, E., Stergiou, A., Boukis, C., Pnevmatikakis, A., Polymenakos, L.C. (2006). The 2006 Athens Information Technology Speech Activity Detection and Speaker Diarization Systems. In: Renals, S., Bengio, S., Fiscus, J.G. (eds) Machine Learning for Multimodal Interaction. MLMI 2006. Lecture Notes in Computer Science, vol 4299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11965152_34
Download citation
DOI: https://doi.org/10.1007/11965152_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69267-6
Online ISBN: 978-3-540-69268-3
eBook Packages: Computer ScienceComputer Science (R0)