Abstract
In this paper, the authors present the UPC speaker diarization system for the NIST Rich Transcription Evaluation (RT07s) [1], conducted on the conference environment. The system is based on the ICSI RT06s system, which employs agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to decide which pairs of clusters to merge and when to stop merging [2]. This is the first participation of the UPC in the RT Speaker Diarization Evaluation, and the purpose of this work has been to consolidate a baseline system that can be used for further research in the field of diarization. Two modules have been introduced ahead of the diarization system: a Speech/Non-Speech detection module from UPC based on a Support Vector Machine, and Wiener filtering taken from an implementation of the QIO front-end. In the speech parameterization, a Frequency Filtering (FF) of the filter-bank energies is applied instead of the classical Discrete Cosine Transform used in Mel-Cepstrum analysis. In addition, small changes are introduced in the complexity selection algorithm, together with a new post-processing technique that processes the shortest clusters at the end of each Viterbi segmentation.
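As an illustration of the Frequency Filtering step mentioned in the abstract, the following is a minimal sketch that assumes the commonly used second-order filter H(z) = z - z^{-1} applied along the frequency-band axis of the log filter-bank energies. The abstract does not state which filter or which edge handling the UPC system uses, so the function name `frequency_filter`, the zero padding at the band edges, and the input layout (frames by bands) are assumptions of this example rather than details of the evaluated system.

```python
import numpy as np


def frequency_filter(log_fbe: np.ndarray) -> np.ndarray:
    """Filter log filter-bank energies along the band axis with H(z) = z - z^{-1}.

    log_fbe has shape (num_frames, num_bands). Bands beyond either edge are
    treated as zero, which is an assumption of this sketch.
    """
    # Add one zero-valued band on each side of the band axis.
    padded = np.pad(log_fbe, ((0, 0), (1, 1)), mode="constant")
    # For each band k: output[k] = input[k + 1] - input[k - 1].
    return padded[:, 2:] - padded[:, :-2]


if __name__ == "__main__":
    # One frame with four hypothetical log filter-bank energies.
    frame = np.array([[1.0, 2.0, 4.0, 7.0]])
    print(frequency_filter(frame))
```

The output has the same number of coefficients as there are filter-bank bands, so in this sketch the filtered energies directly replace the cepstral coefficients that a Discrete Cosine Transform would otherwise produce.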
References
NIST: Rich transcription meeting recognition evaluation plan. RT-07s (2007)
Anguera, X., Wooters, C., Hernando, J.: Robust speaker diarization for meetings: ICSI RT06s evaluation system. In: ICSLP (2006)
Gauvain, J., Lamel, L., Adda, G.: Partitioning and transcription of broadcast news data. In: ICSLP, pp. 1335–1338 (1998)
Chen, S., Gopalakrishnan, P.: Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA BNTU Workshop (1998)
Gish, H., Siu, M., Rohlicek, R.: Segregation of speakers for speech recognition and speaker identification. In: ICASSP (1991)
Adami, A., et al.: Qualcomm-ICSI-OGI features for ASR. In: ICSLP, pp. 21–24 (2002)
Anguera, X.: The acoustic robust beamforming toolkit (2005)
Temko, A., Macho, D., Nadeu, C.: Enhanced SVM Training for Robust Speech Activity Detection. In: Proc. ICASSP (2007)
Nadeu, C., Paches-Leal, P., Juang, B.H.: Filtering the time sequence of spectral parameters for speech recognition. Speech Communication 22, 315–332 (1997)
Flanagan, J., Johnson, J., Kahn, R., Elko, G.: Computer-steered microphone arrays for sound transduction in large rooms. ASAJ 78(5), 1508–1518 (1985)
Knapp, C., Carter, G.: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing 24(4), 320–327 (1976)
Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
Fung, G., Mangasarian, O.: Proximal Support Vector Machine Classifiers. In: Proc. KDDM, pp. 77–86 (2001)
Lebrun, G., Charrier, C., Cardot, H.: SVM Training Time Reduction using Vector Quantization. In: Proc. ICPR, pp. 160–163 (2004)
Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions ASSP (28), 357–366 (1980)
Luque, J., Hernando, J.: Robust Speaker Identification for Meetings: UPC CLEAR-07 Meeting Room Evaluation System. In: The same book (2007)
Nadeu, C., Macho, D., Hernando, J.: Time and Frequency Filtering of Filter-Bank Energies for Robust Speech Recognition. Speech Communication 34, 93–114 (2001)
Macho, D., Nadeu, C.: On the interaction between time and frequency filtering of speech parameters for robust speech recognition. In: ICSLP, 1137 (1999)
Anguera, X., Hernando, J., Anguita, J.: Xbic: nueva medida para segmentación de locutor hacia el indexado automático de la señal de voz. JTH, 237–242 (2004)
Nadeu, C., Hernando, J., Gorricho, M.: On the Decorrelation of Filter-Bank Energies in Speech Recognition. In: EuroSpeech, vol. 20, p. 417 (1995)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luque, J., Anguera, X., Temko, A., Hernando, J. (2008). Speaker Diarization for Conference Room: The UPC RT07s Evaluation System. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT 2007, CLEAR 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_50
DOI: https://doi.org/10.1007/978-3-540-68585-2_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)