Abstract
A synthetic corpus of dialogs was constructed from the LibriSpeech corpus and is made freely available for diarization research. It includes over 90 h of training data and over 9 h each of development and test data. Both 2-person and 3-person dialogs are included, with and without overlap. Timing information is provided in several formats and covers not only speaker segmentations but also phoneme segmentations. As such, it is a useful starting point for diarization system development in general, and for early-stage development in particular.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Edwards, E. et al. (2018). A Free Synthetic Corpus for Speaker Diarization Research. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_13
DOI: https://doi.org/10.1007/978-3-319-99579-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)