Abstract
A top-down approach to speaker diarization is developed using a modified Baum-Welch algorithm. The HMM states combine phonemes according to structural positions under syllabic phonological theory. By nature of the structural phonology, there are at most 16 states, and the transition matrix is sparse, allowing efficient decoding to structural phones. This addresses the issue of phoneme specificity in speaker diarization – that speaker similarities/differences are confounded by phonetic similarities/differences. We address this here without the expensive use of a complete set of individual phonemes. The voice activity detection (VAD) issue is likewise addressed, giving a new approach to VAD.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Anguera Miró, X.: Robust speaker diarization for meetings. Ph.D. thesis, Univ. Politècnica de Catalunya (2006)
Anguera Miró, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012)
Anguera, X., Wooters, C., Peskin, B., Aguiló, M.: Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 402–414. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482_34
Bozonnet, S., Vipperla, R., Evans, N.: Phone adaptive training for speaker diarization. In: Proceedings of INTERSPEECH, pp. 494–497. ISCA (2012)
Chen, I.F., Cheng, S.S., Wang, H.M.: Phonetic subspace mixture model for speaker diarization. In: Proceedings of INTERSPEECH, pp. 2298–2301. ISCA (2010)
Cooper, F., Delattre, P., Liberman, A., Borst, J., Gerstman, L.: Some experiments on the perception of synthetic speech sounds. J. Acoust. Soc. Am. 24(6), 597–606 (1952)
Edwards, E., et al.: Medical speech recognition: reaching parity with humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 512–524. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_51
Fakotakis, N., Tsopanoglou, A., Kokkinakis, G.: A text-independent speaker recognition system based on vowel spotting. Speech Commun. 12(1), 57–68 (1993)
Finley, G., et al.: An automated medical scribe for documenting clinical encounters. In: Proceedings of NAACL. ACL (2018)
Fudge, E.: Branching structure within the syllable. J. Linguist. 23(2), 359–377 (1987)
Fujimura, O.: Syllable as a unit of speech recognition. IEEE Trans. Acoust. 23(1), 82–87 (1975)
Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M.: Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proceedings of DARPA Speech Recognition Workshop, pp. 56–63. DARPA (1997)
Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S.: A pitch extraction algorithm tuned for automatic speech recognition. In: Proceedings of ICASSP, pp. 2494–2498. IEEE (2014)
Gish, H., Siu, M.H., Rohlicek, J.: Segregation of speakers for speech recognition and speaker identification. In: Proceedings of ICASSP, vol. 2, pp. 873–876. IEEE (1991)
Goldsmith, J.: The syllable. In: Goldsmith, J., Riggle, J., Yu, A. (eds.) The Handbook of Phonological Theory, 2nd edn., pp. 165–196. Wiley, Malden (2011)
Guest, E.: A History of English Rhythms. W. Pickering, London (1838)
Hansen, E., Slyh, R., Anderson, T.: Speaker recognition using phoneme-specific GMMs. In: Proceedings of Odyssey Workshop, pp. 179–184. ISCA (2004)
Hsieh, C.H., Wu, C.H., Shen, H.P.: Adaptive decision tree-based phone cluster models for speaker clustering. In: Proceedings of INTERSPEECH, pp. 861–864. ISCA (2008)
Kessler, B., Treiman, R.: Syllable structure and the distribution of phonemes in English syllables. J. Mem. Lang. 37(3), 295–311 (1997)
Kozhevnikov, V., Chistovich, L.: Speech: articulation and perception. Translation JPRS 30543, Joint Public Research Service, U.S. Department of Commerce (1965)
Levinson, S., Rabiner, L., Sondhi, M.: An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62(4), 1035–1074 (1983)
Liberman, A., Ingemann, F., Lisker, L., Delattre, P., Cooper, F.: Minimal rules for synthesizing speech. J. Acoust. Soc. Am. 31(11), 1490–1499 (1959)
Martin, T., Wong, E., Baker, B., Mason, M., Sridharan, S.: Pitch and energy trajectory modelling in a syllable length temporal framework for language identification. In: Proceedings of Odyssey Workshop, pp. 289–296. ISCA (2004)
Mary, L., Yegnanarayana, B.: Extraction and representation of prosodic features for language and speaker recognition. Speech Commun. 50(10), 782–796 (2008)
Mitford, W.: An Inquiry into the Principles of Harmony in Language, and of the Mechanism of Verse, Modern and Antient, 2nd edn. L. Hansard, London (1804)
Olson, H., Belar, H.: Phonetic typewriter. J. Acoust. Soc. Am. 28(6), 1072–1081 (1956)
Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of ICASSP, pp. 5206–5210. IEEE (2015)
Rudnicky, A.: CMUdict 0.7b: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2015). https://github.com/Alexir/CMUdict
Sadjadi, S., Hansen, J.: Unsupervised speech activity detection using voicing measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200 (2013)
Saussure, F.: Cours de linguistique générale. Payot, Lausanne, Paris (1916)
Schindler, C., Draxler, C.: Using spectral moments as a speaker specific feature in nasals and fricatives. In: Proceedings of INTERSPEECH, pp. 2793–2796. ISCA (2013)
Selkirk, E.: The syllable. In: van der Hulst, H., Smith, N. (eds.) The Structure of Phonological Representations, vol. 2, pp. 337–384. Foris, Dordrecht (1982)
Shoup, J.: Phonological aspects of speech recognition. In: Lea, W. (ed.) Trends in Speech Recognition, pp. 125–138. Prentice-Hall, Englewood Cliffs (1980)
Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., Stolcke, A.: Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3–4), 455–472 (2005)
Siu, M.H., Yu, G., Gish, H.: An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers. In: Proceedings of ICASSP, vol. 2, pp. 189–192. IEEE (1992)
Soldi, G., Bozonnet, S., Alegre, F., Beaugeant, C., Evans, N.: Short-duration speaker modelling with phone adaptive training. In: Proceedings of Odyssey Workshop, pp. 208–215. ISCA (2014)
Sugiyama, M., Murakami, J., Watanabe, H.: Speech segmentation and clustering based on speaker features. In: Proceedings of ICASSP, vol. 2, pp. 395–398. IEEE (1993)
Wallis, J.: Grammatica linguae Anglicanae. L. Lichfield, Oxford (1674)
Wang, G., Wu, X., Zheng, T.: Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech. In: Proceedings of INTERSPEECH, pp. 1457–1460. ISCA (2010)
Wilcox, L., Chen, F., Kimber, D., Balasubramanian, V.: Segmentation of speech using speaker identification. In: Proceedings of ICASSP, vol. 1, pp. 161–164. IEEE (1994)
Yamada, M., Pezeshki, A., Azimi-Sadjadi, M.: Relation between kernel CCA and kernel FDA. In: Proceedings of IJCNN, pp. 226–231. IEEE (2005)
Yella, S., Motlícek, P., Bourlard, H.: Phoneme background model for information bottleneck based speaker diarization. In: Proceedings of INTERSPEECH, pp. 597–601. ISCA (2014)
Zibert, J., Mihelic, F.: Prosodic and phonetic features for speaker clustering in speaker diarization systems. In: Proceedings of INTERSPEECH, pp. 1033–1036. ISCA (2011)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Edwards, E. et al. (2018). Speaker Diarization: A Top-Down Approach Using Syllabic Phonology. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science(), vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_14
Download citation
DOI: https://doi.org/10.1007/978-3-319-99579-3_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer ScienceComputer Science (R0)