Speaker Diarization: A Top-Down Approach Using Syllabic Phonology

Edwards, Erik; Robinson, Amanda; Sadoughi, Najmeh; Finley, Greg P.; Korenevsky, Maxim; Brenndoerfer, Michael; Axtmann, Nico; Miller, Mark; Suendermann-Oeft, David

doi:10.1007/978-3-319-99579-3_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

A top-down approach to speaker diarization is developed using a modified Baum-Welch algorithm. The HMM states combine phonemes according to structural positions under syllabic phonological theory. By nature of the structural phonology, there are at most 16 states, and the transition matrix is sparse, allowing efficient decoding to structural phones. This addresses the issue of phoneme specificity in speaker diarization – that speaker similarities/differences are confounded by phonetic similarities/differences. We address this here without the expensive use of a complete set of individual phonemes. The voice activity detection (VAD) issue is likewise addressed, giving a new approach to VAD.
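
To illustrate why a small state set with a sparse transition matrix decodes cheaply, here is a minimal sketch of Viterbi decoding that visits only the allowed arcs. It is not the authors' implementation: the four state names, the transition probabilities, and the `viterbi` and `favor` helpers are all hypothetical, chosen only for illustration. The abstract does not list the structural states (it says only that there are at most 16), and the paper's modified Baum-Welch training is not shown here.

```python
import math

# Hypothetical structural states (assumption; not the paper's inventory).
STATES = ["sil", "onset", "nucleus", "coda"]

# Sparse transitions: only the arcs listed here are allowed.
# Probabilities are illustrative, not estimated from data.
TRANS = {
    "sil":     {"sil": 0.6, "onset": 0.3, "nucleus": 0.1},
    "onset":   {"onset": 0.5, "nucleus": 0.5},
    "nucleus": {"nucleus": 0.5, "coda": 0.2, "onset": 0.2, "sil": 0.1},
    "coda":    {"coda": 0.4, "onset": 0.3, "nucleus": 0.1, "sil": 0.2},
}
START = {"sil": 1.0}  # assume every recording starts in silence


def viterbi(frame_loglik):
    """Return the best state path.

    frame_loglik is a list of dicts, one per frame, mapping each state
    to the log-likelihood of that frame under the state.
    """
    neg_inf = -math.inf
    score = {s: neg_inf for s in STATES}
    for s, p in START.items():
        score[s] = math.log(p) + frame_loglik[0][s]
    backpointers = []
    for obs in frame_loglik[1:]:
        new_score = {s: neg_inf for s in STATES}
        ptr = {}
        # Iterate over allowed arcs only, so the cost per frame scales
        # with the number of arcs rather than the number of states squared.
        for src, arcs in TRANS.items():
            if score[src] == neg_inf:
                continue
            for dst, p in arcs.items():
                cand = score[src] + math.log(p) + obs[dst]
                if cand > new_score[dst]:
                    new_score[dst] = cand
                    ptr[dst] = src
        backpointers.append(ptr)
        score = new_score
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def favor(state, penalty=-10.0):
    """Toy frame scores: 0 for the favored state, a penalty for the rest."""
    return {s: (0.0 if s == state else penalty) for s in STATES}


frames = [favor(s) for s in ["sil", "onset", "nucleus", "coda", "sil"]]
print(viterbi(frames))
```

In a real system the per-frame scores would come from acoustic models rather than the toy `favor` helper, and the transition table would be derived from the phonological theory the paper builds on.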

Author information

Authors and Affiliations

EMR.AI Inc., San Francisco, CA, USA
Erik Edwards, Amanda Robinson, Najmeh Sadoughi, Greg P. Finley, Maxim Korenevsky, Mark Miller & David Suendermann-Oeft
University of California Berkeley, Berkeley, CA, USA
Michael Brenndoerfer
DHBW, Karlsruhe, Germany
Nico Axtmann

Corresponding author

Correspondence to Erik Edwards.

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Edwards, E. et al. (2018). Speaker Diarization: A Top-Down Approach Using Syllabic Phonology. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_14

DOI: https://doi.org/10.1007/978-3-319-99579-3_14
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)