Abstract
For speech models that depend on sharing between phonological representations, an often overlooked issue is that phonological contrasts that are succinctly described language-internally by phonemes and their respective featurizations are not necessarily robust across languages. This paper extends a recently proposed method for assessing the cross-linguistic consistency of phonological features in phoneme inventories. The original method employs binary neural classifiers, trained solely on audio, for individual phonological contrasts. It cannot resolve some important phonological contrasts cross-linguistically, such as those involving retroflex consonants. We extend this approach by leveraging prior phonological knowledge during classifier training. We observe that, since phonemic descriptions are articulatory rather than acoustic, the model input space needs to be grounded in phonology to better capture phonemic correlations between the training samples. The cross-linguistic consistency of the proposed method is evaluated in a multilingual setting on held-out low-resource languages, and classification quality is reported. We observe modest gains over the baseline for difficult cases, such as cross-lingual detection of aspiration, and discuss multiple confounding factors that explain the dimensions of the difficulty for this task.
Acknowledgments.
The authors would like to thank Cibu Johny for his help with the experiments, and Işın Demirşahin and Rob Clark for fruitful discussions.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Skidmore, L., Gutkin, A. (2020). Does A Priori Phonological Knowledge Improve Cross-Lingual Robustness of Phonemic Contrasts? In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_51
DOI: https://doi.org/10.1007/978-3-030-60276-5_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)