Abstract
In this paper, we tackle the challenge of dialect identification with deep learning for under-resourced languages. Recent advances in spoken dialogue technology have been driven largely by the availability of large corpora, whereas our goal is a spoken interactive application for North Sami, one of the less-resourced languages spoken in Northern Europe. North Sami has several dialects and variants that are influenced by the majority languages of the areas in which it is spoken, Finnish and Norwegian. To provide reliable and accurate speech components for an interactive system, it is important to recognize whether a speaker has a Finnish or a Norwegian accent. Conventional approaches compute universal statistical models that require a large amount of data to form reliable statistics, and are thus vulnerable in small-data settings where only a limited number of utterances and speakers is available. In this paper we discuss dialect and accent recognition in an under-resourced context, focusing on training an attentive network that leverages unlabeled data in a semi-supervised scenario for robust feature learning. We validate our approach on two DigiSami datasets: a conversational corpus and a read-speech corpus.
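The core idea of an attentive network for utterance-level dialect recognition is to let the model learn which frames of a variable-length utterance matter most, pooling frame-level features into a single utterance embedding via softmax attention weights. The paper does not publish its exact architecture here, so the sketch below is only a minimal illustration of that pooling step: `frames` and the scoring vector `w` are toy placeholders, and a real system would learn `w` jointly with the rest of the network.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pooling(frames, w):
    """Collapse a variable-length sequence of frame features into one
    utterance-level vector using attention weights.

    frames: list of per-frame feature vectors (e.g. MFCCs or network
            activations); w: a scoring vector (learned in a real model,
    a hand-picked toy value here).
    """
    # One relevance score per frame: dot product with the scoring vector.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    # Normalize scores to attention weights that sum to 1.
    alphas = softmax(scores)
    # Weighted sum of frames -> fixed-size utterance embedding.
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]

# Toy usage: two 2-dimensional frames; w makes the first frame score higher,
# so it dominates the pooled embedding.
pooled = attentive_pooling([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

The resulting fixed-size embedding can then be fed to a classifier over dialect labels; because the pooling is differentiable, the attention weights can be trained end-to-end, including on unlabeled utterances in a semi-supervised setup.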
Acknowledgements
The paper is partially based on results obtained from the Academy of Finland project Fenno-Ugric Digital Citizens (grant no. 270082) and the Future AI and Robot Technology Research and Development project commissioned by the New Energy and Industrial Technology Development Organization (NEDO) in Japan.
The research was partially funded by the Academy of Finland (grant no. 313970) and Finnish Scientific Advisory Board for Defense (MATINE) project no. 2500 M-0106. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Trong, T.N., Jokinen, K., Hautamäki, V. (2019). Enabling Spoken Dialogue Systems for Low-Resourced Languages—End-to-End Dialect Recognition for North Sami. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9442-3
Online ISBN: 978-981-13-9443-0
eBook Packages: Literature, Cultural and Media Studies (R0)