Abstract
In this paper, we tackle the challenge of dialect identification with deep learning for under-resourced languages. Recent advances in spoken dialogue technology have been driven largely by the availability of large corpora, whereas our goal is a spoken interactive application for North Sami, one of the less-resourced languages spoken in Northern Europe. North Sami has several dialects and variants that are influenced by the majority languages of the areas in which it is spoken, Finnish and Norwegian. To provide reliable and accurate speech components for an interactive system, it is important to recognize whether a speaker has a Finnish or a Norwegian accent. Conventional approaches compute universal statistical models that require a large amount of data to form reliable statistics, and are thus vulnerable in small-data settings where only a limited number of utterances and speakers is available. In this paper we discuss dialect and accent recognition in an under-resourced context, focusing on training an attentive network that leverages unlabeled data in a semi-supervised scenario for robust feature learning. We validate our approach on two DigiSami datasets: a conversational corpus and a read-speech corpus.
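The core idea of an attentive network for utterance-level dialect recognition is to let the model learn which frames of a variable-length utterance matter most, pooling frame-level features into a single utterance embedding via softmax attention weights. The paper does not publish its exact architecture here, so the sketch below is only a minimal illustration of that pooling step: `frames` and the scoring vector `w` are toy placeholders, and a real system would learn `w` jointly with the rest of the network.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pooling(frames, w):
    """Collapse a variable-length sequence of frame features into one
    utterance-level vector using attention weights.

    frames: list of per-frame feature vectors (e.g. MFCCs or network
            activations); w: a scoring vector (learned in a real model,
    a hand-picked toy value here).
    """
    # One relevance score per frame: dot product with the scoring vector.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]
    # Normalize scores to attention weights that sum to 1.
    alphas = softmax(scores)
    # Weighted sum of frames -> fixed-size utterance embedding.
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]

# Toy usage: two 2-dimensional frames; w makes the first frame score higher,
# so it dominates the pooled embedding.
pooled = attentive_pooling([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

The resulting fixed-size embedding can then be fed to a classifier over dialect labels; because the pooling is differentiable, the attention weights can be trained end-to-end, including on unlabeled utterances in a semi-supervised setup.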
Acknowledgements
The paper is partially based on results obtained from the Academy of Finland project Fenno-Ugric Digital Citizens (grant no. 270082) and the Future AI and Robot Technology Research and Development project commissioned by the New Energy and Industrial Technology Development Organization (NEDO) in Japan.
The research was partially funded by the Academy of Finland (grant no. 313970) and Finnish Scientific Advisory Board for Defense (MATINE) project no. 2500 M-0106. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Trong, T.N., Jokinen, K., Hautamäki, V. (2019). Enabling Spoken Dialogue Systems for Low-Resourced Languages—End-to-End Dialect Recognition for North Sami. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9442-3
Online ISBN: 978-981-13-9443-0
eBook Packages: Literature, Cultural and Media Studies (R0)