
Enabling Spoken Dialogue Systems for Low-Resourced Languages—End-to-End Dialect Recognition for North Sami

  • Conference paper

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 579)

Abstract

In this paper, we tackle the challenge of dialect identification with deep learning for under-resourced languages. Recent advances in spoken dialogue technology have been driven largely by the availability of big corpora, whereas our goal is a spoken interactive application for North Sami, which is classified as one of the less-resourced languages spoken in Northern Europe. North Sami has several dialects and variants, influenced by the majority languages of the areas in which it is spoken: Finnish and Norwegian. To provide reliable and accurate speech components for an interactive system, it is important to recognize whether a speaker has a Finnish or a Norwegian accent. Conventional approaches compute universal statistical models, which require a large amount of data to form reliable statistics and are therefore vulnerable in small-data settings where only a limited number of utterances and speakers is available. In this paper we discuss dialect and accent recognition in an under-resourced context and focus on training an attentive network that leverages unlabeled data in a semi-supervised scenario for robust feature learning. We validate our approach on two DigiSami datasets: a conversational corpus and a read-speech corpus.
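The attentive network mentioned in the abstract rests on a simple mechanism: instead of pooling frame-level acoustic features with a uniform average, a learned attention vector scores each frame and the utterance embedding is the score-weighted average. The sketch below illustrates only that pooling step; the feature dimension, utterance length, and the random attention vector are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def attentive_pooling(frames, w):
    """Collapse frame-level features (T, D) into one utterance
    embedding (D,) using softmax attention weights from vector w (D,)."""
    scores = frames @ w                       # (T,) relevance score per frame
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # weights over time sum to 1
    return weights @ frames                   # (D,) weighted average of frames

# Hypothetical 40-dim frame features for a 200-frame utterance
frames = rng.normal(size=(200, 40))
w = rng.normal(size=40)
embedding = attentive_pooling(frames, w)
print(embedding.shape)  # (40,)
```

In a semi-supervised setup such as the one the abstract describes, an embedding like this would feed a dialect classifier trained on the labeled utterances, while the shared feature layers also learn from the unlabeled ones.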



Acknowledgements

The paper is partially based on results obtained from the Academy of Finland project Fenno-Ugric Digital Citizens (grant no. 270082) and the Future AI and Robot Technology Research and Development project commissioned by the New Energy and Industrial Technology Development Organization (NEDO) in Japan.

The research was partially funded by the Academy of Finland (grant no. 313970) and Finnish Scientific Advisory Board for Defense (MATINE) project no. 2500 M-0106. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information

Correspondence to Trung Ngo Trong.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Trong, T.N., Jokinen, K., Hautamäki, V. (2019). Enabling Spoken Dialogue Systems for Low-Resourced Languages—End-to-End Dialect Recognition for North Sami. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_19
