Research article, DOI: 10.1145/3486001.3486235

Few shot learning for cross-lingual isolated word recognition

Published: 22 October 2021

Abstract

We address the problem of low-resource machine learning in the form of few-shot learning (FSL) applied to word recognition in both mono-lingual and cross-lingual settings. Recently, we proposed an adaptation of an FSL framework, matching networks (MN), to a suite of speech recognition tasks such as multi-speaker small-to-medium vocabulary word recognition and frame-wise phoneme recognition under mel-spectrogram and single-frame feature representations. In this paper, we extend this FSL adaptation of MN to multi-speaker isolated word recognition (IWR), in a framework termed MN-IWR. The IWR task is set in a ‘command-and-control’ (C&C) scenario, which requires only very few examples (e.g. up to 20) for a target IWR classification task, with vocabularies defined dynamically. Moreover, the proposed MN-IWR framework addresses cross-domain and cross-lingual settings, defined as follows: a model is trained on a possibly large set of words in a source language and used for inference on a cross-domain task (a vocabulary of words different from the training vocabulary) or a cross-lingual task (a vocabulary of words from a target language different from the source language). In this work, we present the main formulation of the MN-IWR framework and its adaptation from source to target tasks. We give results on the TIMIT vocabulary of words in a mono-lingual setting and on English, Kannada and Tamil words in cross-lingual settings, and report very high performance of the proposed MN-IWR FSL paradigm relative to conventional IWR classification without the FSL advantage of the MN formulation.
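To make the few-shot inference step concrete, here is a minimal sketch of how a matching-network style classifier can label a query word from a small support set. It is not the MN-IWR implementation. It assumes that each utterance has already been mapped to a fixed-length embedding by some trained encoder, and that attention is a softmax over cosine similarities, as in the generic matching-network formulation. The function names, the two-dimensional toy embeddings and the two-word vocabulary are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarities(query, support):
    """Cosine similarity between one query vector and each support vector."""
    q = query / (np.linalg.norm(query) + 1e-12)
    s = support / (np.linalg.norm(support, axis=1, keepdims=True) + 1e-12)
    return s @ q


def matching_network_predict(query_emb, support_embs, support_labels, num_classes):
    """Return class probabilities for one query given a labelled support set.

    Each support example votes for its own label, weighted by a softmax
    over its cosine similarity to the query (attention over the support set).
    """
    sims = cosine_similarities(query_emb, support_embs)
    weights = np.exp(sims - sims.max())      # numerically stable softmax
    weights /= weights.sum()
    one_hot = np.eye(num_classes)[support_labels]
    return weights @ one_hot                 # attention-weighted label average


# Hypothetical 2-word target vocabulary with 2 support examples per word.
# The embeddings are toy stand-ins for the output of a trained encoder.
vocabulary = ["start", "stop"]
support_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([0.8, 0.2])

probs = matching_network_predict(query_emb, support_embs, support_labels, len(vocabulary))
predicted_word = vocabulary[int(np.argmax(probs))]
print(predicted_word, probs)
```

Because the target vocabulary enters only through the support set, changing the vocabulary (or the language) amounts to swapping in new support examples. The encoder itself is not retrained in this sketch.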

Cited By

  • (2022) Few-shot learning for E2E speech recognition: architectural variants for support set generation. 2022 30th European Signal Processing Conference (EUSIPCO), 444–448. https://doi.org/10.23919/EUSIPCO55093.2022.9909613. Online publication date: 29-Aug-2022

Published In

AIMLSystems '21: Proceedings of the First International Conference on AI-ML Systems
October 2021
170 pages
ISBN: 9781450385947
DOI: 10.1145/3486001

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. few-shot learning
  2. isolated word recognition
  3. matching networks
  4. spoken term classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

AIMLSystems 2021
