DOI: 10.1145/3578741.3578833
research-article

SPCPFS: a pseudo-label filtering strategy with fusion of perplexity and confidence

Published: 6 March 2023

ABSTRACT

In pseudo-label filtering for semi-supervised Mongolian speech recognition, existing strategies cannot simultaneously guarantee both the correctness of word combinations in the pseudo-labels and the correctness of the correspondence between speech and words in the self-training set. To address this problem, we propose a pseudo-label filtering strategy that fuses perplexity and confidence, called sentence perplexity confidence. The strategy jointly evaluates the semantic coherence of pseudo-labels and the correspondence between pseudo-labels and the acoustic features of unlabeled speech, which improves the accuracy of the self-training set and thus the performance of the target speech recognition model produced by semi-supervised training. We conducted ablation and comparison experiments with sentence perplexity confidence on the Mongolian datasets IMUT-MC and IMUT-MC-SMI. The results show that sentence perplexity confidence outperforms sentence-level confidence and perplexity alone in improving the accuracy of the self-training set, and the resulting target speech recognition models achieve a WER of 14.7% and an SER of 16.1%.
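The abstract describes fusing two signals to filter pseudo-labels: a language-model perplexity that scores the semantic coherence of a pseudo-label, and an acoustic confidence that scores how well the pseudo-label matches the unlabeled speech. The excerpt does not give the exact fusion formula, so the sketch below is only illustrative: the weighted combination, the perplexity-to-score mapping, the field names, and the threshold are all assumptions, not the authors' method.

```python
import math

def sentence_perplexity_confidence(token_log_probs, lm_perplexity, alpha=0.5):
    """Illustrative fused score (NOT the paper's exact formula).

    token_log_probs: per-token log posterior probabilities from the ASR decoder.
    lm_perplexity:   language-model perplexity of the pseudo-label (>= 1).
    alpha:           assumed mixing weight between the two signals.
    """
    # Sentence-level confidence: geometric mean of token posteriors.
    confidence = math.exp(sum(token_log_probs) / len(token_log_probs))
    # Map perplexity into (0, 1]: lower perplexity -> score closer to 1.
    perplexity_score = 1.0 / (1.0 + math.log(lm_perplexity))
    return alpha * confidence + (1.0 - alpha) * perplexity_score

def filter_pseudo_labels(candidates, threshold=0.6):
    """Keep (speech, pseudo-label) pairs whose fused score passes a threshold."""
    return [
        utt for utt in candidates
        if sentence_perplexity_confidence(utt["token_log_probs"],
                                          utt["perplexity"]) >= threshold
    ]

# Hypothetical candidates: one confident, fluent hypothesis and one poor one.
candidates = [
    {"id": "utt_a", "token_log_probs": [-0.05, -0.05, -0.05], "perplexity": 2.0},
    {"id": "utt_b", "token_log_probs": [-2.0, -2.0], "perplexity": 500.0},
]
kept = filter_pseudo_labels(candidates)
# Only utt_a survives: it is both acoustically confident and low-perplexity.
```

The point of the fusion is that either signal alone can be fooled: a fluent but wrong transcript has low perplexity, and an acoustically confident but ungrammatical word sequence has high confidence; requiring both filters out each failure mode.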


Published in

MLNLP '22: Proceedings of the 2022 5th International Conference on Machine Learning and Natural Language Processing
December 2022, 406 pages
ISBN: 9781450399067
DOI: 10.1145/3578741
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

