ABSTRACT
In pseudo-label filtering for semi-supervised Mongolian speech recognition, no single filtering criterion can simultaneously guarantee both the correctness of word combinations in the self-training set and the correspondence between speech and words. To address this problem, we propose a pseudo-label filtering strategy that fuses perplexity and confidence, called sentence perplexity confidence. The strategy jointly evaluates the semantic coherence of pseudo-labels and the correspondence between pseudo-labels and the acoustic features of unlabeled speech, improving the accuracy of the self-training set and, in turn, the performance of the target speech recognition model produced by semi-supervised training. We conducted ablation and comparison experiments with sentence perplexity confidence on the Mongolian datasets IMUT-MC and IMUT-MC-SMI. The results show that sentence perplexity confidence surpasses both sentence-level confidence and perplexity in its ability to improve the accuracy of the self-training set, and the resulting target speech recognition models achieve a WER of 14.7% and an SER of 16.1%.
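The abstract does not give the paper's exact fusion formula, but the sketch below illustrates one plausible way to combine sentence-level confidence (measuring speech-to-word correspondence) with language-model perplexity (measuring word-combination plausibility) when filtering pseudo-labels. All names (`PseudoLabeled`, `fused_score`, `lm_log_prob`), the fusion weight `alpha`, and the threshold `tau` are illustrative assumptions, not the authors' formulation.

```python
# A minimal sketch of perplexity-confidence fusion for pseudo-label filtering.
# The fusion rule and all hyperparameters here are assumptions for illustration.
import math
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PseudoLabeled:
    utt_id: str
    hypothesis: List[str]          # decoded token sequence (the pseudo-label)
    token_log_probs: List[float]   # per-token log posteriors from the ASR decoder

def sentence_confidence(sample: PseudoLabeled) -> float:
    """Sentence-level confidence: geometric-mean per-token posterior.
    High values suggest the pseudo-label matches the acoustic features."""
    avg_log_prob = sum(sample.token_log_probs) / len(sample.token_log_probs)
    return math.exp(avg_log_prob)  # in (0, 1]

def sentence_perplexity(hypothesis: List[str],
                        lm_log_prob: Callable[[List[str]], float]) -> float:
    """Perplexity of the pseudo-label under an external language model.
    Low values suggest the word combinations are plausible sentences."""
    return math.exp(-lm_log_prob(hypothesis) / len(hypothesis))

def fused_score(sample: PseudoLabeled,
                lm_log_prob: Callable[[List[str]], float],
                alpha: float = 0.5) -> float:
    """One possible fusion: interpolate confidence with a squashed
    inverse-perplexity term, so both criteria must be met for a high score."""
    conf = sentence_confidence(sample)
    ppl = sentence_perplexity(sample.hypothesis, lm_log_prob)
    lm_term = 1.0 / (1.0 + math.log(ppl))  # maps ppl in [1, inf) to (0, 1]
    return alpha * conf + (1.0 - alpha) * lm_term

def filter_pseudo_labels(samples: List[PseudoLabeled],
                         lm_log_prob: Callable[[List[str]], float],
                         alpha: float = 0.5, tau: float = 0.7) -> List[PseudoLabeled]:
    """Keep only utterances whose fused score clears the threshold tau;
    the survivors form the self-training set."""
    return [s for s in samples if fused_score(s, lm_log_prob, alpha) >= tau]
```

In this sketch, `lm_log_prob` would wrap any Mongolian language model that returns the total log-probability of a token sequence, and `tau` and `alpha` would be tuned on a held-out labeled set rather than fixed in advance.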
Index Terms
- SPCPFS: a pseudo-label filtering strategy with fusion of perplexity and confidence