ABSTRACT
The recognition of mixed-bandwidth audio presents a challenge for both academic and industrial settings, with potentially greater implications for the latter. In this paper, we present a unified ASR architecture for mixed-bandwidth audio recognition. We propose to use a generative adversarial network with two discriminators, enabling the system to recognize audio at mixed sampling rates while preserving ASR performance. Through adaptive joint training of the trained generator and the ASR system, performance can be further improved. We conduct experiments on the LibriSpeech dataset and demonstrate that our method successfully recognizes mixed-bandwidth audio and improves the accuracy of the ASR system by 3.65% on narrowband data. Overall, the proposed unified ASR architecture provides a promising solution for the recognition of mixed-bandwidth audio in various settings.
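The abstract does not spell out how mixed-bandwidth data is obtained. As background only, a common way to construct such data is to simulate 8 kHz narrowband audio from 16 kHz wideband recordings by decimating and resampling back; the sketch below (NumPy, all function names illustrative, not from the paper) does this naively and verifies that high-frequency content is lost:

```python
import numpy as np

def simulate_narrowband(wav_16k: np.ndarray) -> np.ndarray:
    """Simulate 8 kHz narrowband audio from a 16 kHz signal by
    decimating (keep every other sample) and linearly interpolating
    back onto the 16 kHz grid. A real pipeline would apply an
    anti-aliasing low-pass filter before decimation."""
    wav_8k = wav_16k[::2]                 # naive 2x decimation
    t_16k = np.arange(len(wav_16k))
    t_8k = np.arange(0, len(wav_16k), 2)
    return np.interp(t_16k, t_8k, wav_8k) # back to 16 kHz length

def band_energy(wav: np.ndarray, sr: int, lo_hz: float) -> float:
    """Spectral energy above lo_hz, used to check that the simulated
    narrowband signal has lost its high-frequency content."""
    spec = np.abs(np.fft.rfft(wav)) ** 2
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    return float(spec[freqs >= lo_hz].sum())

# Example: one second at 16 kHz with 1 kHz and 6 kHz components.
sr = 16000
t = np.arange(sr) / sr
wide = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)
narrow = simulate_narrowband(wide)
```

Paired wideband/narrowband data of this kind is what lets a GAN generator be trained to map narrowband features toward wideband ones, with discriminators judging the result.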
Index Terms
- A Unified Mixed-Bandwidth ASR Framework with Generative Adversarial Network