
CycleGAN-Based Speech Mode Transformation Model for Robust Multilingual ASR


A Correction to this article was published on 04 June 2022


Abstract

In this work, we propose a multilingual speech mode transformation (MSMT) model as a front end that improves the robustness of speech recognition by transforming the characteristics of conversation and extempore speech into those of read speech. The proposed front end comprises a multilingual speech mode classification (MSMC) system and mode-specific MSMT models. The mode-specific MSMT models are developed using a cycle-consistent generative adversarial network (CycleGAN) variant named weighted CycleGAN (WeCycleGAN), in which the generator loss is multiplied by a relevant weight to learn a strong mapping from conversation and extempore speech to read speech while preserving the linguistic content. The proposed model is trained on non-parallel speech samples of the three modes using adversarial networks, so it learns a mapping between two distributions (extempore vs. read, or conversation vs. read) rather than a direct mapping between parallel speech samples. Experiments are conducted on a non-parallel speech dataset of conversation, extempore, and read modes from four Indian languages, namely Bengali, Odia, Telugu, and Kannada. The objective evaluation shows that the transformed feature vectors are highly correlated with the target feature vectors, and the subjective evaluation shows that the quality of the transformed speech is close to that of the target speech mode. The significance of the proposed MSMT model is demonstrated on a speech recognition system: recognition performance improves significantly when the MSMT front end is used.
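
To make the role of the weighted generator loss concrete, the following Python (PyTorch) sketch illustrates one way a WeCycleGAN-style generator objective could be computed: adversarial terms scaled by a weight so that the mapping toward read-mode features is learned more strongly, plus a cycle-consistency term that encourages the linguistic content to be preserved. The feature dimension, the weight value, the cycle-consistency coefficient, the least-squares GAN loss, and the tiny frame-wise networks are illustrative assumptions, not the authors' actual configuration.

    # Minimal sketch of a weighted CycleGAN (WeCycleGAN) generator objective for
    # speech-mode transformation. Network sizes, feature dimension, and the
    # weight values are illustrative assumptions, not the paper's settings.
    import torch
    import torch.nn as nn

    FEAT_DIM = 24        # spectral features per frame (assumed)
    GEN_WEIGHT = 2.0     # weight applied to the generator adversarial loss (assumed)
    LAMBDA_CYC = 10.0    # cycle-consistency coefficient (assumed)

    def make_generator():
        # Tiny frame-wise generator, used only to keep the sketch runnable.
        return nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, FEAT_DIM))

    def make_discriminator():
        return nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

    G_conv2read = make_generator()   # conversation -> read mapping
    G_read2conv = make_generator()   # read -> conversation mapping
    D_read = make_discriminator()    # real vs. generated read-mode features
    D_conv = make_discriminator()    # real vs. generated conversation-mode features

    adv_loss = nn.MSELoss()          # least-squares GAN loss (assumed)
    cyc_loss = nn.L1Loss()

    def generator_loss(conv_feats, read_feats):
        """Weighted adversarial + cycle-consistency loss for both generators."""
        fake_read = G_conv2read(conv_feats)
        fake_conv = G_read2conv(read_feats)

        # Adversarial terms, scaled by GEN_WEIGHT so the mapping toward the
        # target mode is learned more strongly (the "weighted" part).
        pred_read = D_read(fake_read)
        pred_conv = D_conv(fake_conv)
        l_adv = GEN_WEIGHT * (adv_loss(pred_read, torch.ones_like(pred_read)) +
                              adv_loss(pred_conv, torch.ones_like(pred_conv)))

        # Cycle-consistency: conv -> read -> conv and read -> conv -> read
        # should reconstruct the inputs, preserving linguistic content.
        l_cyc = cyc_loss(G_read2conv(fake_read), conv_feats) + \
                cyc_loss(G_conv2read(fake_conv), read_feats)

        return l_adv + LAMBDA_CYC * l_cyc

    # Usage with stand-in feature batches (non-parallel, so no frame alignment needed).
    conv_batch = torch.randn(32, FEAT_DIM)
    read_batch = torch.randn(32, FEAT_DIM)
    print(generator_loss(conv_batch, read_batch).item())

In practice the generators and discriminators would be trained alternately on spectral features extracted from the speech signal, with corresponding discriminator losses; those details are beyond this sketch.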


Data Availability

The datasets analyzed during the current study were collected under a consortium project sponsored by MEITY (Ministry of Electronics and Information Technology) and are therefore held by MEITY. The datasets may be provided by MEITY upon reasonable request through the proper channel.




Author information


Corresponding author

Correspondence to Kumud Tripathi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: The missing unnumbered equation before equation (3) is now inserted.


About this article


Cite this article

Tripathi, K., Rao, K.S. CycleGAN-Based Speech Mode Transformation Model for Robust Multilingual ASR. Circuits Syst Signal Process 41, 5283–5305 (2022). https://doi.org/10.1007/s00034-022-02008-4

