Abstract
Speech classification is widely used in many speech-related applications. However, the complexity of speech classification tasks often exceeds the expertise of non-experts, so off-the-shelf speech classification methods are urgently needed. Recently, automatic speech classification (AutoSpeech), which requires no human intervention, has attracted increasing attention. A practical AutoSpeech solution should be general and able to handle classification tasks from different domains automatically. Moreover, it should improve not only the final performance but also the any-time performance, especially when the time budget is limited. To address these issues, we propose Progressive AutoSpeech, a three-stage any-time learning framework for automatic speech classification under a given time budget. Progressive AutoSpeech consists of a fast stage, an enhancement stage, and an exploration stage, each using different models and features to ensure generalization. Additionally, we automatically construct ensembles of the top-k prediction results to improve robustness. Experimental results show that Progressive AutoSpeech is effective and efficient across a wide range of speech classification tasks and achieves the best area under the learning curve (ALC) score.
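To make the any-time setting concrete, below is a minimal Python sketch of how a budgeted three-stage loop with top-k ensembling and a simplified ALC metric could be organized. It is an illustration under our own assumptions, not the authors' implementation; the names `Stage`, `Candidate`, `fit_and_score`, `predict`, and the reference time `t0` are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

import numpy as np

@dataclass
class Candidate:
    """One model/feature combination tried within a stage (hypothetical)."""
    name: str
    fit_and_score: Callable[[], float]    # trains the model, returns a validation score
    predict: Callable[[], np.ndarray]     # class-probability predictions on the test set

@dataclass
class Stage:
    name: str                             # e.g. "fast", "enhancement", "exploration"
    candidates: List[Candidate]

def any_time_loop(stages: List[Stage], time_budget: float, k: int = 3) -> Optional[np.ndarray]:
    """Run cheap candidates first and expensive ones later, keeping a
    top-k ensemble prediction ready whenever the budget runs out."""
    start = time.time()
    history: List[Tuple[float, np.ndarray]] = []    # (validation score, test predictions)
    ensemble: Optional[np.ndarray] = None
    for stage in stages:
        for cand in stage.candidates:
            if time.time() - start > time_budget:
                return ensemble                     # hand back the latest any-time prediction
            score = cand.fit_and_score()
            history.append((score, cand.predict()))
            # Average the k best predictions seen so far (simple top-k ensembling).
            top_k = sorted(history, key=lambda h: h[0], reverse=True)[:k]
            ensemble = np.mean([pred for _, pred in top_k], axis=0)
    return ensemble

def alc(times: List[float], scores: List[float], budget: float, t0: float = 60.0) -> float:
    """Simplified area under the step-wise learning curve on log-scaled time,
    in the spirit of the challenge metric; times must be sorted and within
    the budget, and t0 is an assumed reference time."""
    warp = lambda t: np.log1p(t / t0) / np.log1p(budget / t0)
    ts = [warp(t) for t in times] + [1.0]
    return float(sum(scores[i] * (ts[i + 1] - ts[i]) for i in range(len(scores))))
```

Running cheap candidates first guarantees that a usable prediction exists almost immediately, which is what an ALC-style metric rewards: two pipelines with the same final accuracy can differ widely in ALC if one produces good predictions earlier in the budget.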
Notes
1. Progressive AutoSpeech won first place in the NeurIPS 2019 AutoSpeech challenge and second place in the Interspeech 2020 AutoSpeech challenge.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (U1811461), National Key R&D Program of China (2019YFC1711000), and Collaborative Innovation Center of Novel Software Technology and Industrialization.