Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

Yakovenko, Olga; Bondarenko, Ivan

doi:10.1007/978-3-030-71214-3_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1357)

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts

Abstract

For many Automatic Speech Recognition (ASR) tasks, spectrogram features give better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands, the GoogleSpeechCommands dataset. An ASR system was built on the generated features and compared to a model that uses MFCC features.
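
The abstract does not spell out the training objective, so the following is only a minimal sketch of the standard VAE loss that such a model would typically optimize: a reconstruction term plus a KL regularizer on the latent code, with sampling done through the reparameterization trick. The convolutional encoder and decoder are omitted entirely. The function names, the squared-error reconstruction term, and the `beta` weight are illustrative assumptions, not details taken from the paper; only the 13-dimensional embedding size comes from the abstract.

```python
import math
import random

LATENT_DIM = 13  # embedding size quoted in the abstract


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def squared_error(target, reconstruction):
    """Sum of squared differences over all (flattened) spectrogram bins."""
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction))


def vae_loss(target, reconstruction, mu, log_var, beta=1.0):
    """Reconstruction term plus beta-weighted KL term (beta is hypothetical)."""
    return (squared_error(target, reconstruction)
            + beta * kl_to_standard_normal(mu, log_var))


# Stand-in encoder output: a posterior equal to the prior N(0, I).
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = reparameterize(mu, log_var, rng)  # one 13-dimensional latent sample
```

In a real model, `mu` and `log_var` would be produced by the encoder from a spectrogram fragment, and `reconstruction` by the decoder from `z`; here they are plain lists so that the two loss terms can be inspected in isolation.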

References

Amodei, D., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: ICML (2016)
Dai, B., et al.: Hidden talents of the variational autoencoder (2017)
van den Oord, A., et al.: Neural discrete representation learning. In: NIPS (2017)
van den Oord, A., et al.: WaveNet: a generative model for raw audio. In: SSW (2016)
Zhu, Z., et al.: Siamese recurrent auto-encoder representation for query-by-example spoken term detection. In: Interspeech (2018)
Milde, B., Biemann, C.: Unspeech: unsupervised speech context embeddings. In: Interspeech (2018)
Chung, Y.-A., Glass, J.R.: Speech2Vec: a sequence-to-sequence framework for learning word embeddings from speech. In: Interspeech (2018)
LibriSpeech. http://www.openslr.org/12/
Google Speech Commands. https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
Audio_vae. https://github.com/nsu-ai-team/audio_vae

Author information

Authors and Affiliations

Center of Financial Technologies, Novosibirsk, Russia
Olga Yakovenko
Novosibirsk State University, Novosibirsk, Russia
Ivan Bondarenko

Corresponding author

Correspondence to Olga Yakovenko.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Germany
Wil M. P. van der Aalst
University of Ljubljana, Ljubljana, Slovenia
Vladimir Batagelj
National Research University Higher School of Economics, Perm, Russia
Alexey Buzmakov
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
University of Melbourne, Melbourne, VIC, Australia
Anna Kalenkova
Krasovskii Institute of Mathematics and Mechanics of RAS, Ekaterinburg, Russia
Michael Khachay
National Research University Higher School of Economics, Saint-Petersburg, Russia
Olessia Koltsova
University of Oslo, Oslo, Norway
Andrey Kutuzov
National Research University Higher School of Economics, Moscow, Russia
Sergei O. Kuznetsov
National Research University Higher School of Economics, Moscow, Russia
Irina A. Lomazova
Lomonosov Moscow State University, Moscow, Russia
Natalia Loukachevitch
National Research University Higher School of Economics, Moscow, Russia
Ilya Makarov
LORIA, Vandœuvre-lès-Nancy, France
Amedeo Napoli
Skolkovo Institute of Science and Technology, Moscow, Russia
Alexander Panchenko
University of Florida, Gainesville, FL, USA
Panos M. Pardalos
Università Ca’ Foscari Venezia, Venezia, Italy
Marcello Pelillo
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Kazan Federal University, Kazan, Russia
Elena Tutubalina

About this paper

Cite this paper

Yakovenko, O., Bondarenko, I. (2021). Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. In: van der Aalst, W.M.P., et al. Recent Trends in Analysis of Images, Social Networks and Texts. AIST 2020. Communications in Computer and Information Science, vol 1357. Springer, Cham. https://doi.org/10.1007/978-3-030-71214-3_10

DOI: https://doi.org/10.1007/978-3-030-71214-3_10
Published: 25 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71213-6
Online ISBN: 978-3-030-71214-3
eBook Packages: Computer Science, Computer Science (R0)
