Abstract
Perceptual audio coding is widely and successfully applied for audio compression. However, perceptual audio coders may inject audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem that aims to recover a high-quality audio signal, close to the uncompressed original, from a low-quality encoded version. In this paper, we propose a novel data-driven method for audio restoration, in which temporal and spectral dynamics are explicitly captured by a deep time-frequency LSTM (TF-LSTM) recurrent neural network. Leveraging the captured temporal and spectral information facilitates the task of learning a nonlinear mapping from the magnitude spectrogram of low-quality audio to that of high-quality audio. The proposed method substantially attenuates audible artifacts caused by codecs and is conceptually straightforward. Extensive experiments were carried out, and the results show that for low-bitrate audio at 96 kbps (mono), 64 kbps (mono), and 96 kbps (stereo), the proposed method efficiently generates improved-quality audio that is competitive with, or even superior in perceptual quality to, the audio produced by other state-of-the-art deep neural network methods and the LAME-MP3 codec.
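The abstract's core idea, learning a spectrogram-to-spectrogram mapping with LSTMs that scan both the frequency axis (spectral dynamics) and the time axis (temporal dynamics), can be sketched with a toy, untrained NumPy model. Everything below is an illustrative assumption rather than the paper's implementation: the dimensions, weight initializations, and function names are made up, and a real TF-LSTM would keep per-bin frequency outputs and be trained on paired low-/high-quality magnitude spectrograms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step: gate pre-activations z = W x + U h + b, split i, f, o, g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H]); f = sigmoid(z[H:2 * H])
    o = sigmoid(z[2 * H:3 * H]); g = np.tanh(z[3 * H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(seq, W, U, b, H):
    """Scan an LSTM along the first axis of `seq`; return stacked hidden states."""
    h = np.zeros(H); c = np.zeros(H)
    out = []
    for x in seq:
        h, c = lstm_cell(x, h, c, W, U, b)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, F, H = 10, 8, 16  # frames, frequency bins, hidden size (toy values)

def init(in_dim, H):
    """Random, untrained weights (W, U, b) for a single LSTM layer."""
    return (rng.normal(0, 0.1, (4 * H, in_dim)),
            rng.normal(0, 0.1, (4 * H, H)),
            np.zeros(4 * H))

spec = np.abs(rng.normal(size=(T, F)))  # stand-in |STFT| of low-quality audio

# Frequency LSTM: scan across bins within each frame, keep the frame summary.
Wf, Uf, bf = init(1, H)
f_feats = np.stack([run_lstm(frame[:, None], Wf, Uf, bf, H)[-1] for frame in spec])

# Time LSTM: scan the frequency summaries across frames.
Wt, Ut, bt = init(H, H)
t_feats = run_lstm(f_feats, Wt, Ut, bt, H)

# Linear output layer maps hidden states back to a restored magnitude spectrogram.
Wo = rng.normal(0, 0.1, (F, H))
restored = t_feats @ Wo.T
print(restored.shape)  # (10, 8): same shape as the input spectrogram
```

In a full pipeline, the restored magnitude spectrogram would still need a phase to yield a waveform; the Griffin-Lim algorithm (Griffin and Lim 1984, cited by the paper) is the standard choice for that step.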




Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Deng, J., Schuller, B., Eyben, F. et al. Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration. Neural Comput & Applic 32, 1095–1107 (2020). https://doi.org/10.1007/s00521-019-04158-0