Abstract
Perceptual audio coding is widely and successfully applied for audio compression. However, perceptual audio coders may inject audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem that aims to recover a high-quality audio signal, close to the uncompressed original, from a low-quality encoded version. In this paper, we propose a novel data-driven method for audio restoration, in which temporal and spectral dynamics are explicitly captured by a deep time-frequency LSTM (TF-LSTM) recurrent neural network. Leveraging the captured temporal and spectral information facilitates the task of learning a nonlinear mapping from the magnitude spectrogram of low-quality audio to that of high-quality audio. The proposed method substantially attenuates audible artifacts caused by codecs and is conceptually straightforward. Extensive experiments were carried out, and the results show that for low-bitrate audio at 96 kbps (mono), 64 kbps (mono), and 96 kbps (stereo), the proposed method efficiently generates improved-quality audio that is competitive with, or even superior in perceptual quality to, the audio produced by other state-of-the-art deep neural network methods and the LAME-MP3 codec.
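The abstract's core idea, learning a spectrogram-to-spectrogram mapping with LSTMs that scan both the frequency axis (spectral dynamics) and the time axis (temporal dynamics), can be sketched with a toy, untrained NumPy model. Everything below is an illustrative assumption rather than the paper's implementation: the dimensions, weight initializations, and function names are made up, and a real TF-LSTM would keep per-bin frequency outputs and be trained on paired low-/high-quality magnitude spectrograms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step: gate pre-activations z = W x + U h + b, split i, f, o, g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H]); f = sigmoid(z[H:2 * H])
    o = sigmoid(z[2 * H:3 * H]); g = np.tanh(z[3 * H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(seq, W, U, b, H):
    """Scan an LSTM along the first axis of `seq`; return stacked hidden states."""
    h = np.zeros(H); c = np.zeros(H)
    out = []
    for x in seq:
        h, c = lstm_cell(x, h, c, W, U, b)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, F, H = 10, 8, 16  # frames, frequency bins, hidden size (toy values)

def init(in_dim, H):
    """Random, untrained weights (W, U, b) for a single LSTM layer."""
    return (rng.normal(0, 0.1, (4 * H, in_dim)),
            rng.normal(0, 0.1, (4 * H, H)),
            np.zeros(4 * H))

spec = np.abs(rng.normal(size=(T, F)))  # stand-in |STFT| of low-quality audio

# Frequency LSTM: scan across bins within each frame, keep the frame summary.
Wf, Uf, bf = init(1, H)
f_feats = np.stack([run_lstm(frame[:, None], Wf, Uf, bf, H)[-1] for frame in spec])

# Time LSTM: scan the frequency summaries across frames.
Wt, Ut, bt = init(H, H)
t_feats = run_lstm(f_feats, Wt, Ut, bt, H)

# Linear output layer maps hidden states back to a restored magnitude spectrogram.
Wo = rng.normal(0, 0.1, (F, H))
restored = t_feats @ Wo.T
print(restored.shape)  # (10, 8): same shape as the input spectrogram
```

In a full pipeline, the restored magnitude spectrogram would still need a phase to yield a waveform; the Griffin-Lim algorithm (Griffin and Lim 1984, cited by the paper) is the standard choice for that step.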




Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Deng, J., Schuller, B., Eyben, F. et al. Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration. Neural Comput & Applic 32, 1095–1107 (2020). https://doi.org/10.1007/s00521-019-04158-0