
Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration

  • Deep learning for music and audio
  • Published in: Neural Computing and Applications

Abstract

Perceptual audio coding is widely and successfully applied for audio compression. However, perceptual audio coders may introduce audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem that aims to recover a high-quality audio signal, close to the uncompressed original, from a low-quality encoded version. In this paper, we propose a novel data-driven method for audio restoration in which temporal and spectral dynamics are explicitly captured by a deep time-frequency LSTM recurrent neural network. Leveraging the captured temporal and spectral information facilitates the task of learning a nonlinear mapping from the magnitude spectrogram of low-quality audio to that of high-quality audio. The proposed method substantially attenuates audible artifacts caused by codecs and is conceptually straightforward. Extensive experiments show that, for low-bitrate audio at 96 kbps (mono), 64 kbps (mono), and 96 kbps (stereo), the proposed method efficiently generates improved-quality audio that is competitive with, or even superior in perceptual quality to, the audio produced by other state-of-the-art deep neural network methods and the LAME MP3 codec.
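The abstract describes a pipeline that operates on magnitude spectrograms: the decoded low-bitrate signal is transformed with an STFT, a learned model enhances the magnitudes, and the result is inverted back to a waveform. The sketch below illustrates only that framing in plain NumPy, with a placeholder identity function standing in for the trained time-frequency LSTM, and it assumes the common choice of reusing the low-quality signal's phase for reconstruction; all function names and parameters here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: frame the signal, apply a Hann window, take the rFFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + (n_frames - 1) * hop)
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i] * window
        norm[i * hop : i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

def restore(low_quality, enhance_magnitude=lambda m: m):
    """Restoration skeleton: enhance magnitudes, keep the input phase."""
    spec = stft(low_quality)
    mag, phase = np.abs(spec), np.angle(spec)
    mag_restored = enhance_magnitude(mag)  # trained TF-LSTM would go here
    return istft(mag_restored * np.exp(1j * phase))
```

With the identity placeholder, `restore` reduces to analysis followed by synthesis, so the interior of the signal is reconstructed almost exactly; swapping in a trained spectrogram-to-spectrogram model is what turns this skeleton into the restoration method the paper studies.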



Author information


Corresponding author

Correspondence to Jun Deng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Deng, J., Schuller, B., Eyben, F. et al. Exploiting time-frequency patterns with LSTM-RNNs for low-bitrate audio restoration. Neural Comput & Applic 32, 1095–1107 (2020). https://doi.org/10.1007/s00521-019-04158-0
