Skip to main content

Analysis and classification of acoustic scenes with wavelet transform-based mel-scaled features

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Analysis of audio from real-life environments and their categorization into different acoustic scenes can make context-aware devices and applications more efficient. Unlike speech, such signals have overlapping frequency content while spanning a much larger audible frequency range. Also, they are less structured than speech/music signals. Wavelet transform has good time-frequency localization ability owing to its variable-length basis functions. Consequently, it facilitates the extraction of more characteristic information from environmental audio. This paper attempts to classify acoustic scenes by a novel use of wavelet-based mel-scaled features. The design of the proposed framework is based on the experiments conducted on two datasets which have same scene classes but differ with regard to sample length and amount of data (in hours). It outperformed two benchmark systems, one based on mel-frequency cepstral coefficients and Gaussian mixture models and the other based on log mel-band energies and multi-layer perceptron. We also present an investigation on the use of different train and test sample duration for acoustic scene classification.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Aucouturier JJ, Défréville B, Pachet F (2007) The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. J Acoustic Soc Amer 122(2):881–891

    Article  Google Scholar 

  2. Barchiesi D, Giannoulis D, Stowell D, Plumbley MD (2015) Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Proc Mag 32(3):16–34

    Article  Google Scholar 

  3. Bisot V, Serizel R, Essid S, Richard G (2017) Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans Audio, Speech, Language Process 25(6):1216–1229

    Article  Google Scholar 

  4. Brown GJ, Cooke M (1994) Computational auditory scene analysis. Comput Speech Language 8(4):297–336

    Article  Google Scholar 

  5. Brummer N (2007) FoCal multi-class: toolkit for evaluation, fusion and calibration of multi-class recognition scores. Tutorial and user manual. Software available at https://sites.google.com/site/nikobrummer/focal

  6. Chu S, Narayanan S, Kuo CCJ (2009) Environmental sound recognition with time-frequency audio features. IEEE Trans Audio, Speech, and Language Process 17 (6):1142–1158

    Article  Google Scholar 

  7. Daubechies I (1992) Ten lectures on wavelets, 61. Siam

  8. Davis SB, Mermelstein P (1990) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In: Readings in speech recognition. Elsevier, 65–74

  9. Dubois D, Guastavino C, Raimbault M (2006) A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories. Acta Acustic United Acustica 92(6):865–874

    Google Scholar 

  10. Eghbal-Zadeh H, Lehner B, Dorfer M, Widmer G (2016) CP-JKU Submissions for DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional neural networks. In: IEEE AASP Challenge on detection and classification of acoustic scenes and events (DCASE 2016), Budapest, Hungary, Tech. Rep

  11. Gabor D (1946) Theory of communication. part 1: the analysis of information. J Institut Electr Engineers-Part III: Radio Commun Eng 93(26):429–441

    Google Scholar 

  12. Ghodasara V, Naser DS, Waldekar S, Saha G (2015) Speech/music classification using block based MFCC features. Music Information Retrieval Evaluation eXchange (MIREX)

  13. Ghodasara V, Waldekar S, Paul D, Saha G (2016) Acoustic scene classification using block-based MFCC features. In: IEEE AASP Challenge on detection and classification of acoustic scenes and events (DCASE 2016), Budapest, Hungary, Tech. Rep

  14. Giannoulis D, Benetos E, Stowell D, Rossignol M, Lagrange M, Plumbley MD (2013) Detection and classification of acoustic scenes and events: an IEEE AASP challenge. In: 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 1–4

  15. Gowdy JN, Tufekci Z (2000) Mel-scaled discrete wavelet coefficients for speech recognition. In: Proceedings of the 2000 IEEE international conference on Acoustics, speech, and signal processing. ICASSP’00, vol 3, pp 1351–1354

  16. Kim K, Youn DH, Lee C (2000) Evaluation of wavelet filters for speech recognition. In: 2000 IEEE international conference on Systems, man, and cybernetics, vol 4, pp 2891–2894

  17. Lagrange M, Lafay G, Défréville B, Aucouturier JJ (2015) The bag-of-frames approach: A not so sufficient model for urban soundscapes. J Acoustic Soc Amer 138(5):EL487–EL492

    Article  Google Scholar 

  18. Li Y, Zhang X, Jin H, Li X, Wang Q, He Q, Huang Q (2018) Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimed Tool Appl 77(1):897–916

    Article  Google Scholar 

  19. Lyon RF (2010) Machine hearing: an emerging field. IEEE Signal Proc Mag 27 (5):131–139

    Article  Google Scholar 

  20. Ma J, Wang R, Ji W, Zheng H, Zhu E, Yin J (2019) Relational recurrent neural networks for polyphonic sound event detection. Multimedia Tools and Applications 1–19

  21. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674–693

    Article  Google Scholar 

  22. Mesaros A, Heittola T, Eronen A, Virtanen T (2010) Acoustic event detection in real life recordings. In: IEEE 2010 18th European on Signal Processing Conference, pp 1267–1271

  23. Mesaros A, Heittola T, Virtanen T (2016) TUT database for acoustic scene classification and sound event detection. In: IEEE 2016 24th European on Signal Processing Conference (EUSIPCO), pp 1128–1132

  24. Mesaros A, Heittola T, Benetos E, Foster P, Lagrange M, Virtanen T, Plumbley M (2017) Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech and Language Processing

  25. Mesaros A, Heittola T, Diment A, Elizalde B, Shah A, Vincent E, Raj B, Virtanen T (2017) DCASE 2017 challenge setup: tasks, datasets and baseline system. In: DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events

  26. Mesaros A, Heittola T, Virtanen T (2018) Acoustic scene classification: an overview of DCASE 2017 challenge entries. In: 16th International Workshop on Acoustic Signal Enhancement (IWAENC)

  27. Mun S, Park S, Han D, Ko H (2017) Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane. Tech. rep., Tech. Rep., DCASE2017 Challenge

  28. Phan H, Chén OY, Koch P, Pham L, McLoughlin I, Mertins A, De Vos M (2018) Beyond equal-length snippets: How long is sufficient to recognize an audio scene?. arXiv:181101095

  29. Rabaoui A, Davy M, Rossignol S, Ellouze N (2008) Using one-class SVMs and wavelets for audio surveillance. IEEE Trans Inform Forensics Secur 3(4):763–775

    Article  Google Scholar 

  30. Rakotomamonjy A, Gasso G (2015) Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP) 23(1):142–153

    Google Scholar 

  31. Sahidullah M, Saha G (2012) Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Comm 54(4):543–565

    Article  Google Scholar 

  32. Stowell D, Giannoulis D, Benetos E, Lagrange M, Plumbley MD (2015) Detection and classification of acoustic scenes and events. IEEE Trans Multimed 17 (10):1733–1746

    Article  Google Scholar 

  33. Tang G, Liang R, Xie Y, Bao Y, Wang S (2018) Improved convolutional neural networks for acoustic event classification. Multimedia Tools and Applications 1–16

  34. Tufekci Z, Gowdy J (2000) Feature extraction using discrete wavelet transform for speech recognition. In: 2000 Proceedings of the IEEE on Southeastcon, pp 116–123

  35. Tzanetakis G, Essl G, Cook P (2001) Audio analysis using the discrete wavelet transform. In: Proc. Conf. in Acoustics and Music Theory Applications, vol 66

  36. Waldekar S, Saha G (2018) Classification of audio scenes with novel features in a fused system framework. Digital Signal Processing

  37. Waldekar S, Saha G (2018) Wavelet transform based mel-scaled features for acoustic scene classification. Proc Interspeech 2018:3323–3327

    Article  Google Scholar 

  38. Weston J, Watkins C et al (1999) Support vector machines for multi-class pattern recognition. In: Esann, vol 99, pp 219–224

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shefali Waldekar.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Waldekar, S., Saha, G. Analysis and classification of acoustic scenes with wavelet transform-based mel-scaled features. Multimed Tools Appl 79, 7911–7926 (2020). https://doi.org/10.1007/s11042-019-08279-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-019-08279-5

Keywords