
Hand-crafted versus learned representations for audio event detection

Published in: Multimedia Tools and Applications

Abstract

Audio Event Detection (AED) pertains to identifying the types of events in audio signals. AED is essential for applications that make decisions based on audio, which can be critical in, for example, health, surveillance and security settings. Despite the proven benefits of deep learning in obtaining the best representation for solving a problem, AED studies still generally employ hand-crafted representations, even when deep learning is used to solve the AED task itself. Intrigued by this, we investigate whether or not hand-crafted representations (i.e. the spectrogram, mel spectrogram, log mel spectrogram and mel-frequency cepstral coefficients) are better than a representation learned using a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features, by a large margin (~30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
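The hand-crafted representations compared in the abstract are all time-frequency transforms whose quality depends on the window size and hop size the paper tunes. As a rough illustration only (not the authors' code; the function and parameter names are hypothetical), a plain-NumPy magnitude spectrogram exposing those two parameters might look like this:

```python
import numpy as np

def spectrogram(signal, window_size=1024, hop_size=512):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform.

    window_size and hop_size are the two parameters the paper optimizes
    per representation; the defaults here are common choices, not the
    paper's tuned values.
    """
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop_size
    frames = np.stack([
        signal[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins: window_size // 2 + 1 of them
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (30, 513): 30 frames, 513 frequency bins
```

The mel, log-mel and MFCC representations studied in the paper are derived from such a spectrogram by applying a mel filterbank, a logarithm, and a discrete cosine transform, respectively.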



Acknowledgements

We would like to thank Türk Telekom Research Center for providing hardware components for the experiments. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.

Author information


Corresponding author

Correspondence to Selver Ezgi Küçükbay.

Ethics declarations

Conflict of interest

Selver Ezgi Küçükbay, Adnan Yazıcı and Sinan Kalkan declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Küçükbay, S.E., Yazıcı, A. & Kalkan, S. Hand-crafted versus learned representations for audio event detection. Multimed Tools Appl 81, 30911–30930 (2022). https://doi.org/10.1007/s11042-022-12873-5

