Abstract
Extraction of effective audio features from acoustic events definitely influences the performance of Acoustic Event Detection (AED) system, especially in adverse audio situations. In this study, we propose a framework for extracting Deep Audio Feature (DAF) using multi-stream hierarchical Deep Neural Network (DNN). The DAF outputted from the proposed framework fuses the potential complementary information of multiple input feature streams and thus could be more discriminative than those input features for AED. We take two input feature streams and the hierarchical DNNs with two stages as an example for showing the extraction of DAF. The effectiveness of different audio features for AED is evaluated on two audio corpora, i.e. BBC (British Broadcasting Corporation) audio dataset and TV audio dataset with different signal-to-noise ratios. Experimental results show that DAF outperforms other features for AED under several experimental conditions.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-016-4332-z/MediaObjects/11042_2016_4332_Fig1_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-016-4332-z/MediaObjects/11042_2016_4332_Fig2_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-016-4332-z/MediaObjects/11042_2016_4332_Fig3_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-016-4332-z/MediaObjects/11042_2016_4332_Fig4_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11042-016-4332-z/MediaObjects/11042_2016_4332_Fig5_HTML.gif)
Similar content being viewed by others
References
Atrey PK, Maddage M, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: Proc. of IEEE ICASSP, pp 813–816. IEEE
British Broadcasting Corporation (BBC), “BBC Sound Effects Library,” http://www.sound-ideas.com/bbc.html, Accessed May 2015
Bugalho M, Portelo J, Trancoso I, Pellegrini T, Abad A (2009) Detecting audio events for semantic video search. In: Proc. of INTERSPEECH, pp 1151–1154. ISCA
Cakir E, Heittola T, Huttunen H, Virtanen T (2015) Polyphonic sound event detection using multi label deep neural networks. In: Proc. of International Joint Conference on Neural Networks, pp 1–7. IEEE
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. In: ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. ACM
Childers DG, Skinner DP, Kemerait RC (1977) The cepstrum: a guide to processing. In: Proceeding of IEEE, 65(10):1428–1443. IEEE
Diment A, Heittola T, Virtanen T (2013) Sound event detection for office live and office synthetic AASP challenge. In: Proc. of IEEE AASP challenge on detection and classification of acoustic scenes and events. IEEE
Fawcett T (2011) An introduction to ROC analysis. In: Pattern Recognition Letters, 27(8):861–874. Elsevier
Gabor D (1946) Theory of communication. In: Institute Electronica, no. 93, pp 429–457
Gencoglu O, Virtanen T, Huttunen H (2014) Recognition of acoustic events using deep neural networks. In: Proc. of the 22nd European Signal Processing Conference, pp 506–510. ISCA
Giannoulis D, Stowell D, Benetos E, Rossignol M, Lagrange M, Plumbley MD (2013) A database and challenge for acoustic scene classification and event detection. In: Proc. of EUSIPCO, pp 1–5. ISCA
Grezl F, Karafiat M, Kontar S, Cernocky J (2007) Probabilistic and bottle-neck features for LVCSR of meetings. In: Proc. of IEEE ICASSP, pp 757–760. IEEE
Heittola T, Klapuri A (2008) TUT acoustic event detection system 2007. In: multimodal technologies for perception of humans, vol. 4625 of the series Lecture Notes in Computer Science, pp 364–370. Springer
Heittola T, Mesaros A, Virtanen T, Gabbouj M (2013) Supervised model training for overlapping sound events based on unsupervised source separation. In: Proc. of IEEE ICASSP, Vancouver, Canada, pp 8677–8681. IEEE
Hinton GE, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. In: IEEE Signal Processing Magazine, 29(6):82–97. IEEE
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527–1554, MIT Press
Jin F, Sattar F, Krishnan S (2012) Log-frequency spectrogram for respiratory sound monitoring. In: Proc. of IEEE ICASSP, pp 597–600. IEEE
Lin KZ, Pwint M (2010) Structuring sport video through audio event classification. In: PCM 2010, Part I, LNCS 6297, pp 481–492. Springer
Loren DE, Robert KO (1968) Programming and analysis for digital time series data, United Stated Department of Defense, first edition, Shock and Vibration Information Center
Lu L, Hanjalic A (2009) audio keywords discovery for text-like audio content analysis and retrieval. In: IEEE Trans. on Multimedia 10(1):74–85. IEEE
Ma L, Milner B, Smith D (2006) Acoustic environment classification. In: ACM Trans. On Speech Language Processing, 3(2):1–22. ACM
McLoughlin I, Zhang HM, Xie ZP, Song Y, Xiao W (2015) Robust sound event classification using deep neural networks. In: IEEE Trans. on Audio, Speech, and Language Processing, 23(3):540–552. IEEE
Moritz N, Anemüller J, Kollmeier B (2011) Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments. In: Proc. of IEEE ICASSP, pp 5492–5495. IEEE
Niessen ME, Van Kasteren TLM, Merentitis A (2013) Hierarchical modeling using automated sub-clustering for sound event recognition. In: Proc. of IEEE workshop on applications of signal processing to audio and acoustics, pp 1–4. IEEE
Nogueira W, Roma G, Herrera P (2013) Automatic event classification using front end single channel noise reduction, MFCC features and a support vector machine classifier. In: IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE
Okuyucu C, Sert M, Yazlcl A (2013) Audio feature and classifier analysis for efficient recognition of environmental sounds. In: Proc. of IEEE International Symposium on Multimedia, pp 125–132. IEEE
Phan H, Maaß M, Mazur R, Mertins A (2015) Random regression forests for acoustic event detection and classification. In: IEEE Trans. on Audio Speech & Language Processing, 23(1):20–31. IEEE
Qiu A, Schreiner C, Escabi M (2003) Gabor analysis of auditory midbrain receptive fields: spectro-temporal and binaural composition. J Neurophysiol 90(1):456–476, American Physiological Society
Schadler MR, Kollmeier B (2012) Normalization of spectro-temporal Gabor filter bank features for improved robust automatic speech recognition systems. In: Proc. of INTERSPEECH, pp 1–4. ISCA
Schädler MR, Meyer BT, Kollmeier B (2012) Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J Acoust Soc Am 131(5):4134–4151, Acoustical Society of America
Schröder J, Cauchi B, Schädler MR, Moritz N, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) Acoustic event detection using signal enhancement and spectro-temporal feature extraction. IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE
Schröder J, Goetze S, Anemüller J (2015) Spectro-temporal gabor filterbank features for acoustic event detection. In: IEEE/ACM Trans. on Audio, Speech, and Language Processing, 23(12):2198–2208. IEEE/ACM
Schröder J, Moritz N, Schädler MR, Cauchi B, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) On the use of spectro-temporal features for the IEEE AASP challenge detection and classification of acoustic scenes and events. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 1–4. IEEE
Temko A, Malkin R, Zieger C, Macho D, Nadeu C, Omologo M (2007) Clear evaluation of acoustic event detection and classification systems. Lecture notes in computing science, 4122:311–322. Springer
Temko A, Nadeu C (2009) Acoustic event detection in meeting-room environments. In: Pattern recognition letter, 30(14):1281–1288. Elsevier
Temko A, Nadeu C, Macho D, Malkin R, Zieger C, Omologo M (2009) Acoustic event detection and classification. In: Computers in the human interaction loop, pp 61–73. Springer
Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. In: Speech Communication, 12(3):247–251. ISCA
Veselý K, Lukáš B, František (2010) Parallel training of neural networks for speech recognition. In: Proc. of INTERSPEECH, pp 439–446. ISCA
Wang S, Yang X, Zhang Y, Phillips P, Yang J, Yuan T (2015) Identification of green, Oolong and black teas in China via wavelet packet entropy and fuzzy support vector machine. In: Entropy, 17(10):6663–6682. MDPI
Young SJ, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland PC (2006) The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge
Yu D, Seltzer ML (2011) Improved bottleneck features using pretrained deep neural networks. In: Proc. of INTERSPEECH, pp 237–240. ISCA
Zhang Y, Chen S, Wang S, Yang J, Phillips P (2015) Magnetic resonance brain image classification based on weighted-type fractional Fourier transform and nonparallel support vector machine. Int J Imaging Syst Technol 25(4):317–327, Wiley
Zhang X, He Q, Feng X (2015) Acoustic feature extraction by tensor-based sparse representation for sound effects classification. In: Proc. of IEEE ICASSP, pp 166–170. IEEE
Zhang Y, Wu L (2012) Classification of fruits using computer vision and a multiclass support vector machine. In: Sensors, 12(9):12489–12505. MDPI
Acknowledgments
The work was supported by the National Natural Science Foundation of China (61101160, 61271314, 61571192), the Fundamental Research Funds for the Central Universities, South China University of Technology, China (2015ZZ102), Project of the Pearl River Young Talents of Science and Technology in Guangzhou, China (2013J2200070), Science and Technology Planning Project of Guangdong Province (2014A050503022, 2015A010103003) and the Foundation of China Scholarship Council (201208440078).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Li, Y., Zhang, X., Jin, H. et al. Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimed Tools Appl 77, 897–916 (2018). https://doi.org/10.1007/s11042-016-4332-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-016-4332-z