Abstract
Acoustic scene classification (ASC) maps an environmental sound recording to one of a set of predefined classes representing the auditory scene in which it was recorded. This paper proposes an ASC solution that combines convolutional neural networks, long short-term memory (LSTM) cells, and a multi-temporal input encoding. The main novelty of the work is the use of the complex modulation spectrogram for feature extraction. We evaluate the complex modulation spectrogram as a discriminant feature representation and obtain a 4.7% accuracy improvement over the commonly used Mel spectrogram. These features are computed for individual temporal segments of the audio recording, yielding a representation that captures both spectral and temporal structure. We also derive a de-noising method that, while beneficial in other speech processing tasks, has not previously been applied to ASC; it improves prediction accuracy by 1.5% relative to a model without de-noising. On the DCASE 2017 evaluation data, the proposed model outperforms state-of-the-art methods by 7.5% in prediction accuracy.
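To illustrate the feature pipeline the abstract describes, the sketch below computes a modulation spectrogram: a first STFT gives the acoustic-frequency envelope, and a second FFT along the time axis of each frequency band, taken per temporal segment, gives the (complex) modulation-frequency content. This is a minimal numpy sketch, not the authors' implementation; the window size, hop, and segment length are illustrative assumptions.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude STFT via a sliding Hann window -> (time frames, freq bins)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

def modulation_spectrogram(x, win=256, hop=128, seg_len=64):
    """Complex modulation spectrogram sketch: a second FFT applied along the
    time axis of each acoustic-frequency band, per segment of seg_len frames."""
    S = stft_mag(x, win, hop)                    # (time, acoustic freq)
    T = (S.shape[0] // seg_len) * seg_len        # drop the ragged tail
    S = S[:T].reshape(-1, seg_len, S.shape[1])   # (segments, time, acoustic freq)
    # FFT over the temporal (modulation) axis inside each segment; the
    # complex result keeps both modulation magnitude and phase
    return np.fft.rfft(S, axis=1)                # (segments, mod freq, acoustic freq)

# toy usage: a 440 Hz tone amplitude-modulated at 2 Hz, sampled at 8 kHz
fs = 8000
t = np.arange(fs * 2) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 2 * t)) * np.sin(2 * np.pi * 440 * t)
M = modulation_spectrogram(x)
print(M.shape)  # (segments, modulation bins, acoustic bins)
```

Stacking the per-segment arrays along the segment axis gives the multi-temporal input encoding that the convolutional LSTM then consumes as a sequence.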
Data availability
The datasets analyzed during the current study are available in the following repository: https://dcase.community/challenge2017/download
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mirzaei, S., Jazani, I.K. Acoustic scene classification with multi-temporal complex modulation spectrogram features and a convolutional LSTM network. Multimed Tools Appl 82, 16395–16408 (2023). https://doi.org/10.1007/s11042-022-14192-1