Abstract
The amount of digital music available on the internet has grown rapidly with the development of digital multimedia technology. Managing these massive music resources is a thorny problem for music media platforms, and music genre classification plays an important role in it: a good genre classifier is indispensable for the efficient organization, retrieval, and recommendation of music resources. Owing to the powerful feature extraction capability of convolutional neural networks (CNNs), more and more researchers are devoting their efforts to CNN-based music genre classification models. However, many of these models do not design the convolutional structure around the characteristics of musical signals, which leaves the convolutional part of the model overly simple and weakens its local feature extraction ability. To address this problem, we propose a model that uses a 1D res-gated CNN, rather than a conventional CNN architecture, to extract local information from audio sequences. Meanwhile, to aggregate global information over audio feature sequences, we apply the Transformer to the music genre classification model and modify the Transformer's decoder structure for this task. Experiments on the benchmark GTZAN and Extended Ballroom datasets show that our model outperforms most previous approaches and improves the performance of music genre classification.
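The res-gated convolution described above can be understood as a gated linear unit with a residual shortcut: a main convolution whose output is modulated elementwise by a sigmoid-gated convolution, then added back to the input. The following is a minimal single-channel sketch of that idea, assuming the block computes output = x + conv(x) ⊙ σ(conv(x)); the paper's actual channel counts, kernel sizes, and normalization are not given in the abstract, so the kernels here (`w_main`, `w_gate`) are purely illustrative.

```python
import math

def conv1d(seq, kernel):
    """1D convolution with zero 'same' padding, so output length == input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def res_gated_conv_block(seq, w_main, w_gate):
    """One res-gated 1D convolution block (sketch):
    output = x + conv(x, w_main) * sigmoid(conv(x, w_gate))."""
    main = conv1d(seq, w_main)
    gate = [sigmoid(g) for g in conv1d(seq, w_gate)]
    return [x + m * g for x, m, g in zip(seq, main, gate)]
```

Because the sigmoid gate scales each position of the main convolution's output between 0 and 1, the block can selectively suppress or pass local patterns, while the residual shortcut keeps gradients flowing through deep stacks, which is the motivation for preferring this structure over a plain CNN for local audio features.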
Ethics declarations
No potential conflict of interest was reported by the authors.
Cite this article
Xie, C., Song, H., Zhu, H. et al. Music genre classification based on res-gated CNN and attention mechanism. Multimed Tools Appl 83, 13527–13542 (2024). https://doi.org/10.1007/s11042-023-15277-1