
Music genre classification based on res-gated CNN and attention mechanism

Published in Multimedia Tools and Applications

Abstract

The amount of digital music available on the internet has grown significantly with the rapid development of digital multimedia technology. Managing these massive music resources is a thorny problem for music media platforms, and music genre classification plays an important role in it: a good genre classifier is indispensable for the efficient organization, retrieval, and recommendation of music resources. Owing to the powerful feature extraction capability of convolutional neural networks (CNNs), more and more researchers are devoting their efforts to CNN-based music genre classification models. However, many models do not design the convolutional structure around the characteristics of musical signals, which results in an overly simple convolutional component and weak local feature extraction. To address this problem, we propose a model that uses a 1D res-gated CNN, rather than a traditional CNN architecture, to extract local information from audio sequences. Meanwhile, to aggregate the global information of audio feature sequences, we apply the Transformer to music genre classification and modify its decoder structure for the task. Experiments on the benchmark GTZAN and Extended Ballroom datasets show that our model outperforms most previous approaches and improves the performance of music genre classification.
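The full architecture is behind the paywall, but the res-gated idea named in the abstract commonly combines a gated linear unit (a convolution modulated by a sigmoid-gated convolution) with a residual connection. The following is a minimal NumPy sketch of one such block, assuming the pattern `output = x + conv(x) * sigmoid(conv(x))`; the paper's actual kernel sizes, channel counts, and layer stacking may differ.

```python
import numpy as np

def conv1d(x, w):
    """1D convolution (ML convention, i.e. cross-correlation) of a
    (channels, length) signal with a (out_ch, in_ch, k) kernel,
    using 'same' zero padding so the length is preserved."""
    out_ch, in_ch, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    length = x.shape[1]
    y = np.zeros((out_ch, length))
    for t in range(length):
        window = xp[:, t:t + k]  # (in_ch, k) slice of the padded input
        y[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return y

def res_gated_block(x, w_lin, w_gate):
    """One res-gated 1D convolution block:
    output = x + conv(x, w_lin) * sigmoid(conv(x, w_gate))."""
    gate = 1.0 / (1.0 + np.exp(-conv1d(x, w_gate)))  # sigmoid gate in [0, 1]
    return x + conv1d(x, w_lin) * gate               # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))           # 8 feature channels, 32 time steps
w_lin = rng.standard_normal((8, 8, 3)) * 0.1
w_gate = rng.standard_normal((8, 8, 3)) * 0.1
y = res_gated_block(x, w_lin, w_gate)
print(y.shape)                             # (8, 32)
```

The gate lets the network suppress or pass each local feature per time step, while the residual path keeps gradients flowing through deep stacks, which is the usual motivation for pairing the two.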




Author information


Corresponding author

Correspondence to Huazhu Song.

Ethics declarations

No potential conflict of interest was reported by the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xie, C., Song, H., Zhu, H. et al. Music genre classification based on res-gated CNN and attention mechanism. Multimed Tools Appl 83, 13527–13542 (2024). https://doi.org/10.1007/s11042-023-15277-1


