Abstract
The amount of digital music available on the internet has grown rapidly with the development of digital multimedia technology. Managing these massive music resources is a thorny problem for music media platforms, and music genre classification plays an important role in it: a good genre classifier is indispensable for the efficient organization, retrieval, and recommendation of music resources. Owing to the powerful feature extraction capability of convolutional neural networks (CNNs), more and more researchers are devoting their efforts to CNN-based music genre classification models. However, many of these models do not design the convolutional structure around the characteristics of musical signals, which leaves the convolutional part of the model overly simple and weakens its local feature extraction ability. To address this problem, we propose a model that uses a 1D res-gated CNN, rather than a conventional CNN architecture, to extract local information from audio sequences. Meanwhile, to aggregate global information over audio feature sequences, we apply the Transformer to the music genre classification model and modify the Transformer's decoder structure for this task. Experiments on the benchmark GTZAN and Extended Ballroom datasets show that our model outperforms most previous approaches and improves the performance of music genre classification.
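The res-gated convolution described above can be understood as a gated linear unit with a residual shortcut: a main convolution whose output is modulated elementwise by a sigmoid-gated convolution, then added back to the input. The following is a minimal single-channel sketch of that idea, assuming the block computes output = x + conv(x) ⊙ σ(conv(x)); the paper's actual channel counts, kernel sizes, and normalization are not given in the abstract, so the kernels here (`w_main`, `w_gate`) are purely illustrative.

```python
import math

def conv1d(seq, kernel):
    """1D convolution with zero 'same' padding, so output length == input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def res_gated_conv_block(seq, w_main, w_gate):
    """One res-gated 1D convolution block (sketch):
    output = x + conv(x, w_main) * sigmoid(conv(x, w_gate))."""
    main = conv1d(seq, w_main)
    gate = [sigmoid(g) for g in conv1d(seq, w_gate)]
    return [x + m * g for x, m, g in zip(seq, main, gate)]
```

Because the sigmoid gate scales each position of the main convolution's output between 0 and 1, the block can selectively suppress or pass local patterns, while the residual shortcut keeps gradients flowing through deep stacks, which is the motivation for preferring this structure over a plain CNN for local audio features.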
Ethics declarations
No potential conflict of interest was reported by the authors.
Cite this article
Xie, C., Song, H., Zhu, H. et al. Music genre classification based on res-gated CNN and attention mechanism. Multimed Tools Appl 83, 13527–13542 (2024). https://doi.org/10.1007/s11042-023-15277-1