Abstract
Deep convolutional neural networks (DCNNs) have been widely used in music auto-tagging, a multi-label classification task that predicts tags of audio signals. This paper presents a sample-level DCNN for music auto-tagging. The proposed DCNN highlights two components: a strided convolutional layer for extracting local features and reducing the temporal dimension, and a residual block from WaveNet for preserving input resolution and extracting more complex features. To further improve performance, a squeeze-and-excitation (SE) block is introduced into the residual block. Under the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric, experimental results on the MagnaTagATune (MTAT) dataset show that the two proposed models achieve scores of 91.47% and 92.76%, respectively. Furthermore, our proposed models slightly surpass the state-of-the-art model, SampleCNN with SE block.
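The two components named above can be illustrated concretely. The sketch below is a minimal, dependency-free Python illustration (not the authors' implementation): `conv1d_strided` shows how a strided 1-D convolution both extracts local features and shrinks the temporal dimension, and `squeeze_excite` shows the SE operation of globally pooling each channel, passing the pooled vector through a small bottleneck, and rescaling channels with sigmoid gates. All weight shapes and helper names here are illustrative assumptions.

```python
import math

def conv1d_strided(x, kernels, stride):
    """Strided 1-D convolution over raw samples.
    x: [in_channels][time], kernels: [out_channels][in_channels][k].
    The stride reduces the temporal dimension by roughly that factor."""
    in_ch, t = len(x), len(x[0])
    k = len(kernels[0][0])
    out = []
    for w in kernels:  # one output channel per kernel
        row = [sum(w[c][i] * x[c][start + i]
                   for c in range(in_ch) for i in range(k))
               for start in range(0, t - k + 1, stride)]
        out.append(row)
    return out

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation gating.
    x: [channels][time]; w1: bottleneck weights [r][channels];
    w2: expansion weights [channels][r]. Returns x rescaled per channel."""
    squeezed = [sum(ch) / len(ch) for ch in x]            # squeeze: global avg pool
    hidden = [max(0.0, sum(wi * s for wi, s in zip(row, squeezed)))
              for row in w1]                              # ReLU bottleneck
    gates = [1.0 / (1.0 + math.exp(-sum(wi * h for wi, h in zip(row, hidden))))
             for row in w2]                               # sigmoid gate per channel
    return [[g * v for v in ch] for g, ch in zip(gates, x)]
```

For example, convolving an 8-sample, 2-channel input with stride 2 and kernel width 2 yields 4 time steps per output channel, and the SE gates always lie in (0, 1), so each channel is attenuated rather than amplified.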
Acknowledgments
This work is supported by the Research Fund for International Young Scientists of the National Natural Science Foundation of China (NSFC Grant No. 61550110248), the Sichuan Science and Technology Program (Grant No. 2019YFG0190), and Research on Sino-Tibetan Multi-Source Information Acquisition, Fusion, Data Mining and Its Application (Grant No. H04W170186).
Cite this article
Yu, Yb., Qi, Mh., Tang, Yf. et al. A sample-level DCNN for music auto-tagging. Multimed Tools Appl 80, 11459–11469 (2021). https://doi.org/10.1007/s11042-020-10330-9