
A sample-level DCNN for music auto-tagging

Multimedia Tools and Applications

Abstract

Deep convolutional neural networks (DCNNs) have been widely used in music auto-tagging, a multi-label classification task that predicts the tags of an audio signal. This paper presents a sample-level DCNN for music auto-tagging. The proposed DCNN highlights two components: strided convolutional layers, which extract local features and reduce the temporal dimension, and residual blocks from WaveNet, which preserve input resolution and extract more complex features. To further improve performance, a squeeze-and-excitation (SE) block is introduced into the residual block. Under the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric, experimental results on the MagnaTagATune (MTAT) dataset show that the two proposed models achieve 91.47% and 92.76%, respectively. Furthermore, our proposed models slightly surpass the state-of-the-art model, SampleCNN with an SE block.
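The sketch below illustrates the building blocks named in the abstract: strided 1-D convolutions over raw waveform samples to extract local features and shrink the temporal dimension, a WaveNet-style gated residual block that preserves resolution, and an SE block on the residual path. It is a minimal sketch assuming a PyTorch implementation; the channel counts, kernel sizes, number of layers, input length, and module names (SEBlock, SEResidualBlock, SampleLevelDCNN) are illustrative assumptions, not the authors' code, which is linked under Notes.

```python
# Hypothetical sketch of a sample-level DCNN with SE-augmented residual blocks.
# All hyperparameters and names are assumptions for illustration only.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise recalibration (squeeze-and-excitation) for 1-D feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (batch, channels, time)
        s = x.mean(dim=-1)                       # squeeze: global average over time
        w = self.fc(s).unsqueeze(-1)             # excitation: per-channel weights
        return x * w                             # rescale channels

class SEResidualBlock(nn.Module):
    """WaveNet-style gated residual block with an SE block on the residual path."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.out_conv = nn.Conv1d(channels, channels, 1)
        self.se = SEBlock(channels)

    def forward(self, x):
        h = torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
        h = self.se(self.out_conv(h))
        return x + h                             # input resolution is preserved

class SampleLevelDCNN(nn.Module):
    """Strided convolutions reduce the temporal dimension; SE-residual blocks
    extract higher-level features; global pooling feeds a sigmoid tag classifier."""
    def __init__(self, n_tags=50, channels=128, n_strided=9, n_res=3):
        super().__init__()
        layers = [nn.Conv1d(1, channels, kernel_size=3, stride=3), nn.ReLU(inplace=True)]
        for _ in range(n_strided - 1):           # each layer shrinks time by 3x
            layers += [nn.Conv1d(channels, channels, kernel_size=3, stride=3),
                       nn.BatchNorm1d(channels), nn.ReLU(inplace=True)]
        self.frontend = nn.Sequential(*layers)
        self.res_blocks = nn.Sequential(*[SEResidualBlock(channels, dilation=2 ** i)
                                          for i in range(n_res)])
        self.classifier = nn.Linear(channels, n_tags)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        z = self.res_blocks(self.frontend(wav))
        z = z.mean(dim=-1)                       # global average pooling over time
        return torch.sigmoid(self.classifier(z)) # multi-label tag probabilities

# Usage: nine stride-3 layers reduce a 59049-sample (3^10) excerpt to 3 frames.
model = SampleLevelDCNN(n_tags=50)
tags = model(torch.randn(2, 1, 59049))           # -> (2, 50) tag probabilities
```

Because each residual block keeps the temporal length fixed, deeper feature extraction after the strided front end does not further coarsen the representation, which is the role the abstract assigns to the WaveNet-style blocks.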


Notes

  1. https://github.com/qmh1234567/music-auto-tagging-by-sample-level-DCNN

References

  1. Bertin-Mahieux T, Ellis DP, Whitman B, Lamere P (2011) The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami

  2. Choi K, Fazekas G, Sandler M, Cho K (2017) Convolutional recurrent neural networks for music classification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, pp 2392–2396

  3. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, pp 6964–6968

  4. Fu Z, Lu G, Ting KM, Zhang D (2011) A survey of audio-based music classification and annotation. IEEE Trans Multimed 13(2):303–319


  5. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, pp 770–778

  6. Holschneider M, Kronland-Martinet R, Morlet J, Tchamitchian P, Combes JM, Grossman A (1989) A real-time algorithm for signal analysis with the help of the wavelet transform. Wavelets, Time-Frequency Methods and Phase Space 1:286

  7. Hoshen Y, Weiss RJ, Wilson KW (2015) Speech acoustic modeling from raw multichannel waveforms. In: 40th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, pp 4624–4628

  8. Hu J, Shen L, Albanie S, Sun G, Wu E (2020) Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 42:2011–2023

  9. Kim T, Lee J, Nam J (2019) Comparison and analysis of sample cnn architectures for audio classification. IEEE Journal of Selected Topics in Signal Processing 13:285–297

  10. Kumar A, Rajpal A, Rathore D (2018) Genre classification using feature extraction and deep learning techniques. In: 2018 10th International Conference on Knowledge and Systems Engineering (KSE), Ho Chi Minh City, pp 175–180

  11. Lee J, Nam J (2017) Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters 24(8):1208–1212

  12. Lee J, Park J, Kim KL, Nam J (2017) Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. arXiv preprint arXiv:1703.01789

  13. Lin Y, Chung C, Chen HH (2018) Playlist-based tag propagation for improving music auto-tagging. In: European Signal Processing Conference (EUSIPCO), Rome, pp 2270–2274

  14. Nam J, Choi K, Lee J, Chou S, Yang Y (2019) Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from Bach. IEEE Signal Processing Magazine 36(1):41–51

  15. Oord AVD, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499

  16. Pons J, Lidy T, Serra X (2016) Experimenting with musically motivated convolutional neural networks. In: 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), Bucharest, pp 1–6

  17. Rajanna AR, Aryafar K, Shokoufandeh A, Ptucha R (2015) Deep neural networks: A case study for music genre classification. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, pp 655–660

  18. Song G, Wang Z, Han F, Ding S, Iqbal MA (2018) Music auto-tagging using deep recurrent neural networks. Neurocomputing 292:104–110

  19. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958

  20. Ulaganathan AS, Ramanna S (2019) Granular methods in automatic music genre classification: a case study. J Intell Inf Syst 52(1):85–105

  21. van den Oord A, Kalchbrenner N, Vinyals O, Espeholt L, Graves A, Kavukcuoglu K (2016) Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems 29 (NIPS 2016)

  22. Zen H, Agiomyrgiannakis Y, Egberts N, Henderson F, Szczepaniak P (2016) Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In: 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), pp 2273–2277


Acknowledgments

This work is supported by the Research Fund for International Young Scientists of the National Natural Science Foundation of China (NSFC Grant No. 61550110248), the Sichuan Science and Technology Program (Grant No. 2019YFG0190), and Research on Sino-Tibetan multi-source information acquisition, fusion, data mining and its application (Grant No. H04W170186).

Author information


Corresponding author

Correspondence to Min-hui Qi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yu, Yb., Qi, Mh., Tang, Yf. et al. A sample-level DCNN for music auto-tagging. Multimed Tools Appl 80, 11459–11469 (2021). https://doi.org/10.1007/s11042-020-10330-9


