
Hierarchical attentive deep neural networks for semantic music annotation through multiple music representations

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Automatically assigning a set of appropriate semantic tags to a music piece offers an effective way for people to efficiently navigate the massive and ever-increasing volume of online and offline music data. In this paper, we propose a novel end-to-end deep neural network model for automatic music annotation that integrates multiple complementary music representations and jointly accomplishes music representation learning, structure modeling, and tag prediction. The model first hierarchically employs attentive convolutional networks and recurrent networks to learn informative descriptions from the Mel-spectrogram and the raw waveform of the music and to capture the time-varying structures embedded in the resulting description sequences. A dual-state LSTM network then captures the correlations between the two representation channels as supplementary music descriptions. Finally, the model aggregates the music description sequence into a holistic embedding with a self-attentive multi-weighting mechanism, which adaptively extracts multi-aspect summarized information of the music for tag prediction. Experiments on the public MagnaTagATune benchmark music dataset show that the proposed model outperforms state-of-the-art methods for automatic music annotation.
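The abstract compresses three design choices: per-channel encoders over the Mel-spectrogram and the raw waveform, a recurrent stage linking the two channels, and a self-attentive multi-weighting aggregation that produces several differently weighted summaries of the description sequence. To make that pipeline concrete, the following is a minimal PyTorch sketch, not the paper's implementation: all layer sizes, the number of attention weightings (r = 4), the bidirectional GRU standing in for the paper's dual-state LSTM, and the truncation-based alignment of the two channels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """Multi-weighting self-attention: summarizes a description sequence
    H (batch, T, d) into r differently weighted views and concatenates
    them into one holistic embedding.  d, da, and r are illustrative
    choices, not the paper's reported values."""

    def __init__(self, d=256, da=128, r=4):
        super().__init__()
        self.w1 = nn.Linear(d, da, bias=False)
        self.w2 = nn.Linear(da, r, bias=False)

    def forward(self, h):                        # h: (batch, T, d)
        # A = softmax(W2 tanh(W1 H)); the softmax over time yields r
        # attention distributions, one summary vector per distribution.
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (batch, T, r)
        m = torch.einsum('btr,btd->brd', a, h)   # r weighted summaries
        return m.flatten(1)                      # (batch, r * d)


class HierarchicalAnnotator(nn.Module):
    """Two-channel sketch: small CNN front ends for the Mel-spectrogram
    and the raw waveform, a bidirectional GRU standing in for the
    paper's dual-state LSTM (an assumption), self-attentive pooling,
    and sigmoid outputs for multi-label tag prediction."""

    def __init__(self, n_mels=96, n_tags=50, d=256, r=4):
        super().__init__()
        self.mel_cnn = nn.Sequential(            # input: (batch, 1, n_mels, T)
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((n_mels, 4)))           # collapse the frequency axis
        self.wav_cnn = nn.Sequential(            # input: (batch, 1, samples)
            nn.Conv1d(1, 64, 9, stride=4, padding=4), nn.BatchNorm1d(64),
            nn.ReLU(), nn.MaxPool1d(4))
        self.rnn = nn.GRU(128, d // 2, batch_first=True, bidirectional=True)
        self.pool = SelfAttentivePooling(d=d, r=r)
        self.out = nn.Linear(r * d, n_tags)

    def forward(self, mel, wav):
        xm = self.mel_cnn(mel).squeeze(2).transpose(1, 2)  # (batch, Tm, 64)
        xw = self.wav_cnn(wav).transpose(1, 2)             # (batch, Tw, 64)
        # Crude time alignment of the two channels by truncation; the
        # paper's cross-channel modeling is richer than this.
        t = min(xm.size(1), xw.size(1))
        h, _ = self.rnn(torch.cat([xm[:, :t], xw[:, :t]], dim=-1))
        return torch.sigmoid(self.out(self.pool(h)))       # per-tag probabilities
```

The sigmoid outputs reflect the multi-label nature of music tagging; n_tags = 50 matches the common MagnaTagATune evaluation setup, which typically predicts the 50 most frequent tags and ranks clips per tag (e.g., by AUC).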



Acknowledgements

The research was supported by the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20171345 and the National Natural Science Foundation of China under Grant Nos. 61003113, 61321491, and 61672273.

Author information


Corresponding author

Correspondence to Feng Su.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Q., Su, F. & Wang, Y. Hierarchical attentive deep neural networks for semantic music annotation through multiple music representations. Int J Multimed Info Retr 9, 3–16 (2020). https://doi.org/10.1007/s13735-019-00186-7


