Abstract
Automatically assigning a group of appropriate semantic tags to a music piece provides an effective way for people to efficiently utilize the massive and ever-increasing body of online and offline music data. In this paper, we propose a novel end-to-end deep neural network model for automatic music annotation, which effectively integrates multiple complementary music representations and jointly accomplishes music representation learning, structure modeling, and tag prediction. The model first hierarchically leverages attentive convolutional networks and recurrent networks to learn informative descriptions from the Mel-spectrogram and the raw waveform of the music and to depict the time-varying structures embedded in the description sequence. A dual-state LSTM network is then employed to capture the correlations between the two representation channels as supplementary music descriptions. Finally, the model aggregates the music description sequence into a holistic embedding with a self-attentive multi-weighting mechanism, which adaptively captures multi-aspect summarized information of the music for tag prediction. Experiments on the public MagnaTagATune benchmark dataset show that the proposed model outperforms state-of-the-art methods for automatic music annotation.
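The self-attentive multi-weighting aggregation described above builds on the structured self-attention of Lin et al. (2017): a sequence of per-frame descriptions is summarized by several attention heads, each producing its own weighted average over time. The following is a minimal NumPy sketch of that pooling step, not the paper's exact implementation; the matrix names (`W1`, `W2`) and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pooling(H, W1, W2):
    """Aggregate a description sequence H (T x d) into r weighted summaries.

    A = softmax(W2 @ tanh(W1 @ H.T))  -> (r x T) attention weights per head
    M = A @ H                         -> (r x d) multi-aspect embedding
    """
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)  # each row sums to 1 over time
    M = A @ H
    return M, A

# toy example: T=8 time steps, d=16 description features,
# da=10 attention hidden units, r=4 attention heads
rng = np.random.default_rng(0)
T, d, da, r = 8, 16, 10, 4
H = rng.standard_normal((T, d))
W1 = rng.standard_normal((da, d)) * 0.1
W2 = rng.standard_normal((r, da)) * 0.1
M, A = self_attentive_pooling(H, W1, W2)
print(M.shape, A.shape)  # (4, 16) (4, 8)
```

Each row of `A` is one "weighting" over the time axis, so the `r` rows of `M` capture different summarized aspects of the piece; flattening `M` (or feeding it to a classifier layer) yields the holistic embedding used for tag prediction.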
Acknowledgements
The research was supported by the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20171345 and the National Natural Science Foundation of China under Grant Nos. 61003113, 61321491, and 61672273.
Cite this article
Wang, Q., Su, F. & Wang, Y. Hierarchical attentive deep neural networks for semantic music annotation through multiple music representations. Int J Multimed Info Retr 9, 3–16 (2020). https://doi.org/10.1007/s13735-019-00186-7