Abstract
Automatically assigning a group of appropriate semantic tags to a music piece provides an effective way for people to efficiently utilize the massive and ever-increasing body of online and offline music data. In this paper, we propose a novel end-to-end deep neural network model for automatic music annotation, which effectively integrates multiple complementary music representations and jointly accomplishes music representation learning, structure modeling, and tag prediction. The model first hierarchically leverages attentive convolutional networks and recurrent networks to learn informative descriptions from the Mel-spectrogram and the raw waveform of the music and to depict the time-varying structures embedded in the description sequence. A dual-state LSTM network is then employed to capture the correlations between the two representation channels as supplementary music descriptions. Finally, the model aggregates the music description sequence into a holistic embedding with a self-attentive multi-weighting mechanism, which adaptively captures multi-aspect summarized information of the music for tag prediction. Experiments on the public MagnaTagATune benchmark dataset show that the proposed model outperforms state-of-the-art methods for automatic music annotation.
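The self-attentive multi-weighting aggregation described above builds on the structured self-attention of Lin et al. (2017): a sequence of per-frame descriptions is summarized by several attention heads, each producing its own weighted average over time. The following is a minimal NumPy sketch of that pooling step, not the paper's exact implementation; the matrix names (`W1`, `W2`) and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pooling(H, W1, W2):
    """Aggregate a description sequence H (T x d) into r weighted summaries.

    A = softmax(W2 @ tanh(W1 @ H.T))  -> (r x T) attention weights per head
    M = A @ H                         -> (r x d) multi-aspect embedding
    """
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)  # each row sums to 1 over time
    M = A @ H
    return M, A

# toy example: T=8 time steps, d=16 description features,
# da=10 attention hidden units, r=4 attention heads
rng = np.random.default_rng(0)
T, d, da, r = 8, 16, 10, 4
H = rng.standard_normal((T, d))
W1 = rng.standard_normal((da, d)) * 0.1
W2 = rng.standard_normal((r, da)) * 0.1
M, A = self_attentive_pooling(H, W1, W2)
print(M.shape, A.shape)  # (4, 16) (4, 8)
```

Each row of `A` is one "weighting" over the time axis, so the `r` rows of `M` capture different summarized aspects of the piece; flattening `M` (or feeding it to a classifier layer) yields the holistic embedding used for tag prediction.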
Acknowledgements
The research was supported by the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20171345 and the National Natural Science Foundation of China under Grant Nos. 61003113, 61321491, and 61672273.
Cite this article
Wang, Q., Su, F. & Wang, Y. Hierarchical attentive deep neural networks for semantic music annotation through multiple music representations. Int J Multimed Info Retr 9, 3–16 (2020). https://doi.org/10.1007/s13735-019-00186-7