Abstract
With the growth of the Internet, online music platforms and music streaming services are flourishing, and information overload caused by the abundance of digital music has become a common problem for many users. Social tags have been shown to be helpful for music recommendation. However, label sparsity and the cold-start problem, both commonly observed with social tags, limit their effectiveness in supporting recommender systems. Music autotagging therefore offers an alternative means of supplementing the shortage of tags. Most prior studies on automatic labeling analyzed audio data alone; however, several studies suggest that lyrics provide complementary information that improves classification accuracy. Audio data nevertheless remain an important resource for deriving music features, so this paper proposes a music autotagging system that relies on both audio and lyrics. Following recent advances in deep learning, in which neural networks have been used effectively to extract audio and textual features, and in which the structure of lyrics has been exploited to improve classification, this study employs two types of deep learning models for lyric feature extraction: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The feature extraction architecture is primarily motivated by the structure of lyrics. In addition, a multitask learning method is adopted to learn correlations between tags. Experiments show that a multitask learning classifier combining audio and lyric information outperforms a single-task classifier that, as in previous studies, uses only audio data.
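As a rough illustration of the multimodal, multitask design described above, the following is a minimal sketch. It assumes PyTorch, a CNN and a GRU over lyric word embeddings, a small dense branch over precomputed audio features, and jointly trained per-tag sigmoid outputs; the layer sizes, tag count, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

# Minimal sketch of a multimodal, multitask autotagger (illustrative only;
# sizes and fusion strategy are assumptions, not the authors' architecture).
import torch
import torch.nn as nn

class MultimodalAutotagger(nn.Module):
    def __init__(self, embed_dim=300, audio_dim=128, n_tags=50):
        super().__init__()
        # CNN branch: n-gram-like filters over the lyric embedding sequence.
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        # RNN branch: a GRU captures the sequential structure of lyrics.
        self.gru = nn.GRU(embed_dim, 64, batch_first=True)
        # Audio branch: a small MLP over precomputed audio features.
        self.audio_fc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        # One sigmoid output per tag, trained jointly (multitask learning),
        # so the shared trunk can capture correlations between tags.
        self.heads = nn.Linear(64 * 3, n_tags)

    def forward(self, lyrics, audio):
        # lyrics: (batch, seq_len, embed_dim); audio: (batch, audio_dim)
        c = torch.relu(self.conv(lyrics.transpose(1, 2))).max(dim=2).values
        _, h = self.gru(lyrics)                    # final hidden state
        fused = torch.cat([c, h.squeeze(0), self.audio_fc(audio)], dim=1)
        return torch.sigmoid(self.heads(fused))    # per-tag probabilities

model = MultimodalAutotagger()
probs = model(torch.randn(2, 120, 300), torch.randn(2, 128))
loss = nn.BCELoss()(probs, torch.randint(0, 2, (2, 50)).float())

Treating all tags as sigmoid outputs over a shared trunk is one common way to realize multitask tag learning; the paper's exact fusion method and loss may differ.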
Notes
enwiki-20190220-pages-articles: https://dumps.wikimedia.org/enwiki/20190220/
Acknowledgements
This research is based on work supported by the Taiwan Ministry of Science and Technology under Grants No. MOST 107-2410-H-006-040-MY3 and MOST 108-2511-H-006-009. We would like to thank the Center of Innovative Fintech Business Models for a research grant supporting this work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, HC., Syu, SW. & Wongchaisuwat, P. A method of music autotagging based on audio and lyrics. Multimed Tools Appl 80, 15511–15539 (2021). https://doi.org/10.1007/s11042-020-10381-y