Abstract
Music classification, which underpins music retrieval and recommendation, has made great progress with the development of Convolutional Neural Networks (CNNs). However, CNNs cannot capture temporal information from music audio, which restricts the predictive performance of the model. To address this issue, we propose a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model that learns local spatial features with the CNN and temporal dependencies with the LSTM. In addition, the traditional softmax loss function commonly lacks sufficient discriminative power for music classification. We therefore propose an additive angular margin and cosine margin softmax (AACM-Softmax) loss function, which simultaneously minimizes intra-class variance and maximizes inter-class variance by enforcing combined margin penalties. Finally, we combine the CNN-LSTM model with the AACM-Softmax loss to further improve classification performance by learning discriminative features that also encode temporal dependencies. Extensive experiments on music genre and music emotion datasets show that the proposed model consistently outperforms competing models.
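The combined margin idea can be illustrated with a minimal, framework-free sketch. For the target class y, the cosine logit cos θ_y is replaced by cos(θ_y + m1) − m2 and scaled by s before cross-entropy, where m1 is the additive angular margin and m2 the additive cosine margin. The values of s, m1, and m2 below are illustrative assumptions, not the paper's tuned hyperparameters:

```python
import math

def aacm_softmax_loss(cosines, target, s=30.0, m1=0.5, m2=0.35):
    """Cross-entropy with a combined additive angular margin (m1)
    and additive cosine margin (m2) applied to the target logit.

    cosines: cos(theta_j) between the (L2-normalized) embedding and
             each L2-normalized class-weight vector.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for safety
            logits.append(s * (math.cos(theta + m1) - m2))
        else:
            logits.append(s * c)
    # Numerically stable log-sum-exp cross-entropy.
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[target]

# The margins shrink the target logit, so the loss is strictly larger
# than plain softmax cross-entropy on the same cosines, forcing the
# network to pull embeddings closer to their class centers.
plain = aacm_softmax_loss([0.8, 0.3, 0.1], target=0, m1=0.0, m2=0.0)
margined = aacm_softmax_loss([0.8, 0.3, 0.1], target=0)
```

Setting m1 = 0 recovers a CosFace-style cosine margin and m2 = 0 an ArcFace-style angular margin; the combined penalty is what the abstract refers to as AACM-Softmax.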
Data availability
The datasets analyzed in this study are publicly available in public repositories.
Acknowledgements
This work is supported by the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No. KJ2020A0035 and No. KJ2021A0640, and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Li, J., Han, L., Wang, X. et al. A hybrid neural network model based on optimized margin softmax loss function for music classification. Multimed Tools Appl 83, 43871–43906 (2024). https://doi.org/10.1007/s11042-023-17056-4