
A hybrid neural network model based on optimized margin softmax loss function for music classification

Published in Multimedia Tools and Applications.

Abstract

Music classification, which underpins music retrieval and recommendation, has advanced considerably with the development of Convolutional Neural Networks (CNNs). However, CNNs cannot capture the temporal information in music audio, which limits the prediction performance of such models. To address this issue, we propose a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model in which the CNN learns local spatial features and the LSTM learns temporal dependencies. In addition, the traditional softmax loss function commonly lacks sufficient discriminative power for music classification. We therefore propose an additive angular margin and cosine margin softmax (AACM-Softmax) loss function that enforces combined margin penalties to simultaneously minimize intra-class variance and maximize inter-class variance. Finally, we combine the CNN-LSTM model with the AACM-Softmax loss function to comprehensively improve classification performance by learning discriminative features that incorporate temporal dependencies. Extensive experiments on music genre datasets and music emotion datasets show that the proposed model consistently outperforms other models.
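The combined-margin idea behind AACM-Softmax can be sketched in plain Python. In the sketch below, `cosines` holds the cosine similarities between an L2-normalized feature and each normalized class-weight vector; the names `s`, `m1`, `m2` and their values are illustrative assumptions, not the paper's exact notation or tuned settings.

```python
import math

def aacm_softmax_probs(cosines, target, s=30.0, m1=0.2, m2=0.1):
    """Softmax probabilities with a combined margin on the target class:
    an additive angular margin m1 (applied to the angle theta) and a
    cosine margin m2 (subtracted from the cosine), scaled by s."""
    logits = []
    for j, cos_t in enumerate(cosines):
        if j == target:
            # clamp to acos's valid domain before recovering the angle
            theta = math.acos(max(-1.0, min(1.0, cos_t)))
            cos_t = math.cos(theta + m1) - m2  # combined margin penalty
        logits.append(s * cos_t)
    mx = max(logits)                            # stabilize the softmax
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aacm_loss(cosines, target, s=30.0, m1=0.2, m2=0.1):
    """Cross-entropy loss on the margin-modified probabilities."""
    return -math.log(aacm_softmax_probs(cosines, target, s, m1, m2)[target])
```

Because the margins shrink only the target-class logit, the loss stays high until the feature sits well inside its class region, which is what pulls intra-class variance down and pushes inter-class variance up during training.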


Data availability

The datasets analyzed in this study are publicly available from their respective public repositories.

Notes

  1. https://www.kaggle.com/makvel/mer500.


Acknowledgements

This work is supported by the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No. KJ2020A0035 and No. KJ2021A0640, and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

Author information


Corresponding authors

Correspondence to Jingxian Li or Lixin Han.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, J., Han, L., Wang, X. et al. A hybrid neural network model based on optimized margin softmax loss function for music classification. Multimed Tools Appl 83, 43871–43906 (2024). https://doi.org/10.1007/s11042-023-17056-4

