Speech/music classification using visual and spectral chromagram features

Birajdar, Gajanan K.; Patil, Mukesh D.

doi:10.1007/s12652-019-01303-4

Original Research
Published: 25 April 2019

Volume 11, pages 329–347, (2020)
Journal of Ambient Intelligence and Humanized Computing

Abstract

Automatic speech/music classification is an important tool in multimedia content analysis and retrieval: it efficiently categorizes input audio and stores it in the relevant classes. This article proposes the use of chromagram textural and spectral features for speech and music classification. The chromagram textural feature set is obtained by transforming the input audio into a chromagram image representation and then extracting uniform local binary pattern textural descriptors. The chroma spectral features involve novel chroma bin features that exploit the tonality present in music signals. An optimal feature subset is selected from the original feature set using eigenvector centrality based feature selection, which removes redundant and irrelevant features and further enhances prediction performance. The performance of the algorithm is evaluated on the S&S, GTZAN and MUSAN databases, demonstrating the advantage and suitability of both chroma spectral and visual features for the classification task. Extensive experiments with a support vector machine classifier show that the chromagram textural descriptors outperform other state-of-the-art approaches. Good results are also achieved under mismatched training and testing conditions.
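
To make the textural step more concrete, the sketch below computes a local binary pattern histogram over a small 2-D matrix standing in for a chromagram image. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the LBP parameters, so the code assumes 8 neighbours at radius 1 and the rotation-invariant uniform labelling (10 bins), and the function names and the toy matrix are hypothetical.

```python
from typing import List, Sequence

# Offsets of the 8 neighbours at radius 1, listed in circular order.
# Assumption: P = 8, R = 1; the abstract does not state the values used.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def transitions(bits: Sequence[int]) -> int:
    """Count 0/1 changes around the circular bit pattern."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n))


def uniform_lbp_histogram(image: Sequence[Sequence[float]]) -> List[float]:
    """Normalised 10-bin histogram of rotation-invariant uniform LBP labels.

    A pattern is "uniform" when it has at most two 0/1 transitions.
    Uniform patterns are labelled by their number of set bits (0..8);
    every non-uniform pattern shares the single label 9.
    Border pixels are skipped because they lack a full neighbourhood.
    """
    rows, cols = len(image), len(image[0])
    hist = [0] * 10
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            bits = [1 if image[r + dr][c + dc] >= centre else 0
                    for dr, dc in NEIGHBOURS]
            label = sum(bits) if transitions(bits) <= 2 else 9
            hist[label] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 10


# Hypothetical toy "chromagram": 4 pitch-class rows by 5 time frames.
toy_chromagram = [
    [0.1, 0.2, 0.1, 0.0, 0.3],
    [0.9, 0.8, 0.7, 0.9, 0.8],
    [0.2, 0.1, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.0, 0.1, 0.0],
]
print(uniform_lbp_histogram(toy_chromagram))
```

In a full pipeline the resulting histogram would serve as the feature vector passed on to feature selection and the classifier. The 10-bin labelling is chosen here only for brevity; a non-rotation-invariant uniform labelling with 59 bins is an equally plausible reading of the abstract.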


Acknowledgements

The authors would like to thank Eric Scheirer and Malcolm Slaney for making their speech/music database available to us. We also thank David Snyder, Guoguo Chen and Daniel Povey for providing the MUSAN corpus. We are also thankful to the anonymous reviewers for their insightful and constructive comments.

Author information

Authors and Affiliations

Department of Electronics Engineering, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, Maharashtra, 400706, India
Gajanan K. Birajdar
Department of Electronics and Telecommunication Engineering, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, Maharashtra, 400706, India
Mukesh D. Patil

Corresponding author

Correspondence to Gajanan K. Birajdar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Birajdar, G.K., Patil, M.D. Speech/music classification using visual and spectral chromagram features. J Ambient Intell Human Comput 11, 329–347 (2020). https://doi.org/10.1007/s12652-019-01303-4

Received: 15 May 2018
Accepted: 19 April 2019
Published: 25 April 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s12652-019-01303-4
