Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination

Tsipas, Nikolaos; Vrysis, Lazaros; Dimoulas, Charalampos; Papanikolaou, George

doi:10.1007/s11042-016-4315-0

Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination

Published: 10 January 2017

Volume 76, pages 25603–25621, (2017)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Nikolaos Tsipas ORCID: orcid.org/0000-0001-7232-8839¹,
Lazaros Vrysis¹,
Charalampos Dimoulas¹ &
…
George Papanikolaou¹

648 Accesses
Explore all metrics

Abstract

In this paper, an audio-driven algorithm for the detection of speech and music events in multimedia content is introduced. The proposed approach is based on the hypothesis that short-time frame-level discrimination performance can be enhanced by identifying transition points between longer, semantically homogeneous segments of audio. In this context, a two-step segmentation approach is employed in order to initially identify transition points between the homogeneous regions and subsequently classify the derived segments using a supervised binary classifier. The transition point detection mechanism is based on the analysis and composition of multiple self-similarity matrices, generated using different audio feature sets. The implemented technique aims at discriminating events focusing on transition point detection with high temporal resolution, a target that is also reflected in the adopted assessment methodology. Thereafter, multimedia indexing can be efficiently deployed (for both audio and video sequences), incorporating the processes of high resolution temporal segmentation and semantic annotation extraction. The system is evaluated against three publicly available datasets and experimental results are presented in comparison with existing implementations. The proposed algorithm is provided as an open source software package in order to support reproducible research and encourage collaboration in the field.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Relevance-based quantization of scattering features for unsupervised mining of environmental audio

Article Open access 29 September 2018

Automatic environmental sound concepts discovery for video retrieval

Article 14 March 2016

Audio Fingerprinting System to Detect and Match Audio Recordings

Notes

References

Ajmera J, McCowan IA, Bourlard H (2001) Speech/music discrimination using entropy and dynamism features in a hmm classification framewor. Tech. rep., IDIAP
Carey MJ, Parris ES, Lloyd-Thomas H (1999) A comparison of features for speech, music discrimination. In: IEEE international conference on acoustics, speech, and signal processing, 1999. Proceedings. 1999, vol 1. IEEE, pp 149–152
Continuous frequency activation yaafe plugin repository. https://github.com/mcrg-fhstp/cba-yaafe-extension. [Online; accessed 30-July-2016]
Dimoulas CA, Symeonidis AL (2015) Syncing shared multimedia through audiovisual bimodal segmentation. IEEE MultiMedia 22(3):26–42
Article Google Scholar
El-Maleh K, Klein M, Petrucci G, Kabal P (2000) Speech/music discrimination for multimedia applications. In: IEEE international conference on acoustics, speech, and signal processing, 2000. ICASSP’00. Proceedings. 2000, vol 6. IEEE, pp 2445–2448
Elizalde B, Friedland G (2013) Lost in segmentation: three approaches for speech/non-speech detection in consumer-produced videos. In: 2013 IEEE international conference on multimedia and expo (ICME). IEEE, pp 1–6
Foote J (1999) Visualizing music and audio using self-similarity. In: Proceedings of the seventh ACM international conference on multimedia (part 1). ACM, pp 77–80
Foote J (2000) Automatic audio segmentation using a measure of audio novelty. In: IEEE international conference on multimedia and expo, 2000. ICME 2000. 2000, vol 1. IEEE, pp 452–455
Gtzan music speech dataset. http://marsyasweb.appspot.com/download/data_sets/. [Online; accessed 30-July-2016]
Jiang H, Lin T, Zhang H (2000) Video segmentation with the support of audio segmentation and classification. In: Proceedings of IEEE ICME
Jun S, Rho S, Hwang E (2015) Music structure analysis using self-similarity matrix and two-stage categorization. Multimedia Tools and Applications 74(1):287–302
Article Google Scholar
Khonglah BK, Prasanna SM (2016) Speech/music classification using speech-specific features. Digital Signal Process 48:71–83
Article MathSciNet Google Scholar
Kinnunen T, Chernenko E, Tuononen M, Fränti P, Li H (2007) Voice activity detection using mfcc features and support vector machine. In: International conference on speech and computer (SPECOM07), Moscow, Russia, vol 2, pp 556–561
Kotsakis R, Kalliris G, Dimoulas C (2012) Investigation of broadcast-audio semantic analysis scenarios employing radio-programme-adaptive pattern classification. Speech Comm 54(6):743–762
Article Google Scholar
Labrosa music-speech corpus. http://labrosa.ee.columbia.edu/sounds/musp/scheislan.html. [Online; accessed 30-July-2016]
Lavner Y, Ruinskiy D (2009) A decision-tree-based algorithm for speech/music classification and segmentation. EURASIP Journal on Audio, Speech, and Music Processing 2009(1):1
Article Google Scholar
Lim C, Chang JH (2015) Efficient implementation techniques of an svm-based speech/music classifier in smv. Multimedia Tools and Applications 74(15):5375–5400
Article Google Scholar
Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) Yaafe, an easy to use and efficient audio feature extraction software. In: ISMIR, pp 441–446
Minami K, Akutsu A, Hamada H, Tonomura Y (1998) Video handling with music and speech detection. IEEE Multimedia 5(3):17–25
Article Google Scholar
Mirex 2015 muspeak sample dataset. http://www.music-ir.org/mirex/wiki/2015:music/speech_classification_and_detection#dataset_2. [Online; accessed 30-July-2016]
Miro XA, Bozonnet S, Evans N, Fredouille C, Friedland G, Vinyals O (2012) Speaker diarization: a review of recent research. IEEE Trans Audio Speech Lang Process 20(2):356–370
Article Google Scholar
Moattar M, Homayounpour M (2009) A simple but efficient real-time voice activity detection algorithm. In: Signal processing conference, 2009 17th European. IEEE, pp 2549–2553
Panagiotakis C, Tziritas G (2005) A speech/music discriminator based on rms and zero-crossings. IEEE Trans Multimedia 7(1):155–166
Article Google Scholar
Peakutils. http://pythonhosted.org/peakutils/. [Online; accessed 30-July-2016]
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al. (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12(Oct):2825–2830
MathSciNet MATH Google Scholar
Pikrakis A, Giannakopoulos T, Theodoridis S (2006) Speech/music discrimination for radio broadcasts using a hybrid hmm-bayesian network architecture. In: Signal processing conference, 2006 14th European. IEEE, pp 1–5
Pikrakis A, Giannakopoulos T, Theodoridis S (2008) A speech/music discriminator of radio recordings based on dynamic programming and bayesian networks. IEEE Trans Multimedia 10(5):846–857
Article Google Scholar
Pikrakis A, Theodoridis S (2014) Speech-music discrimination: a deep learning perspective. In: 2014 22nd European signal processing conference (EUSIPCO). IEEE, pp 616–620
Ramalingam T, Dhanalakshmi P (2014) Speech/music classification using wavelet based feature extraction techniques. J Comput Sci 10(1):34–44
Article Google Scholar
Sang-Kyun K, Chang JH (2009) Speech/music classification enhancement for 3gpp2 smv codec based on support vector machine. IEICE Trans Fundam Electron Commun Comput Sci 92(2):630–632
Google Scholar
Saunders J (1996) Real-time discrimination of broadcast speech/music. In: ICASSP, vol 96, pp 993–996
Scheirer E, Slaney M (1997) Construction and evaluation of a robust multifeature speech/music discriminator. In: IEEE international conference on acoustics, speech, and signal processing, 1997. ICASSP-97., 1997, vol 2. IEEE, pp 1331–1334
Sell G, Clark P (2014) Music tonality features for speech/music discrimination. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2489– 2493
Seyerlehner K, Pohle T, Schedl M, Widmer G (2007) Automatic music detection in television productions. In: Proceedings of the 10th international conference on digital audio effects (DAFx’07). Citeseer
Shirazi J, Ghaemmaghami S (2010) Improvement to speech-music discrimination using sinusoidal model based features. Multimedia Tools and Applications 50(2):415–435
Article Google Scholar
Speech - music discrimination demo version 1.0. http://cgi.di.uoa.gr/~sp_mu/download.html. [Online; accessed 30-July-2016]
Tsipas N, Vrysis L, Dimoulas C, Papanikolaou G (2015) Mirex 2015: Methods for speech/music detection and classification. In: Proceedings of the music information retrieval evaluation eXchange (MIREX). Malaga
Google Scholar
Tsipas N, Vrysis L, Dimoulas CA, Papanikolaou G (2015) Content-based music structure analysis using vector quantization. In: Audio engineering society convention 138. Audio engineering society
Tsipas N, Zapartas P, Vrysis L, Dimoulas C (2015) Augmenting social multimedia semantic interaction through audio-enhanced web-tv services. In: Proceedings of the audio mostly 2015 on interaction with sound. ACM, p 34
Wang W, Gao W, Ying D (2003) A fast and robust speech/music discrimination approach. In: Proceedings of the 2003 joint conference of the fourth international conference on information, communications and signal processing, 2003 and fourth pacific rim conference on multimedia, vol 3. IEEE, pp 1325–1329
Wieser E, Husinsky M, Seidl M (2014) Speech/music discrimination in a large database of radio broadcasts from the wild. In: 2014 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2134–2138
Zhang T, Kuo CCJ (2001) Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing 9 (4):441–457
Article Google Scholar

Download references

Author information

Authors and Affiliations

Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
Nikolaos Tsipas, Lazaros Vrysis, Charalampos Dimoulas & George Papanikolaou

Authors

Nikolaos Tsipas
View author publications
You can also search for this author inPubMed Google Scholar
Lazaros Vrysis
View author publications
You can also search for this author inPubMed Google Scholar
Charalampos Dimoulas
View author publications
You can also search for this author inPubMed Google Scholar
George Papanikolaou
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Nikolaos Tsipas.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tsipas, N., Vrysis, L., Dimoulas, C. et al. Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination. Multimed Tools Appl 76, 25603–25621 (2017). https://doi.org/10.1007/s11042-016-4315-0

Download citation

Received: 31 July 2016
Revised: 13 December 2016
Accepted: 26 December 2016
Published: 10 January 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s11042-016-4315-0

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Relevance-based quantization of scattering features for unsupervised mining of environmental audio

Automatic environmental sound concepts discovery for video retrieval

Audio Fingerprinting System to Detect and Match Audio Recordings

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now