
Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination

  • Published in: Multimedia Tools and Applications

Abstract

In this paper, an audio-driven algorithm for detecting speech and music events in multimedia content is introduced. The approach rests on the hypothesis that short-time, frame-level discrimination performance can be improved by first identifying the transition points between longer, semantically homogeneous audio segments. A two-step segmentation scheme is therefore employed: transition points between homogeneous regions are detected first, and the resulting segments are then labelled by a supervised binary classifier. Transition point detection is based on the analysis and composition of multiple self-similarity matrices, each generated from a different audio feature set. The technique targets event discrimination with high temporal resolution at the transition points, a goal also reflected in the adopted assessment methodology. On this basis, multimedia indexing can be deployed efficiently (for both audio and video sequences), combining high-resolution temporal segmentation with semantic annotation extraction. The system is evaluated on three publicly available datasets, and experimental results are compared with existing implementations. The algorithm is released as an open-source software package to support reproducible research and encourage collaboration in the field.
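The first step of the pipeline described above (detecting transition points as novelty peaks on a self-similarity matrix) can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation: it uses synthetic two-dimensional features in place of real audio descriptors, a single cosine self-similarity matrix rather than the composition of several matrices from different feature sets, and a Foote-style checkerboard kernel for novelty scoring. All function names are hypothetical.

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine similarity between every pair of feature frames.
    features: array of shape (n_frames, n_dims)."""
    normed = features / np.maximum(
        np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    return normed @ normed.T

def checkerboard_kernel(size):
    """Foote-style kernel: +1 on the within-segment quadrants,
    -1 on the cross-segment quadrants."""
    sign = np.ones(size)
    sign[: size // 2] = -1.0
    return np.outer(sign, sign)

def novelty_curve(ssm, kernel_size=32):
    """Slide the checkerboard kernel along the main diagonal of the SSM;
    high values indicate a boundary between two homogeneous regions."""
    kernel = checkerboard_kernel(kernel_size)
    half = kernel_size // 2
    n = ssm.shape[0]
    novelty = np.zeros(n)
    for i in range(half, n - half):
        novelty[i] = np.sum(ssm[i - half:i + half, i - half:i + half] * kernel)
    return novelty

def segment_boundaries(novelty, threshold):
    """Simple peak picking: local maxima above a fixed threshold."""
    return [i for i in range(1, len(novelty) - 1)
            if novelty[i] > threshold
            and novelty[i] >= novelty[i - 1]
            and novelty[i] > novelty[i + 1]]

# Synthetic demo: 100 frames of one homogeneous class followed by
# 100 frames of another, so the only true boundary is at frame 100.
features = np.vstack([np.tile([1.0, 0.0], (100, 1)),
                      np.tile([0.0, 1.0], (100, 1))])
ssm = self_similarity_matrix(features)
novelty = novelty_curve(ssm)
boundaries = segment_boundaries(novelty, threshold=0.5 * novelty.max())
# In the full pipeline, each segment between detected boundaries would
# then be passed to a supervised binary speech/music classifier.
```

In the paper's setting the second step would classify each derived segment with a supervised binary model (e.g. an SVM over pooled segment features); that step is omitted here since it is standard supervised classification.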


Notes

  1. https://github.com/nicktgr15/similarity-based-speech-music-discrimination/blob/master/datasets/featureplans/featureplan

  2. https://github.com/nicktgr15/similarity-based-speech-music-discrimination/tree/master/datasets

  3. https://github.com/nicktgr15/similarity-based-speech-music-discrimination


Author information

Correspondence to Nikolaos Tsipas.


Cite this article

Tsipas, N., Vrysis, L., Dimoulas, C. et al. Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination. Multimed Tools Appl 76, 25603–25621 (2017). https://doi.org/10.1007/s11042-016-4315-0

