Abstract
State-of-the-art audio segmentation strategies obtain good results on simple tasks, but their performance degrades in real-world scenarios such as radio and television programmes; this issue can be partially addressed by fusing different audio segmentation strategies. Hence, this paper presents a framework for decision-level fusion in the audio segmentation task. First, the class-conditional probabilities of each audio segmentation strategy are estimated from a confusion matrix obtained by performing audio segmentation on a training dataset. Performance measures are extracted from these class-conditional probabilities and used to compute different estimates of each classifier’s reliability; specifically, reliability estimates based on precision, recall, accuracy, F-score and mutual information are proposed. These reliability estimates are used as weights in a weighted majority voting fusion strategy. The validity of the proposed fusion scheme and reliability estimates was assessed in the framework of the Albayzin 2010, 2012 and 2014 audio segmentation evaluations, which consisted of segmenting collections of radio and television programmes. The experimental results showed that this simple fusion strategy improves on the performance achieved by the individual audio segmentation strategies and by other well-known decision-level fusion strategies.
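The pipeline summarised above (confusion matrix, reliability estimate, weighted vote) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the code assumes the standard textbook definitions of precision, recall and F-score, and it assumes that each classifier's vote is weighted by its reliability for the class it outputs. The function names, the two-class setup and the confusion-matrix counts are all hypothetical.

```python
from collections import defaultdict

# Convention assumed here: confusion[t][p] counts training frames whose
# true class is t and which the classifier labelled as p.

def precision(confusion):
    """Per-class precision: correct decisions for c / all decisions for c."""
    n = len(confusion)
    out = []
    for c in range(n):
        predicted_c = sum(confusion[t][c] for t in range(n))
        out.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
    return out

def recall(confusion):
    """Per-class recall: correct decisions for c / all frames of class c."""
    out = []
    for c, row in enumerate(confusion):
        total = sum(row)
        out.append(row[c] / total if total else 0.0)
    return out

def f_score(confusion):
    """Per-class harmonic mean of precision and recall."""
    return [2 * p * r / (p + r) if p + r else 0.0
            for p, r in zip(precision(confusion), recall(confusion))]

def weighted_majority_vote(decisions, weights):
    """Fuse one decision per classifier.

    decisions[k] is the class chosen by classifier k; weights[k][c] is the
    (assumed) reliability of classifier k when it outputs class c.
    """
    scores = defaultdict(float)
    for k, label in enumerate(decisions):
        scores[label] += weights[k][label]
    return max(scores, key=scores.get)

# Hypothetical example: two classifiers, two classes.
SPEECH, MUSIC = 0, 1
conf_a = [[95, 5], [40, 60]]   # made-up training counts for classifier A
conf_b = [[70, 30], [10, 90]]  # made-up training counts for classifier B

weights = [precision(conf_a), precision(conf_b)]
# A says MUSIC, B says SPEECH; the more reliable decision wins the vote.
fused = weighted_majority_vote([MUSIC, SPEECH], weights)
```

Swapping `precision` for `recall` or `f_score` when building `weights` changes the reliability estimate without touching the voting step, which is the separation the abstract describes. The accuracy and mutual-information estimates are omitted because their exact form is not given here.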







Acknowledgments
This work has been supported by the European Regional Development Fund, the Galician Regional Government (GRC2014/024, ‘Consolidation of Research Units: AtlantTIC Project’ CN2012/160) and the Spanish Government (‘SpeechTech4All Project’ TEC2012-38939-C03-01).
Cite this article
Lopez-Otero, P., Docio-Fernandez, L. & Garcia-Mateo, C. Ensemble audio segmentation for radio and television programmes. Multimed Tools Appl 76, 7421–7444 (2017). https://doi.org/10.1007/s11042-016-3386-2