Abstract
In the field of automatic audiovisual content-based indexing and structuring, finding events such as interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and high-level event detection. In our work, we consider detecting speaker roles such as Anchor, Journalist and Other to be a first step towards enriching interaction sequences between speakers. Our work relies on the assumption that clues about speaker roles exist in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Contrary to most state-of-the-art approaches, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and of different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are carried out on the 13-h corpus of the ESTER2 evaluation campaign. The best result reaches about 82% of correctly recognized roles, which corresponds to more than 89% of speech duration correctly labelled.
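The abstract reports two different scores: the share of speakers whose role is recognized, and the share of speech duration that ends up correctly labelled. The following is a minimal sketch of how these two measures can differ, assuming each speaker is summarized by a reference role, a predicted role and a total speaking time. The record layout, the function names and the toy values are illustrative assumptions, not the authors' scoring procedure or ESTER2 data.

```python
from typing import List, Tuple

# Hypothetical record: (reference role, predicted role, speech duration in seconds).
Speaker = Tuple[str, str, float]


def role_accuracy(speakers: List[Speaker]) -> float:
    """Fraction of speakers whose predicted role matches the reference role."""
    if not speakers:
        raise ValueError("at least one speaker is required")
    correct = sum(1 for ref, hyp, _ in speakers if ref == hyp)
    return correct / len(speakers)


def duration_accuracy(speakers: List[Speaker]) -> float:
    """Fraction of total speech duration belonging to correctly labelled speakers."""
    total = sum(duration for _, _, duration in speakers)
    if total <= 0:
        raise ValueError("total speech duration must be positive")
    correct = sum(duration for ref, hyp, duration in speakers if ref == hyp)
    return correct / total


# Toy data invented for illustration only.
toy_speakers: List[Speaker] = [
    ("Anchor", "Anchor", 600.0),
    ("Journalist", "Journalist", 200.0),
    ("Journalist", "Other", 50.0),
    ("Other", "Other", 100.0),
    ("Other", "Journalist", 50.0),
]

if __name__ == "__main__":
    print(f"roles correctly recognized: {role_accuracy(toy_speakers):.1%}")
    print(f"speech duration correctly labelled: {duration_accuracy(toy_speakers):.1%}")
```

In this toy set, 3 of the 5 speakers are correct, yet they account for 900 of the 1000 seconds of speech. The duration-weighted score is therefore higher than the per-speaker score whenever the errors fall on speakers who talk little, which is one plausible reading of the gap between the two figures reported above.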



Acknowledgement
This work is conducted within the EPAC Project—ANR-06-CIS6-MDCA-006.
Cite this article
Bigot, B., Ferrané, I., Pinquier, J. et al. Detecting individual role using features extracted from speaker diarization results. Multimed Tools Appl 60, 347–369 (2012). https://doi.org/10.1007/s11042-010-0609-9
DOI: https://doi.org/10.1007/s11042-010-0609-9