Detecting individual role using features extracted from speaker diarization results

Multimedia Tools and Applications

Abstract

In automatic content-based indexing and structuring of audiovisual documents, finding events such as interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and high-level event detection. We consider that detecting speaker roles such as Anchor, Journalist and Other is a first step towards enriching interaction sequences between speakers. Our work relies on the assumption that clues about speaker roles exist in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Unlike most state-of-the-art approaches, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and of different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are carried out on the 13-hour corpus of the ESTER2 evaluation campaign. The best result reaches about 82% of correctly recognized roles, which corresponds to more than 89% of speech duration correctly labelled.
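The abstract reports two different scores: the share of speakers whose role is correctly recognized, and the share of speech duration that is correctly labelled. The sketch below illustrates how the two can differ on the same predictions. It is a minimal illustration, not the authors' evaluation code: the `Speaker` record, the function names and the toy durations are all hypothetical, and the normal-approximation confidence interval is an assumption, since this page does not say how the interval was computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Speaker:
    """One diarization cluster: all segments attributed to one speaker (hypothetical record)."""
    true_role: str        # reference label: "Anchor", "Journalist" or "Other"
    predicted_role: str   # label produced by some classifier
    duration_s: float     # total speech time of this speaker, in seconds


def role_accuracy(speakers):
    """Fraction of speakers whose role is correctly recognized."""
    correct = sum(1 for s in speakers if s.predicted_role == s.true_role)
    return correct / len(speakers)


def duration_accuracy(speakers):
    """Fraction of total speech time belonging to correctly labelled speakers."""
    total = sum(s.duration_s for s in speakers)
    correct = sum(s.duration_s for s in speakers if s.predicted_role == s.true_role)
    return correct / total


def confidence_interval_95(accuracy, n):
    """95% interval for a proportion, using the normal approximation (an assumption)."""
    half_width = 1.96 * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Made-up toy data, not taken from the ESTER2 corpus.
speakers = [
    Speaker("Anchor", "Anchor", 600.0),
    Speaker("Journalist", "Journalist", 200.0),
    Speaker("Journalist", "Other", 50.0),
    Speaker("Other", "Other", 100.0),
    Speaker("Other", "Journalist", 50.0),
]

sigma = role_accuracy(speakers)      # 3 of 5 speakers are correct
tau = duration_accuracy(speakers)    # 900 s of 1000 s are correct
low, high = confidence_interval_95(sigma, len(speakers))
print(f"role accuracy: {sigma:.2f}, duration accuracy: {tau:.2f}")
```

In this toy data the two misclassified speakers are the ones who speak least, so the duration-based score (0.90) is higher than the speaker-based score (0.60). The same mechanism could explain a gap such as the one between the 82% and 89% figures quoted above, although this page alone does not confirm that.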

Notes

  1. http://epac.univ-lemans.fr/

Acknowledgement

This work is conducted within the EPAC Project—ANR-06-CIS6-MDCA-006.

Author information

Corresponding author

Correspondence to Benjamin Bigot.

Appendix

Table 20 The temporal feature set definitions for a speaker segment cluster
Table 21 Speaker role classification using (a) the temporal feature subset, (b) the signal feature subset, (c) the prosodic feature subset and (d) the prosodic and temporal feature subsets with the automatic speaker segmentation: speaker role accuracy σ_a, a 95% confidence interval and the speech duration accuracy τ_a

Cite this article

Bigot, B., Ferrané, I., Pinquier, J. et al. Detecting individual role using features extracted from speaker diarization results. Multimed Tools Appl 60, 347–369 (2012). https://doi.org/10.1007/s11042-010-0609-9
