Detecting individual role using features extracted from speaker diarization results

Bigot, Benjamin; Ferrané, Isabelle; Pinquier, Julien; André-Obrecht, Régine

doi:10.1007/s11042-010-0609-9

Published: 30 September 2010

Multimedia Tools and Applications, Volume 60, pages 347–369 (2012)

Abstract

In automatic content-based indexing and structuring of audiovisual documents, finding events such as interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and high-level event detection. We consider the detection of speaker roles (Anchor, Journalist and Other) to be a first step towards enriching interaction sequences between speakers. Our work relies on the assumption that clues about speaker roles exist in temporal, prosodic and basic signal features extracted from audio files and from speaker segmentations. Each speaker is therefore represented by a 36-feature vector. Unlike most state-of-the-art approaches, we do not use the structure of the document to recognize the roles of the speakers. We investigate the influence of two dimensionality reduction techniques (Principal Component Analysis and Linear Discriminant Analysis) and of different classification methods (Gaussian Mixture Models, K-nearest neighbours and Support Vector Machines). Experiments are carried out on the 13-hour corpus of the ESTER2 evaluation campaign. The best configuration correctly recognizes about 82% of speaker roles, which corresponds to more than 89% of speech duration being correctly labelled.
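As a rough illustration of the classification step described in the abstract, the sketch below assigns a role to a speaker from a 36-dimensional feature vector with a plain K-nearest-neighbours vote. It is only a minimal sketch: the feature values are toy numbers, the helper names (`toy_vector`, `knn_predict`) are hypothetical, and the Euclidean distance and `k=3` are assumptions, not the settings used by the authors. The dimensionality reduction step (PCA or LDA) is omitted.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Return the majority role among the k training speakers nearest to `query`.

    `train` is a list of (feature_vector, role) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(role for _, role in nearest)
    return votes.most_common(1)[0][0]

def toy_vector(level, n_features=36):
    """Hypothetical stand-in for a real 36-feature speaker vector (toy values)."""
    return [level + 0.01 * i for i in range(n_features)]

# Toy training set: two speakers per role, well separated on purpose.
train = [
    (toy_vector(0.0), "Anchor"), (toy_vector(0.1), "Anchor"),
    (toy_vector(1.0), "Journalist"), (toy_vector(1.1), "Journalist"),
    (toy_vector(2.0), "Other"), (toy_vector(2.1), "Other"),
]

predicted = knn_predict(train, toy_vector(1.04), k=3)
print(predicted)
```

With real data, each vector would hold the temporal, prosodic and signal features of one speaker, and the features would normally be normalized before distances are computed.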


Notes

http://epac.univ-lemans.fr/


Acknowledgement

This work is conducted within the EPAC Project—ANR-06-CIS6-MDCA-006.

Author information

Authors and Affiliations

IRIT—Université de Toulouse, 118, route de Narbonne, 31062, Cedex 09, France
Benjamin Bigot, Isabelle Ferrané, Julien Pinquier & Régine André-Obrecht

Corresponding author

Correspondence to Benjamin Bigot.

Appendix

Table 20 The temporal feature set definitions for a speaker segment cluster

Table 21 Speaker role classification using (a) the temporal feature subset, (b) the signal feature subset, (c) the prosodic feature subset and (d) the prosodic and temporal feature subsets with the automatic speaker segmentation: speaker role accuracy σ_a, a 95% confidence interval and the speech duration accuracy τ_a
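The three quantities named in this caption can be illustrated with a small sketch. The definitions used here are assumptions inferred from the abstract and the caption, since the formulas themselves are not visible in this preview: the role accuracy is taken as the fraction of speakers whose role is correctly predicted, the duration accuracy as the fraction of total speech time belonging to those speakers, and the 95% confidence interval as a normal-approximation binomial interval. The speaker records are toy values.

```python
import math

def role_accuracy(speakers):
    """Fraction of speakers whose predicted role matches the reference role."""
    correct = sum(1 for s in speakers if s["predicted"] == s["reference"])
    return correct / len(speakers)

def duration_accuracy(speakers):
    """Fraction of total speech duration attributed to correctly labelled speakers."""
    total = sum(s["duration"] for s in speakers)
    correct = sum(s["duration"] for s in speakers
                  if s["predicted"] == s["reference"])
    return correct / total

def confidence_interval_95(accuracy, n):
    """Normal-approximation 95% interval for a proportion (an assumption here)."""
    half_width = 1.96 * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Toy example: durations in seconds, one short-speaking journalist mislabelled.
speakers = [
    {"reference": "Anchor", "predicted": "Anchor", "duration": 600.0},
    {"reference": "Journalist", "predicted": "Journalist", "duration": 240.0},
    {"reference": "Journalist", "predicted": "Other", "duration": 60.0},
    {"reference": "Other", "predicted": "Other", "duration": 100.0},
]

sigma_a = role_accuracy(speakers)      # 3 of 4 speakers correct
tau_a = duration_accuracy(speakers)    # 940 of 1000 seconds correct
low, high = confidence_interval_95(sigma_a, len(speakers))
print(sigma_a, tau_a, (low, high))
```

In this toy case the duration accuracy exceeds the role accuracy because the only error concerns a speaker with little speech time, which is one way a result such as 82% of roles and more than 89% of speech duration can arise.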


About this article

Cite this article

Bigot, B., Ferrané, I., Pinquier, J. et al. Detecting individual role using features extracted from speaker diarization results. Multimed Tools Appl 60, 347–369 (2012). https://doi.org/10.1007/s11042-010-0609-9

Issue Date: September 2012
DOI: https://doi.org/10.1007/s11042-010-0609-9
