Elsevier

Speech Communication

Volume 56, January 2014, Pages 1-18
Characterizing and detecting spontaneous speech: Application to speaker role recognition

https://doi.org/10.1016/j.specom.2013.07.007

Highlights

  • We present a study and an evaluation of various spontaneous speech features.

  • We propose a two-level strategy to assign a spontaneity level to each speech segment.

  • A spontaneous speech detection system can be used for speaker role detection.

Abstract

Processing spontaneous speech is one of the many challenges that automatic speech recognition systems have to deal with. The main characteristics of this kind of speech are disfluencies (filled pause, repetition, false start, etc.) and many studies have focused on their detection and correction. Spontaneous speech is defined in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents.

Acoustic and linguistic features made available by an automatic speech recognition system are proposed to characterize and detect spontaneous speech segments in large audio databases. To better define this notion of spontaneous speech, segments of an 11-hour corpus (French Broadcast News) were manually labeled according to three classes of spontaneity.

We first present a study of these features. We then propose a two-level strategy to automatically assign a class of spontaneity to each speech segment. The proposed system reaches 73.0% precision and 73.5% recall on high spontaneous speech segments, and 66.8% precision and 69.6% recall on prepared speech segments.

A quantitative study shows that the classes of spontaneity are useful information to characterize the speaker roles. This is confirmed by extending the speech spontaneity characterization approach to build an efficient automatic speaker role recognition system.

Introduction

Extracting information from large audio databases is becoming a very challenging task as the amount of available audio data continues to grow (podcasts, online video sharing, etc.). Extracting the structure of an audio document as well as its linguistic content is needed to retrieve high-level information. For example, part of this information retrieval process consists of adding sentence boundaries and punctuation to automatic transcriptions. This segmentation process is very important for many tasks such as speech summarization, speech-to-speech translation or the distillation task of the GALE program, as defined in Hakkani-Tür and Tür (2007). Nevertheless, the difficulty of providing this structure depends on other phenomena, such as the type of speech. Indeed, this process is much more difficult in the presence of spontaneous speech, as this kind of speech is characterized by ungrammaticality and disfluencies. Moreover, when clustering documents according to their content or structure, the presence of spontaneous speech segments should be an interesting descriptor. It is therefore useful to characterize the spontaneity level of speech segments at an early stage in order to adapt automatic speech recognition (ASR) systems, as presented in Dufour et al. (2010a).

The type of speech in broadcast audio data can switch between prepared speech (news presentations, reports, etc.) and more spontaneous speech (interviews, debates, dialogues, etc.). The main characteristics of spontaneous speech are disfluencies (filled pause, repetition, repair and false start), and many studies have focused on their detection and correction (Goto et al., 1999; Liu et al., 2005; Lease et al., 2006), as pointed out by the NIST Rich Transcription Fall 2004 evaluation. All these studies show a substantial drop in performance between results obtained on reference transcriptions and those obtained on automatic transcriptions. This could be explained by the noise that ASR systems generate on spontaneous speech segments, where Word Error Rates (WER) are higher than on prepared speech. A segment refers to a portion of the audio signal in an audio file. Segments can contain speech, music, etc. In speech recognition, automatic segmentation splits audio documents into segments that last between 5 and 20 s. These segments may contain long pauses, which generally define their boundaries.
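As a reminder of how the Word Error Rate mentioned above is conventionally obtained, here is a minimal sketch that counts word-level substitutions, deletions and insertions with an edit distance and divides by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, and the scoring tool actually used in the study is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one inserted filler word ("uh") and one deleted word ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat uh sat on mat")
```

Because insertions are counted, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.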

In addition to disfluencies, spontaneous speech is also characterized by ungrammaticality and a language register different from the one that can be found in written texts, as shown in Boula de Mareüil et al. (2005). Depending on the speaker, the emotional state, and the context, the language used can be very different.

In this study we define spontaneous speech as unprepared speech, in opposition to prepared speech where utterances contain well-formed sentences. Prepared speech is produced by speakers who have enough time to prepare their intervention.

We propose to consider a set of acoustic and linguistic features for characterizing spontaneous speech, limited to features that can be extracted from automatic speech recognition processing alone. This choice was motivated by the availability of these features when audio documents are indexed by their lexical content using automatic transcriptions. The relevance of these features is estimated on an 11-hour corpus (French Broadcast News) manually labeled according to 3 classes of spontaneity. We then propose a speech spontaneity characterization system which automatically assigns a class of spontaneity to each speech segment. This method involves two major steps:

  • Local process: individual classification of each speech segment according to its class of spontaneity using the set of acoustic and linguistic features studied.

  • Global decision: the nature of the contiguous neighboring speech segments is taken into account. Thereby, the categorization of each speech segment has an impact on the categorization of the other ones.
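The interplay between the local process and the global decision can be illustrated with a small sketch. It assumes that a local classifier has already produced strictly positive class probabilities for each segment, and it uses Viterbi decoding with a single hypothetical `stay_prob` parameter as a stand-in for the global decision. This is not necessarily the model used in the article, and all names and numbers below are illustrative assumptions.

```python
import math

CLASSES = ["prepared", "low spontaneous", "high spontaneous"]


def local_decision(local_probs):
    """Label each segment independently with its most probable class."""
    return [
        CLASSES[max(range(len(CLASSES)), key=lambda k: probs[k])]
        for probs in local_probs
    ]


def global_decision(local_probs, stay_prob=0.8):
    """Viterbi decoding: neighboring segments are encouraged to share a class.

    local_probs: one list of strictly positive class probabilities per segment.
    stay_prob: assumed probability that two contiguous segments share a class.
    """
    if not local_probs:
        return []
    n = len(CLASSES)
    switch_prob = (1.0 - stay_prob) / (n - 1)

    def log_trans(a, b):
        return math.log(stay_prob if a == b else switch_prob)

    # best[k] = best log-score of any label sequence ending in class k
    best = [math.log(p) for p in local_probs[0]]
    backpointers = []
    for probs in local_probs[1:]:
        new_best, pointers = [], []
        for b in range(n):
            prev = max(range(n), key=lambda a: best[a] + log_trans(a, b))
            new_best.append(best[prev] + log_trans(prev, b) + math.log(probs[b]))
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace the best path back from the final segment.
    state = max(range(n), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [CLASSES[k] for k in path]


# Made-up probabilities for four contiguous segments.
example_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],  # locally ambiguous segment
    [0.7, 0.2, 0.1],
]
local_labels = local_decision(example_probs)
global_labels = global_decision(example_probs)
```

In this toy example the third segment is labeled "low spontaneous" by the local process alone, while the global decision relabels it "prepared" because its neighbors are confidently prepared.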

We also apply our automatic speech spontaneity characterization system to a manually labeled speaker role corpus, both to assess our speech spontaneity detection method on a new task and to see whether a correlation exists between a speaker role and a class of spontaneity. We then propose to use this speech spontaneity characterization method directly to recognize speaker roles.

This article is an extension of previous work presented in Dufour et al. (2009a, 2011), and provides more details about our speech spontaneity detection method based on acoustic and linguistic features made available during a speech recognition process. Moreover, in this article, a fully automatic speaker role recognition system is presented with experimental results. Section 2 presents a study of the features extracted to characterize the level of spontaneity of each speech segment. We also describe the correlation between the Word Error Rate obtained by a state-of-the-art ASR decoder on this broadcast news corpus and the level of spontaneity. We then propose, in Section 3, a two-step automatic speech spontaneity characterization system: the first step individually classifies each speech segment with a class of spontaneity, while the second one takes advantage of a global decision process to improve the speech spontaneity characterization. A study of the relationship between speech spontaneity and speaker role is presented in Section 4. The speaker role recognition system based on our speech spontaneity detection method is finally presented in Section 4.4.

Section snippets

Levels of spontaneity

By defining spontaneous speech as unprepared speech, it is possible to follow the definition proposed by Luzzati (2004), who defined a spontaneous utterance as: “a statement conceived and perceived during its utterance”. This definition illustrates the subjectivity of the classification between prepared and spontaneous speech. Ideally, to annotate a speech corpus with labels representing the fluency of each speech segment, each speaker would have to annotate their own utterances. As this seems not

General approach

We propose to treat the detection of the type of speech as a multiclass classification problem. The main idea is to combine all the extracted features (acoustic, linguistic and ASR confidence measures) in order to label each speech segment with a class of spontaneity, as presented in Dufour et al. (2009b). This combination is made through a classification process, which provides the most likely class of spontaneity based on the extracted features. Fig. 1 summarizes the approach followed to assign

Applying spontaneous speech detection to characterize speaker roles

In this part, the spontaneous speech detection system is applied to a corpus annotated with speaker roles, with two main objectives:

  • First, we want to see whether a class of spontaneity might be useful information to help characterize the speaker role in an audio document. This first analysis follows previous work on speaker role (Barzilay et al., 2000; Liu, 2006) or topic identification (Peskin et al., 1993; McDonough et al., 1994). The main goal of this field is to extract

Conclusion

We first proposed an analysis of various acoustic and linguistic features extracted from automatic speech recognition processing in order to characterize and detect spontaneous speech segments in large audio databases. To better define this notion of spontaneous speech, speech segments of an 11-hour corpus (French Broadcast News) were manually labeled according to levels of spontaneity. This manual labeling helped to define three classes of spontaneity: prepared, low spontaneous and

References (42)

  • Mohri, M., et al., 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language.
  • Amaral, R., Trancoso, I., 2003. Segmentation and indexation of broadcast news. In: ISCA Workshop on Multilingual Spoken...
  • Barzilay, R., Collins, M., Hirschberg, J., Whittaker, S., 2000. The rules behind roles: Identifying speaker role in...
  • Bazillon, T., Estève, Y., Luzzati, D., 2008. Manual vs assisted transcription of prepared and spontaneous speech. In:...
  • Bigot, B., Ferrané, I., Pinquier, J., André-Obrecht, R., 2010. Speaker role recognition to help spontaneous...
  • Boula de Mareüil, P., Habert, B., Bénard, F., Adda-Decker, M., Barras, C., Adda, G., Paroubek, P., 2005. A quantitative...
  • Caelen-Haumont, G., 2002. Perlocutory values and functions of melisms in spontaneous dialogue. In: Proceedings of the...
  • Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
  • Damnati, G., Charlet, D., 2011. Robust speaker turn role labeling of tv broadcast news shows. In: International...
  • Deléglise, P., Estève, Y., Meignier, S., Merlin, T., 2009. Improvements to the LIUM French ASR system based on CMU...
  • Duez, D., 1982. Salient pauses and non salient pauses in three speech style. In: Language and Speech, vol. 25, pp....
  • Dufour, R., Estève, Y., Deléglise, P., Béchet, F., 2009a. Local and global models for spontaneous speech segment...
  • Dufour, R., Jousse, V., Estève, Y., Béchet, F., Linarès, G., 2009b. Spontaneous speech characterization and detection...
  • Dufour, R., Bougares, F., Estève, Y., Deléglise, P., 2010a. Unsupervised model adaptation on targeted speech segments...
  • Dufour, R., Estève, Y., Deléglise, P., Béchet, F., 2010b. Automatic indexing of speech segments with spontaneity levels...
  • Dufour, R., Estève, Y., Deléglise, P., 2011. Investigation of Spontaneous Speech Characterization Applied to Speaker...
  • Estève, Y., Bazillon, T., Antoine, J.-Y., Béchet, F., Farinas, J., 2010. The EPAC corpus: manual and automatic...
  • Eugenio, B.D., et al., 2004. The kappa statistic: a second look. Computational Linguistics.
  • Galliano, S., Geoffrois, E., Mostefa, D., Choukri, K., Bonastre, J., Gravier, G., 2005. The ESTER phase II evaluation...
  • Garg, P.N., Favre, S., Salamin, H., Hakkani-Tür, D., Vinciarelli, A., 2008. Role recognition for meeting participants:...
  • Goto, M., Itou, K., Hayamizu, S.A., 1999. A Real-time Filled Pause Detection System for Spontaneous Speech Recognition....

Cited by (19)

    • Positioning oneself in different roles: Structural and lexical measures of power relations between speakers in Map Task Corpus

      2020, Speech Communication
      Citation Excerpt :

      For example, Dufour et al. (2009) examined filled pauses, repetitions, false start, bags of n-grams (from one to three words), and average length of syntactic chunks on the segment. Dufour et al. (2014) proposed an original approach to detect speaker roles using automatic spontaneous speech detection systems. Experiments showed that features and approaches initially designed to detect speech spontaneity in audio documents could be directly applied to classify speaker roles.

    • Unsupervised classification of speaker roles in multi-participant conversational speech

      2017, Computer Speech and Language
      Citation Excerpt :

      Results revealed an accuracy of 74% in recognizing the formal roles and an accuracy of 66% in correctly identifying the social roles. Dufour et al. (2014) first inputted acoustic and linguistic features to a classifier (i.e. ICSIBOOST) for detecting the spontaneity of each speech segment, and then proposed to directly apply their two-step spontaneity detection method for speaker role recognition based on the link between speech spontaneity and speaker roles. They found that some speaker roles have a predominant class of spontaneity (for example, the prepared class for the Commentator role).

    • Can We Use Speaker Embeddings On Spontaneous Speech Obtained From Medical Conversations To Predict Intelligibility?

      2023, 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
    • A Novel Scheme to Classify Read and Spontaneous Speech

      2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)