
Speech Communication

Volume 32, Issues 1–2, September 2000, Pages 111-126

DISTBIC: A speaker-based segmentation for audio data indexing

https://doi.org/10.1016/S0167-6393(00)00027-3

Abstract

In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It aims to extract homogeneous segments containing the longest possible utterances produced by a single speaker. In our context, no assumption is made about prior knowledge of the speaker or speech signal characteristics (neither speaker model, nor speech model). However, we assume that people do not speak simultaneously and that we have no real-time constraints. We review existing techniques and propose a new segmentation method, which combines two different segmentation techniques. This method, called DISTBIC, is organized into two passes: first the most likely speaker turns are detected, and then they are validated or discarded. The advantage of our algorithm is its efficiency in detecting speaker turns even close to one another (i.e., separated by a few seconds).

Introduction

With the ever-increasing number of TV channels and radio stations, and thanks to the huge storage capacity now available, many hours of TV and radio broadcasts are collected every year by national heritage institutions, such as the Institut National de l'Audiovisuel (INA) in France or the BBC archives in the UK. For example, INA holds 45 years of TV archives comprising 300,000 hours of national TV programs and 60 years of radio archives comprising 400,000 hours of radio programs. Moreover, with the systematic digitization of information, multimedia databases are growing rapidly.

Besides the storage and architecture problems underlying the design of such databases, another crucial problem is information retrieval: how to formulate a query in a convenient way, and how to find the desired information efficiently and quickly, whatever its form: text, drawings, images, video, audio, music or speech. Pre-indexing is necessary to facilitate and speed up any kind of query.

Clearly, access to audio documents is much more difficult than access to text: although text retrieval must cope with some variability of spelling by proposing several approximate solutions to the user, it is still easier to detect a name or a string of words in a text than to recognize a speaker, spot a word within an audio recording, or recognize a spoken sentence over a large lexicon. Moreover, listening to an audio recording takes much more time than reading a text. Consequently, it is essential to be able to access the significant segments directly, rather than listen to the whole audio recording to retrieve the pertinent information.

Audio document indexing associates with each audio document a file describing its structure in terms of retrieval keys. Phoneme strings can serve as keys for retrieving a word or a sentence in a speech file (word and sentence spotting). Topic spotting plays an essential role in document filtering and understanding. Another key could be speaker identity: the presence of a given speaker in a conversation can be detected if that speaker's voice characteristics have been enrolled beforehand. Automatic analysis of conversation recordings requires segmentation into single-speaker segments and clustering of those segments into one-speaker sets.

In this paper, we mainly address the specific problem of segmenting an audio database with respect to speakers, an essential first step towards full indexing. To stay close to the application, no assumption is made about prior knowledge of the speaker or of the speech signal characteristics. However, we assume that people do not speak simultaneously. Additionally, since the construction of an index file is an off-line process, we have no real-time constraints. The problem of speaker-based segmentation and indexing is stated in Section 2, and possible application fields are described in Section 2.2. The hypotheses made for this work are discussed; they place our work in the perspective of the approaches followed by other authors. Section 3 briefly describes the pioneering indexing tool of BBN for application in air traffic control. Section 4 deals with the segmentation operation, which is central to this paper, and gives a short review of the different proposed techniques, including the inspiring technique used by Chen and Gopalakrishnan (1998) at IBM. A new technique, DISTBIC, is then proposed, based on a two-pass approach. Since no prior knowledge about speakers is used, our solution turns out to be close to a general change-detection algorithm; however, applying it to sequences of feature vectors extracted from the speech waveform, which carry speaker information, calls for specific tunings beyond the general principle. Different distance criteria are presented, as well as the complete algorithm for speaker turn detection. Computational effort is not a crucial concern here, whereas the completeness of the segmentation, whatever the lengths of the speaker interventions, is essential; improving this completeness is the aim of the proposed algorithm. The results of DISTBIC are reported in Section 6. We conclude and describe perspectives towards the complete realization of an indexing tool in Section 7.

Section snippets

Description of the problem

Audio speaker indexing consists of analyzing a speaker sequence: in other words, the question is who is speaking, and when. Associating speech segments with the same speaker is as important as speaker identification, since this information reveals the structure of a conversation between several persons. Most of the time, no a priori knowledge of the content of the recording is available: neither the number of different speakers nor their identities. As a consequence, no

A pioneering application

The aim of the pioneering work at BBN (Gish et al., 1991) is to automatically retrieve instructions given to pilots among recorded dialogs between pilots and air traffic controllers, in order to improve air traffic at Dallas-Fort Worth airport. Air traffic controllers may all use the same radio channel, so that several of them are engaged in the dialog. Segmentation and indexing constitute the first step of this study. Next steps are: reconstitution of a dialog between one pilot and one controller, flight

Segmentation

Segmentation may use different features of the discourse:

  • silence detection,

  • speaker turn detection,

  • frame identification, requiring classes of models: speakers, contents (music, speech, noise, …). This approach requires training material to build the models and cannot be used for general segmentation without a priori knowledge; it is useful in a second step to refine the segmentation with models trained on clustered segments.
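The second feature, model-free speaker turn detection, can be approached by sliding a pair of adjacent analysis windows over the feature stream and scoring how dissimilar their contents are. The sketch below is an illustrative Python implementation of this idea, using a symmetrized Kullback-Leibler divergence between diagonal Gaussians fitted to each window; the window size, step and peak-selection rule are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def turn_candidates(features, win=100, step=10, rel_height=0.6):
    """Score the dissimilarity of two adjacent sliding windows over a
    sequence of feature vectors and return speaker-turn candidates.

    The distance is a symmetrized Kullback-Leibler divergence between
    diagonal Gaussians fitted to each window; strict local maxima of
    the distance curve that stand out above a fraction of the global
    peak are kept as candidates (illustrative defaults throughout).
    """
    def sym_kl(a, b):
        # Symmetrized KL divergence between two diagonal Gaussians
        ma, va = a.mean(axis=0), a.var(axis=0) + 1e-8
        mb, vb = b.mean(axis=0), b.var(axis=0) + 1e-8
        return 0.5 * np.sum(va / vb + vb / va
                            + (ma - mb) ** 2 * (1 / va + 1 / vb) - 2)

    centers, dists = [], []
    for start in range(0, len(features) - 2 * win + 1, step):
        left = features[start:start + win]
        right = features[start + win:start + 2 * win]
        centers.append(start + win)          # frame index of the boundary
        dists.append(sym_kl(left, right))
    dists = np.array(dists)
    # Keep strict local maxima that rise above rel_height * global peak
    return [centers[i] for i in range(1, len(dists) - 1)
            if dists[i] > dists[i - 1] and dists[i] > dists[i + 1]
            and dists[i] > rel_height * dists.max()]
```

Peaks of the distance curve mark plausible turn candidates; a second validation pass is then needed to weed out spurious maxima.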

DISTBIC: a new two-pass segmentation technique

The method proposed in this paper is based on a two-step analysis: a first pass uses a distance computation to determine the speaker-turn candidates, and a second pass uses the BIC (in fact, the third pass of the BIC procedure) to validate or discard these candidates. Our segmentation technique is less dependent on the average segment size.
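The validation pass can be illustrated with the standard delta-BIC test between the two blocks of feature vectors on either side of a candidate boundary. The sketch below is a generic implementation of that criterion, not the paper's exact code; the penalty weight `lam` is the usual BIC tuning parameter, set to 1.0 here purely for illustration.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two adjacent blocks of feature vectors.

    Each block, and their union, is modelled as a single
    full-covariance Gaussian; a positive value favours two distinct
    models, i.e. a speaker change at the boundary between x and y.
    """
    z = np.vstack([x, y])
    n, d = z.shape
    nx, ny = len(x), len(y)

    def logdet(m):
        # Log-determinant of the sample covariance matrix of m
        return np.linalg.slogdet(np.cov(m, rowvar=False))[1]

    # Model-complexity penalty: d mean terms + d(d+1)/2 covariance terms
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z)
                  - nx * logdet(x)
                  - ny * logdet(y)) - lam * penalty
```

A candidate turn is kept when the returned value is positive and discarded (the two segments merged) when it is negative.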

Experiments and results

In order to fully evaluate the DISTBIC segmentation technique, we first run several tests on the possible configurations of the technique; for example, the most accurate distance measure is determined by these pre-tests. Once this optimal DISTBIC configuration is established, we compare it with the BIC procedure in Section 6.3.2. Finally, a more thorough analysis of the DISTBIC results is conducted on TV news in Section 6.3.3.
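Evaluations of this kind typically report the precision and recall of the detected speaker turns against a hand-labelled reference, counting a detection as correct when it falls within a small tolerance of a true turn. The helper below is a generic sketch of that bookkeeping; the function name and the one-second default tolerance are assumptions for illustration, not values taken from the paper.

```python
def turn_detection_scores(detected, reference, tol=1.0):
    """Precision and recall of detected speaker turns (in seconds)
    against a reference, with a matching tolerance of tol seconds.

    Each reference turn may be matched by at most one detection, so a
    cluster of detections around one true turn is not over-counted.
    """
    matched = set()
    hits = 0
    for d in detected:
        for i, r in enumerate(reference):
            if i not in matched and abs(d - r) <= tol:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

Missed turns lower the recall, while false alarms (spurious candidates that survive validation) lower the precision.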

Conclusion and further work

We proposed a segmentation technique composed of a distance-based algorithm followed by a BIC-based algorithm. This technique proved to be as accurate as the BIC procedure on conversations containing long segments, and gave better results than the BIC procedure on conversations containing short segments. Our experiments showed that the parameters mainly depend on the length of the speech segments contained in the conversation. A problem still remains: parameters

Acknowledgements

The authors would like to thank S. Marchand-Maillet for his help and are grateful to the anonymous reviewers for helpful comments on an earlier version of this paper.

References (18)

  • Bimbot, F., et al., 1995. Second-order statistical measures for text-independent speaker identification. Speech Communication.
  • Beigi, H.S.M., et al., 1998. Speaker, channel and environment change detection. In: World Congress of Automation.
  • Bonastre, J.-F., Delacourt, P., Fredouille, C., Merlin, T., Wellekens, C.J., 2000. A speaker tracking system based on...
  • Chen, S.S., et al., 1998. Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion. In: DARPA Speech Recognition Workshop.
  • Gauvain, J.-L., et al., 1998. Partitioning and transcription of broadcast news data. In: International Conference on Spoken Language Processing.
  • Gish, H., Schmidt, N., 1994. Text-independent speaker identification. In: IEEE Signal Processing Magazine, October,...
  • Gish, H., Siu, M.-H., Rohlicek, R., 1991. Segregation of speakers for speech recognition and speaker identification....
  • Godfrey, J.J., Holliman, E.C., McDaniel, J., 1992. SWITCHBOARD: telephone speech corpus for research and development....
  • Liu, D., Kubala, F., 1999. Fast speaker change detection for broadcast news transcription and indexing. In: Eurospeech....
There are more references available in the full text version of this article.


The financial support of this project from the Centre National d'Etudes des Télécommunications (CNET) under the Grant No. 98 1B is gratefully acknowledged.
