DISTBIC: A speaker-based segmentation for audio data indexing☆
Introduction
With the ever-increasing number of TV channels and radio broadcasting stations, and thanks to the huge storage means currently available, many hours of TV and radio broadcasts are collected every year by national heritage institutions, such as the Institut National de l'Audiovisuel (INA) in France or the BBC archives in the UK. For example, INA possesses 45 years of TV archives consisting of 300,000 hours of national TV programs and 60 years of radio archives consisting of 400,000 hours of radio programs. Moreover, with the systematic digitization of information, multimedia databases are growing at a tremendous rate.
Besides the storage and architecture problems underlying the design of such databases, another crucial problem is information retrieval: how to formulate a query in a convenient way, and how to find the searched information efficiently and quickly, whatever it may be: text, drawings, images, video, audio, music or speech. Pre-indexing is necessary to facilitate and speed up any kind of query.
Clearly, access to audio documents is much more difficult than access to text; although text retrieval must cope with some variability of spelling by proposing approximate matches to the user, it is still easier to detect a name or a string of words in a text than to recognize a speaker, to spot a word within an audio recording, or to recognize a spoken sentence over a large lexicon. Moreover, listening to an audio recording takes much more time than reading a text. Consequently, it is essential to be able to access the significant segments directly, rather than listening to the whole audio recording to retrieve the pertinent information.
Audio document indexing associates with each audio document a file describing its structure in terms of retrieval keys. Phoneme strings can serve as keys for retrieving a word or a sentence in a speech file (word and sentence spotting). Topic spotting plays an essential role in document filtering and understanding. Another key could be speaker identity: the presence of a given speaker in a conversation can be detected if that speaker's voice characteristics have been enrolled a priori. Automatic analysis of conversation recordings requires segmentation into segments containing only one speaker, and clustering of these segments into one-speaker sets.
In this paper, we mainly address the specific problem of segmenting an audio database with respect to speakers, which is an essential first step towards full indexing. To stay close to the application, no assumption is made about prior knowledge of the speakers or of the speech signal characteristics. However, we assume that people do not speak simultaneously. Additionally, since the construction of an index file is an off-line process, we have no real-time constraints. The problem of speaker-based segmentation and indexing is stated in Section 2, and possible application fields are described in Section 2.2. The hypotheses made for this work are discussed; they place our work in the perspective of approaches followed by other authors. Section 3 briefly describes the pioneering indexing tool of BBN for air traffic control applications. Section 4 deals with the segmentation operation, which is central to this paper, and gives a short review of the different techniques proposed in the literature, including the inspiring technique used by Chen and Gopalakrishnan (1998) at IBM. A new, original two-pass technique, DISTBIC, is then proposed. Since no prior knowledge about the speakers is used, our solution turns out to be close to a general change detection algorithm; however, its application to sequences of feature vectors extracted from the speech waveform, which carry speaker information, calls for specific tunings beyond the general principle. The different criteria are presented, as well as the complete speaker turn detection algorithm. The computational effort required is not crucial here, whereas the completeness of the segmentation, whatever the lengths of the speaker interventions, is essential; improving this completeness is the aim of the proposed algorithm. The results of DISTBIC are reported in Section 6. We conclude and describe perspectives towards the complete realization of an indexing tool in Section 7.
Section snippets
Description of the problem
Audio speaker indexing consists of analyzing a sequence of speakers. In other words, the question is to know who is speaking, and when. Associating speech segments with the same speaker is as important as speaker identification: this information allows the understanding of the structure of a conversation between several persons. Most of the time, no a priori knowledge is available about the content of the recording: neither the number of different speakers nor their identities. As a consequence, no
A pioneering application
The aim of the pioneering work at BBN (Gish et al., 1991) was to automatically retrieve the instructions given to pilots from recorded dialogs between pilots and air traffic controllers, in order to improve air traffic at Dallas-Fort Worth airport. Air traffic controllers may all use the same radio channel, so that several of them are engaged in the dialog. Segmentation and indexing constitute the first step of this study. The next steps are: reconstitution of a dialog between one pilot and one controller, flight
Segmentation
Segmentation may use different features of the discourse:
- silence detection,
- speaker turn detection,
- frame identification, which requires classes of models: speakers, contents (music, speech, noise, …). This approach requires training material for building the models and cannot be used for general segmentation without a priori knowledge; it is useful in a second step to refine the segmentation with models trained on clustered segments.
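The first of these cues, silence detection, can be illustrated with a minimal energy-based sketch. The frame length and threshold below are illustrative choices, not the parameters of any system described in this paper:

```python
import numpy as np

def detect_silence(signal, frame_len=160, threshold_db=-30.0):
    """Flag low-energy (silent) frames of a waveform.

    A minimal energy-thresholding sketch: split the signal into
    fixed-length frames and mark those whose energy falls below
    an illustrative dB threshold (signal assumed in [-1, 1]).
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Frame energy in dB; the small constant avoids log(0) on digital silence.
    energy = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy < threshold_db

# Example: 0.1 s of loud noise ("speech") followed by 0.1 s of near-silence.
rng = np.random.default_rng(0)
speech = 0.5 * rng.standard_normal(1600)
silence = 0.001 * rng.standard_normal(1600)
flags = detect_silence(np.concatenate([speech, silence]))
```

In practice such a raw threshold must be adapted to the recording conditions, which is one reason silence detection alone is not sufficient for speaker-based segmentation.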
DISTBIC: a new two-pass segmentation technique
The method proposed in this paper is based on a two-pass analysis: a first pass uses a distance computation to determine the speaker-turn candidates, and a second pass uses the BIC (in fact, the third pass of BIC) to validate or discard these candidates. Our segmentation technique is less dependent on the average segment size.
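This two-pass idea can be sketched as follows. The distance measure (a symmetric Kullback-Leibler divergence between diagonal Gaussians), the window sizes, and the ΔBIC penalty weight below are illustrative assumptions, not the exact configuration of DISTBIC:

```python
import numpy as np

def gauss_div(x, y):
    # Symmetric KL divergence between two diagonal Gaussians fitted to
    # the feature windows x and y (shape: frames x dims).
    m1, v1 = x.mean(0), x.var(0) + 1e-8
    m2, v2 = y.mean(0), y.var(0) + 1e-8
    return 0.5 * np.sum(v1 / v2 + v2 / v1 + (m1 - m2) ** 2 * (1 / v1 + 1 / v2) - 2)

def delta_bic(x, y, lam=1.0):
    # dBIC > 0 favours modelling x and y with two separate full-covariance
    # Gaussians rather than a single one, i.e. a genuine change point.
    z = np.vstack([x, y])
    n, d = z.shape
    def logdet(a):
        return np.linalg.slogdet(np.cov(a, rowvar=False) + 1e-6 * np.eye(d))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y)) - penalty

def distbic(feats, win=100, step=10):
    # Pass 1: distance between adjacent sliding windows; local maxima of
    # the distance curve become speaker-turn candidates.
    dists = [(i, gauss_div(feats[i - win:i], feats[i:i + win]))
             for i in range(win, len(feats) - win, step)]
    cands = [dists[k][0] for k in range(1, len(dists) - 1)
             if dists[k][1] > dists[k - 1][1] and dists[k][1] > dists[k + 1][1]]
    # Pass 2: keep only the candidates that the BIC criterion confirms.
    return [c for c in cands if delta_bic(feats[c - win:c], feats[c:c + win]) > 0]
```

The first pass over-generates candidates cheaply, so that even short speaker interventions produce a local distance maximum; the BIC pass then removes the spurious ones.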
Experiments and results
In order to fully evaluate the DISTBIC segmentation technique, we first perform several tests on the possible configurations of the technique; for example, the most accurate distance measure is determined by these pre-tests. Once this optimal DISTBIC procedure is established, we compare it with the BIC procedure in Section 6.3.2. Finally, a more thorough analysis of the DISTBIC results on TV news is conducted in Section 6.3.3.
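Segmentation output of this kind is typically scored by matching the detected speaker turns against reference turns within a tolerance window. The sketch below shows one such scoring scheme; the tolerance value and the greedy matching rule are illustrative assumptions, not the exact evaluation protocol of the paper:

```python
def turn_detection_scores(true_turns, detected, tol=0.5):
    """Precision/recall for speaker-turn detection with a tolerance window.

    A detected turn within `tol` seconds of an unmatched true turn counts
    as a hit; unmatched detections are false alarms, unmatched true turns
    are missed detections.
    """
    matched = set()
    hits = 0
    for d in detected:
        for i, t in enumerate(true_turns):
            if i not in matched and abs(d - t) <= tol:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(true_turns) if true_turns else 0.0
    return precision, recall

# Example: three reference turns, one detection off by more than `tol`.
scores = turn_detection_scores([3.0, 8.0, 12.0], [3.2, 7.0, 12.1])
# → precision 2/3, recall 2/3
```

Precision penalizes false alarms (over-segmentation) while recall penalizes missed turns, so the two must be reported together when comparing segmentation techniques.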
Conclusion and further work
We proposed a segmentation technique composed of a distance-based algorithm followed by a BIC-based algorithm. This segmentation technique proved to be as accurate as the BIC procedure in the case of conversations containing long segments and to give better results than the BIC procedure when applied to conversations containing short segments. Our experiments showed that parameters mainly depend on the length of speech segments contained in the conversation. A problem still remains: parameters
Acknowledgements
The authors would like to thank S. Marchand-Maillet for his help and are grateful to the anonymous reviewers for helpful comments on an earlier version of this paper.
References (18)
- et al., 1995. Second-order statistical measures for text-independent speaker identification. Speech Communication.
- et al., 1998. Speaker, channel and environment change detection. In: World Congress of Automation.
- Bonastre, J.-F., Delacourt, P., Fredouille, C., Merlin, T., Wellekens, C.J., 2000. A speaker tracking system based on...
- Chen, S.S., Gopalakrishnan, P.S., 1998. Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion. In: DARPA Speech Recognition Workshop.
- et al., 1998. Partitioning and transcription of broadcast news data. In: International Conference on Spoken Language Processing.
- Gish, H., Schmidt, N., 1994. Text-independent speaker identification. In: IEEE Signal Processing Magazine, October,...
- Gish, H., Siu, M.-H., Rohlicek, R., 1991. Segregation of speakers for speech recognition and speaker identification....
- Godfrey, J.J., Holliman, E.C., McDaniel, J., 1992. SWITCHBOARD: telephone speech corpus for research and development....
- Liu, D., Kubala, F., 1999. Fast speaker change detection for broadcast news transcription and indexing. In: Eurospeech....
☆ The financial support of this project by the Centre National d'Etudes des Télécommunications (CNET) under Grant No. 98 1B is gratefully acknowledged.