Regular Article
Lightly supervised and unsupervised acoustic model training

https://doi.org/10.1006/csla.2001.0186

Abstract

The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error rate of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the DARPA TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for the success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data. These experiments demonstrate that light or no supervision can dramatically reduce the cost of building acoustic models.
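
To make the labeling step concrete, the following is a minimal sketch in Python of how a hypothesized transcription might be filtered against closed captions so that only agreeing words are kept as approximate training labels. The word-level alignment via difflib.SequenceMatcher, the function name agreeing_words, and the toy sentences are illustrative assumptions, not the paper's actual recognizer or alignment procedure, which operates on full broadcast-news decodings with additional heuristics.

    # Sketch of the lightly supervised labeling idea: align the recognizer's
    # hypothesis against the closed captions and keep only the words on which
    # the two agree as approximate training labels.
    # (Illustrative only; not the paper's actual pipeline.)
    from difflib import SequenceMatcher

    def agreeing_words(hypothesis, captions):
        """Return the words where the ASR hypothesis and the closed
        captions match under a word-level alignment."""
        matcher = SequenceMatcher(a=hypothesis, b=captions, autojunk=False)
        kept = []
        for block in matcher.get_matching_blocks():
            kept.extend(hypothesis[block.a:block.a + block.size])
        return kept

    # Toy example: one misrecognized word; the disagreeing region is
    # dropped rather than used as a potentially wrong label.
    hyp = "the president met with reporters on tuesday".split()
    cc = "the president met with supporters on tuesday".split()
    print(agreeing_words(hyp, cc))
    # -> ['the', 'president', 'met', 'with', 'on', 'tuesday']

In the unsupervised variant described in the abstract, this caption-filtering step is simply skipped and the hypothesized transcription is used directly as the training label.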


Cited by (256)

  • Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations

    2021, Speech Communication
Citation Excerpt:

The FCN and CNSA show similar trends across all experimental conditions, so we can perhaps infer that both of these classifiers are overfitting the training data. Since they are overfitting, we would normally expect to see improvements on these classifiers by applying web data augmentation (Lamel et al., 2002; Gorin et al., 2016). However, we barely gained any benefit from data augmentation, apparently because the in-domain and out-of-domain datasets had different acoustic feature distributions.
