Regular Article
Lightly supervised and unsupervised acoustic model training

https://doi.org/10.1006/csla.2001.0186

Abstract

The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error rate of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the DARPA TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for the success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data. These experiments demonstrate that light or no supervision can dramatically reduce the cost of building acoustic models.
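
To make the labeling step concrete, the following is a minimal sketch in Python of how a hypothesized transcription might be filtered against closed captions so that only agreeing words are kept as approximate training labels. The word-level alignment via difflib.SequenceMatcher, the function name agreeing_words, and the toy sentences are illustrative assumptions, not the paper's actual recognizer or alignment procedure, which operates on full broadcast-news decodings with additional heuristics.

    # Sketch of the lightly supervised labeling idea: align the recognizer's
    # hypothesis against the closed captions and keep only the words on which
    # the two agree as approximate training labels.
    # (Illustrative only; not the paper's actual pipeline.)
    from difflib import SequenceMatcher

    def agreeing_words(hypothesis, captions):
        """Return the words where the ASR hypothesis and the closed
        captions match under a word-level alignment."""
        matcher = SequenceMatcher(a=hypothesis, b=captions, autojunk=False)
        kept = []
        for block in matcher.get_matching_blocks():
            kept.extend(hypothesis[block.a:block.a + block.size])
        return kept

    # Toy example: one misrecognized word; the disagreeing region is
    # dropped rather than used as a potentially wrong label.
    hyp = "the president met with reporters on tuesday".split()
    cc = "the president met with supporters on tuesday".split()
    print(agreeing_words(hyp, cc))
    # -> ['the', 'president', 'met', 'with', 'on', 'tuesday']

In the unsupervised variant described in the abstract, this caption-filtering step is simply skipped and the hypothesized transcription is used directly as the training label.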


Cited by (256)

  • Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations

    2021, Speech Communication
Citation Excerpt:

The FCN and CNSA show similar trends across all experimental conditions, so we can perhaps infer that both of these classifiers are overfitting the training data. Since they are overfitting, we would normally expect to see improvements on these classifiers by applying web data augmentation (Lamel et al., 2002; Gorin et al., 2016). However, we barely gained any benefit from data augmentation, apparently because the in-domain and out-of-domain datasets had different acoustic feature distributions.
