
Computer Speech & Language

Volume 57, September 2019, Pages 20-40

Automatic sub-word unit discovery and pronunciation lexicon induction for ASR with application to under-resourced languages

https://doi.org/10.1016/j.csl.2019.02.002

Abstract

We present a method enabling the unsupervised discovery of sub-word units (SWUs) and associated pronunciation lexicons for use in automatic speech recognition (ASR) systems. This includes a novel SWU discovery approach based on self-organising HMM-GMM states that are agglomeratively tied across words as well as a novel pronunciation lexicon induction approach that iteratively reduces pronunciation variation by means of model pruning. Our approach relies only on recorded speech and associated orthographic transcriptions and does not require alphabetic graphemes. We apply our methods to corpora of recorded radio broadcasts in Ugandan English, Luganda and Acholi, of which the latter two are under-resourced. The speech is conversational and contains high levels of background noise, and therefore presents a challenge to automatic lexicon induction. We demonstrate that our proposed method is able to discover lexicons that perform as well as baseline expert systems for Acholi, and close to this level for the other two languages when used to train DNN-HMM ASR systems. This demonstrates the potential of the method to enable and accelerate ASR for under-resourced languages for which a phone inventory and pronunciation lexicon are not available by eliminating the dependence on human expertise this usually requires.

Introduction

We present a set of approaches for the development of automatic speech recognition (ASR) systems in an under-resourced setting. We constrain ourselves to a training corpus that consists only of recorded speech and associated orthographic transcriptions, and where a phone set and lexicon are not available. Although we demonstrate that the incorporation of graphemic knowledge can improve the performance of our systems, we do not rely on the availability of alphabetic graphemes. As such, the approaches described here should in principle also suit languages that use a logographic orthography.

The development of sub-word unit inventories and associated pronunciations is generally a time-consuming and expensive process that requires linguists familiar with the task language. If the steps of sub-word unit discovery and pronunciation lexicon generation could be automated with satisfactory performance, the process of implementing ASR in an under-resourced setting would be greatly streamlined. In some cases (e.g. where trained linguists are not available) it would enable the implementation of ASR where it would otherwise be infeasible. Such a data-driven approach to lexicon generation might even improve ASR performance for well-resourced tasks, but we do not assess that here.

The main objective of this work is to demonstrate the feasibility of performing ASR with an automatically induced pronunciation lexicon in a truly under-resourced setting. Many studies have addressed parts of this problem, but these typically focus only on unit segmentation and discovery, while neglecting the subsequent development of pronunciation lexicons to perform ASR. Other studies have focused on designing lexicons, but have typically adapted pre-existing phonemes or alphabetic graphemes for use as sub-word units (SWUs). In the studies that have applied automated pronunciation lexicon design with automatically discovered SWUs to ASR, the datasets used are low-perplexity (e.g. connected digits or weather query corpora) or represent ideal scenarios such as read speech that may not be representative of the results that would be obtained in an under-resourced large-vocabulary continuous speech task. In contrast, we will show that both fully automatic SWU discovery and lexicon induction of a sufficiently high quality are feasible even in the challenging scenario in which the data consists of low-quality recordings of large-vocabulary spontaneous speech and where no graphemic information is used. To this end, we present a new SWU discovery approach based on self-organising HMM-GMM states that are agglomeratively tied across words. In addition, we present a novel pronunciation lexicon induction method that models variation using an HMM that generates discrete SWU sequences and that iteratively reduces pronunciation variation by pruning the model's state emission distributions.
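The agglomerative tying of states across words can be illustrated with a minimal sketch. This is not the paper's algorithm: here each HMM state is summarised by a single one-dimensional Gaussian `(mean, var)` rather than a full HMM-GMM state, and the distance (symmetrised KL divergence) and stopping threshold are illustrative choices. The closest pair of states is repeatedly merged until no pair is closer than the threshold; each resulting cluster plays the role of one shared SWU.

```python
# Hedged sketch of agglomerative state tying across words.
# Each "state" is a 1-D Gaussian (mean, var); the paper's self-organising
# HMM-GMM states are richer, so this is only illustrative.
import math
from itertools import combinations

def sym_kl(g1, g2):
    """Symmetrised KL divergence between two 1-D Gaussians (mean, var)."""
    kl = lambda a, b: 0.5 * (math.log(b[1] / a[1])
                             + (a[1] + (a[0] - b[0]) ** 2) / b[1] - 1.0)
    return kl(g1, g2) + kl(g2, g1)

def merge(g1, g2):
    """Moment-matched merge of two equally weighted Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    m = 0.5 * (m1 + m2)
    v = 0.5 * (v1 + m1 ** 2 + v2 + m2 ** 2) - m ** 2
    return (m, v)

def tie_states(states, threshold):
    """Greedily tie the closest pair of states until the smallest pairwise
    distance exceeds `threshold`. Each cluster becomes one shared SWU."""
    clusters = [[s] for s in states]   # which original states each SWU ties
    models = list(states)              # current merged model per cluster
    while len(models) > 1:
        (i, j), d = min(
            (((i, j), sym_kl(models[i], models[j]))
             for i, j in combinations(range(len(models)), 2)),
            key=lambda x: x[1])
        if d > threshold:
            break
        models[i] = merge(models[i], models[j])
        clusters[i].extend(clusters[j])
        del models[j]
        del clusters[j]
    return clusters

# Two acoustically similar states (possibly from different words) are tied
# into one unit; the distant state remains its own unit.
clusters = tie_states([(0.0, 1.0), (0.05, 1.0), (5.0, 1.0)], threshold=0.5)
```

With these toy values, the first two states are tied into one cluster and the third stays separate, yielding two discovered units.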


Automatic lexicon induction

The task of automatically generating a pronunciation lexicon from a word-annotated speech corpus requires addressing a number of subtasks. First, a set of sub-word units needs to be established. Much previous work on acoustics-driven automatic lexicon generation begins with a small seed lexicon and then expands the vocabulary by means of a large word-annotated speech corpus (Zhang, Manohar, Povey, Khudanpur, 2017; Chen, Povey, Khudanpur, 2016; Goel, Thomas, Agarwal, Akyazi, Burget, Feng,

Proposed approach to SWU discovery and lexicon induction

In this section, we describe the approach we have developed to discover sets of sub-word units and induce associated pronunciation lexicons using training corpora that are limited to recorded speech and the associated orthographic transcriptions. Since we do not assume the availability of any word boundary information, an initial automatic word-level segmentation is performed. Then, we jointly induce an initial lexicon and associated SWU inventory, using a limited vocabulary and associated
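One ingredient of the lexicon induction stage, the iterative reduction of pronunciation variation by pruning emission distributions, can be sketched as follows. This is a simplified stand-in, not the paper's model: a word's pronunciation is approximated by one categorical distribution over SWU labels per position, whereas the paper uses an HMM generating discrete SWU sequences. The probability floor of 0.1 is an illustrative choice.

```python
# Hedged sketch of emission-distribution pruning for pronunciation
# induction. Each entry of `emissions` is a dict mapping SWU labels to
# probabilities for one position in the word's pronunciation model.

def prune_emissions(emissions, floor=0.1):
    """Drop SWU labels below `floor` and renormalise each distribution,
    shrinking the space of pronunciation variants on every pass."""
    pruned = []
    for dist in emissions:
        kept = {s: p for s, p in dist.items() if p >= floor}
        if not kept:  # always retain at least the most likely label
            s, p = max(dist.items(), key=lambda kv: kv[1])
            kept = {s: p}
        z = sum(kept.values())
        pruned.append({s: p / z for s, p in kept.items()})
    return pruned

def n_variants(emissions):
    """Number of distinct pronunciations the model can still generate."""
    n = 1
    for dist in emissions:
        n *= len(dist)
    return n

# A word with 3 candidate units at position 1 and 2 at position 2:
emissions = [{'a': 0.7, 'b': 0.25, 'c': 0.05}, {'x': 0.5, 'y': 0.5}]
pruned = prune_emissions(emissions, floor=0.1)
```

In this toy example the rare label `'c'` is pruned, reducing the number of representable pronunciations from 6 to 4; iterating the procedure with data re-alignment between passes would continue to concentrate probability on a few variants.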

Datasets

The datasets used in this study are summarised in Table 1. They have been compiled from recordings of Ugandan community radio stations broadcasting in Ugandan English, Luganda and Acholi, and have been orthographically annotated by mother-tongue speakers (Saeb, Menon, Cameron, Kibira, Quinn, Niesler, 2017; Menon, Saeb, Cameron, Kibira, Quinn, Niesler, 2017). Luganda and Acholi are both severely under-resourced Ugandan languages, with practically no resources available other than the datasets

Summary and conclusions

We have presented a novel sub-word unit discovery and lexicon induction approach that requires as input only recorded speech utterances and their associated orthographic transcriptions. This includes a new method of SWU discovery based on self-organising HMM-GMM states that are agglomeratively tied across words. In addition, we present a novel pronunciation lexicon induction method that models variation using an HMM that generates discrete SWU sequences and that iteratively reduces

Acknowledgements

The presented study was supported by Telkom South Africa. All experiments were performed using the University of Stellenbosch’s Rhasatsha HPC or the facilities at the Centre for High Performance Computing (CHPC).

References

  • Chan, W., Jaitly, N., Le, Q. V., Vinyals, O., 2015. Listen, Attend and Spell. CoRR abs/1508.01211. URL:...
  • G. Chen et al.

    Acoustic data-driven pronunciation lexicon generation for logographic languages

    Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2016)
  • Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., 2015. Attention-based Models for Speech Recognition....
  • N. Goel et al.

    Approaches to automatic lexicon learning with limited training examples

    Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing

    (2010)
  • G. Goussard et al.

    Automatic discovery of subword units and pronunciations for automatic speech recognition using TIMIT

    Proceedings of the Annual Symposium of the Pattern Recognition Society of South Africa (PRASA)

    (2010)
  • Graves, A., 2012. Sequence Transduction with Recurrent Neural Networks. CoRR abs/1211.3711 URL:...
  • A. Graves et al.

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

    Proceedings of the 23rd International Conference on Machine Learning

    (2006)
  • A. Graves et al.

    Towards end-to-end speech recognition with recurrent neural networks

    Proceedings of the 31st International Conference on Machine Learning

    (2014)
  • Graves, A., Mohamed, A., Hinton, G. E., 2013. Speech Recognition with Deep Recurrent Neural Networks. CoRR...
  • D. Harwath et al.

    Speech recognition without a lexicon – bridging the gap between graphemic and phonetic systems

    Proceedings of Interspeech

    (2014)
  • A. Jansen et al.

    Towards unsupervised training of speaker independent acoustic models

    Proceedings of Interspeech

    (2011)
  • C.-y. Lee et al.

    A nonparametric Bayesian approach to acoustic model discovery

    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers

    (2012)
  • C.-y. Lee et al.

    Unsupervised lexicon discovery from acoustic input

    Transactions of the Association for Computational Linguistics

    (2015)
  • C.-y. Lee et al.

    Joint learning of phonetic units and word pronunciations for ASR

    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

    (2013)
☆This paper has been recommended for acceptance by R. K. Moore.