
Computer Speech & Language

Volume 57, September 2019, Pages 20-40

Automatic sub-word unit discovery and pronunciation lexicon induction for ASR with application to under-resourced languages

https://doi.org/10.1016/j.csl.2019.02.002

Abstract

We present a method enabling the unsupervised discovery of sub-word units (SWUs) and associated pronunciation lexicons for use in automatic speech recognition (ASR) systems. This includes a novel SWU discovery approach based on self-organising HMM-GMM states that are agglomeratively tied across words as well as a novel pronunciation lexicon induction approach that iteratively reduces pronunciation variation by means of model pruning. Our approach relies only on recorded speech and associated orthographic transcriptions and does not require alphabetic graphemes. We apply our methods to corpora of recorded radio broadcasts in Ugandan English, Luganda and Acholi, of which the latter two are under-resourced. The speech is conversational and contains high levels of background noise, and therefore presents a challenge to automatic lexicon induction. We demonstrate that our proposed method is able to discover lexicons that perform as well as baseline expert systems for Acholi, and close to this level for the other two languages when used to train DNN-HMM ASR systems. This demonstrates the potential of the method to enable and accelerate ASR for under-resourced languages for which a phone inventory and pronunciation lexicon are not available by eliminating the dependence on human expertise this usually requires.

Introduction

We present a set of approaches for the development of automatic speech recognition (ASR) systems in an under-resourced setting. We constrain ourselves to a training corpus that consists only of recorded speech and associated orthographic transcriptions, and where a phone set and lexicon are not available. Although we demonstrate that the incorporation of graphemic knowledge can improve the performance of our systems, we do not rely on the availability of alphabetic graphemes. As such, the approaches described here should in principle also suit languages that use a logographic orthography.

The development of sub-word unit inventories and associated pronunciations is generally a time-consuming and expensive process that requires linguists familiar with the task language. If the steps of sub-word unit discovery and pronunciation lexicon generation could be automated with satisfactory performance, the process of implementing ASR in an under-resourced setting would be greatly streamlined. In some cases (e.g. where trained linguists are not available) it would enable the implementation of ASR where it would otherwise be infeasible. Such a data-driven approach to lexicon generation might even improve ASR performance for well-resourced tasks, but we do not assess that here.

The main objective of this work is to demonstrate the feasibility of performing ASR with an automatically induced pronunciation lexicon in a truly under-resourced setting. Many studies have addressed parts of this problem, but these typically focus only on unit segmentation and discovery, while neglecting the subsequent development of pronunciation lexicons to perform ASR. Other studies have focused on designing lexicons, but have typically adapted pre-existing phonemes or alphabetic graphemes for use as sub-word units (SWUs). In the studies that have applied automated pronunciation lexicon design with automatically discovered SWUs to ASR, the datasets used are low-perplexity (e.g. connected digits or weather query corpora) or represent ideal scenarios such as read speech that may not be representative of the results that would be obtained in an under-resourced large-vocabulary continuous speech task. In contrast, we will show that both fully automatic SWU discovery and lexicon induction of a sufficiently high quality are feasible even in the challenging scenario in which the data consists of low-quality recordings of large-vocabulary spontaneous speech and where no graphemic information is used. To this end, we present a new SWU discovery approach based on self-organising HMM-GMM states that are agglomeratively tied across words. In addition, we present a novel pronunciation lexicon induction method that models variation using an HMM that generates discrete SWU sequences and that iteratively reduces pronunciation variation by pruning the model's state emission distributions.
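The agglomerative tying of states across words can be illustrated with a minimal sketch. This is not the paper's algorithm: here each HMM state is summarised by a single one-dimensional Gaussian `(mean, var)` rather than a full HMM-GMM state, and the distance (symmetrised KL divergence) and stopping threshold are illustrative choices. The closest pair of states is repeatedly merged until no pair is closer than the threshold; each resulting cluster plays the role of one shared SWU.

```python
# Hedged sketch of agglomerative state tying across words.
# Each "state" is a 1-D Gaussian (mean, var); the paper's self-organising
# HMM-GMM states are richer, so this is only illustrative.
import math
from itertools import combinations

def sym_kl(g1, g2):
    """Symmetrised KL divergence between two 1-D Gaussians (mean, var)."""
    kl = lambda a, b: 0.5 * (math.log(b[1] / a[1])
                             + (a[1] + (a[0] - b[0]) ** 2) / b[1] - 1.0)
    return kl(g1, g2) + kl(g2, g1)

def merge(g1, g2):
    """Moment-matched merge of two equally weighted Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    m = 0.5 * (m1 + m2)
    v = 0.5 * (v1 + m1 ** 2 + v2 + m2 ** 2) - m ** 2
    return (m, v)

def tie_states(states, threshold):
    """Greedily tie the closest pair of states until the smallest pairwise
    distance exceeds `threshold`. Each cluster becomes one shared SWU."""
    clusters = [[s] for s in states]   # which original states each SWU ties
    models = list(states)              # current merged model per cluster
    while len(models) > 1:
        (i, j), d = min(
            (((i, j), sym_kl(models[i], models[j]))
             for i, j in combinations(range(len(models)), 2)),
            key=lambda x: x[1])
        if d > threshold:
            break
        models[i] = merge(models[i], models[j])
        clusters[i].extend(clusters[j])
        del models[j]
        del clusters[j]
    return clusters

# Two acoustically similar states (possibly from different words) are tied
# into one unit; the distant state remains its own unit.
clusters = tie_states([(0.0, 1.0), (0.05, 1.0), (5.0, 1.0)], threshold=0.5)
```

With these toy values, the first two states are tied into one cluster and the third stays separate, yielding two discovered units.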


Automatic lexicon induction

The task of automatically generating a pronunciation lexicon from a word-annotated speech corpus requires addressing a number of subtasks. First, a set of sub-word units needs to be established. Much previous work on acoustics-driven automatic lexicon generation begins with a small seed lexicon and then expands the vocabulary by means of a large word-annotated speech corpus (Zhang, Manohar, Povey, Khudanpur, 2017; Chen, Povey, Khudanpur, 2016; Goel, Thomas, Agarwal, Akyazi, Burget, Feng,

Proposed approach to SWU discovery and lexicon induction

In this section, we describe the approach we have developed to discover sets of sub-word units and induce associated pronunciation lexicons using training corpora that are limited to recorded speech and the associated orthographic transcriptions. Since we do not assume the availability of any word boundary information, an initial automatic word-level segmentation is performed. Then, we jointly induce an initial lexicon and associated SWU inventory, using a limited vocabulary and associated
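One ingredient of the lexicon induction stage, the iterative reduction of pronunciation variation by pruning emission distributions, can be sketched as follows. This is a simplified stand-in, not the paper's model: a word's pronunciation is approximated by one categorical distribution over SWU labels per position, whereas the paper uses an HMM generating discrete SWU sequences. The probability floor of 0.1 is an illustrative choice.

```python
# Hedged sketch of emission-distribution pruning for pronunciation
# induction. Each entry of `emissions` is a dict mapping SWU labels to
# probabilities for one position in the word's pronunciation model.

def prune_emissions(emissions, floor=0.1):
    """Drop SWU labels below `floor` and renormalise each distribution,
    shrinking the space of pronunciation variants on every pass."""
    pruned = []
    for dist in emissions:
        kept = {s: p for s, p in dist.items() if p >= floor}
        if not kept:  # always retain at least the most likely label
            s, p = max(dist.items(), key=lambda kv: kv[1])
            kept = {s: p}
        z = sum(kept.values())
        pruned.append({s: p / z for s, p in kept.items()})
    return pruned

def n_variants(emissions):
    """Number of distinct pronunciations the model can still generate."""
    n = 1
    for dist in emissions:
        n *= len(dist)
    return n

# A word with 3 candidate units at position 1 and 2 at position 2:
emissions = [{'a': 0.7, 'b': 0.25, 'c': 0.05}, {'x': 0.5, 'y': 0.5}]
pruned = prune_emissions(emissions, floor=0.1)
```

In this toy example the rare label `'c'` is pruned, reducing the number of representable pronunciations from 6 to 4; iterating the procedure with data re-alignment between passes would continue to concentrate probability on a few variants.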

Datasets

The datasets used in this study are summarised in Table 1. They have been compiled from recordings of Ugandan community radio stations broadcasting in Ugandan English, Luganda and Acholi, and have been orthographically annotated by mother-tongue speakers (Saeb, Menon, Cameron, Kibira, Quinn, Niesler, 2017; Menon, Saeb, Cameron, Kibira, Quinn, Niesler, 2017). Luganda and Acholi are both severely under-resourced Ugandan languages, with practically no resources available other than the datasets

Summary and conclusions

We have presented a novel sub-word unit discovery and lexicon induction approach that requires as input only recorded speech utterances and their associated orthographic transcriptions. This includes a new method of SWU discovery based on self-organising HMM-GMM states that are agglomeratively tied across words. In addition, we present a novel pronunciation lexicon induction method that models variation using an HMM that generates discrete SWU sequences and that iteratively reduces

Acknowledgements

The presented study was supported by Telkom South Africa. All experiments were performed using the University of Stellenbosch’s Rhasatsha HPC or the facilities at the Centre for High Performance Computing (CHPC).

References

  • Chan, W., Jaitly, N., Le, Q. V., Vinyals, O., 2015. Listen, Attend and Spell. CoRR abs/1508.01211. URL:...
  • G. Chen et al.

    Acoustic data-driven pronunciation lexicon generation for logographic languages

    Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2016)
  • Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., 2015. Attention-based Models for Speech Recognition....
  • N. Goel et al.

    Approaches to automatic lexicon learning with limited training examples

    Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing

    (2010)
  • G. Goussard et al.

    Automatic discovery of subword units and pronunciations for automatic speech recognition using TIMIT

    Proceedings of the Annual Symposium of the Pattern Recognition Society of South Africa (PRASA)

    (2010)
  • Graves, A., 2012. Sequence Transduction with Recurrent Neural Networks. CoRR abs/1211.3711 URL:...
  • A. Graves et al.

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

    Proceedings of the 23rd International Conference on Machine Learning

    (2006)
  • A. Graves et al.

    Towards end-to-end speech recognition with recurrent neural networks

    Proceedings of the 31st International Conference on Machine Learning

    (2014)
  • Graves, A., Mohamed, A., Hinton, G. E., 2013. Speech Recognition with Deep Recurrent Neural Networks. CoRR...
  • D. Harwath et al.

    Speech recognition without a lexicon – bridging the gap between graphemic and phonetic systems

    Proceedings of Interspeech

    (2014)
  • A. Jansen et al.

    Towards unsupervised training of speaker independent acoustic models

    Proceedings of Interspeech

    (2011)
  • C.-y. Lee et al.

    A nonparametric Bayesian approach to acoustic model discovery

    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers

    (2012)
  • C.-y. Lee et al.

    Unsupervised lexicon discovery from acoustic input

    Transactions of the Association for Computational Linguistics

    (2015)
  • C.-y. Lee et al.

    Joint learning of phonetic units and word pronunciations for ASR

    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

    (2013)
☆This paper has been recommended for acceptance by R. K. Moore.