Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system☆
Introduction
We describe the development of large vocabulary speech recognition systems for South African English (SAE), which is considered an under-resourced variety of English because very little annotated speech data is currently available (Davel et al., 2011; Kamper et al., 2012a). Although SAE may be considered under-resourced, other varieties of English, notably North American (US) English, have abundant resources for the development of speech technology. The primary aim of the research presented in this paper is to determine how best to capitalise on these extensive existing language, pronunciation and acoustic modelling resources in the development of our South African (SA) speech transcription system. We consider the following two research scenarios and have structured the paper accordingly.
Firstly, we investigate the performance penalty incurred when a specific SA system component is absent and its US counterpart is used instead. To achieve this, we perform a balanced comparison in which SA and US systems are developed under equivalent model training conditions using speech corpora of similar size and character. We highlight language, pronunciation and acoustic differences through cross-domain experiments in which SA language models, pronunciation dictionaries and acoustic models are replaced by their US counterparts and vice-versa. By balancing the training conditions of the SA and US systems we try to minimise performance differences due to mismatches in training corpus nature and size. The results of this investigation identify the components upon which future SA resource collection and system development efforts should focus and the components which can feasibly be replaced by their US counterparts.
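The cross-domain substitutions described above can be pictured as a grid in which each of the three system components is drawn from either the SA or the US domain. The following Python sketch only enumerates that grid. The component names and the helper functions are hypothetical labels chosen for illustration, and the text does not state that all eight combinations were evaluated.

```python
from itertools import product

# Hypothetical labels for the three components named in the text.
COMPONENTS = ("language_model", "pronunciation_dictionary", "acoustic_model")
DOMAINS = ("SA", "US")


def cross_domain_configs():
    """Return every assignment of a source domain (SA or US) to each component."""
    return [dict(zip(COMPONENTS, choice))
            for choice in product(DOMAINS, repeat=len(COMPONENTS))]


def substituted(config, home="SA"):
    """List the components whose source differs from the home domain."""
    return [name for name, domain in config.items() if domain != home]


configs = cross_domain_configs()
for cfg in configs:
    swapped = substituted(cfg)
    label = ", ".join(swapped) if swapped else "none (matched SA system)"
    print(cfg, "-> US components:", label)
```

Comparing the error rate of each configuration against the fully matched SA system would then isolate the penalty attributable to each substituted component.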
Secondly, we determine whether the extensive US acoustic and language modelling resources could be used to improve on the performance of an SA system. Here the US dataset is not limited artificially in size. Rather, the complete and much larger US dataset is considered and experiments are performed in order to determine the extent to which the US resources can be useful in the development of the SA system. These experiments reflect a typical under-resourced setting in which the in-domain data is limited but can be supplemented from extensive out-of-domain sources.
Background
Several studies have considered modelling approaches for different varieties of the same language. For example, Chengalvarayan (2001) dealt with the recognition of American, Australian and British varieties of English and showed that a single acoustic model obtained by pooling data outperformed a system employing separate models for each variety in parallel. Other authors have considered adaptation approaches in which a model trained on one variety is adapted using data from another variety.
South African acoustic data
The work presented here is based on a recently compiled corpus of SA broadcast news (Kamper et al., 2012a). The broadcast news domain is attractive for the development of large vocabulary speech recognition systems in under-resourced environments because it provides both a ready source of audio data as well as a variety of speech styles and quality. The SA corpus consists of approximately 20 h of audio recordings from one of the country's main radio news channels, SAFM. News bulletins were
General experimental procedure
All acoustic models (AMs) were developed following the same procedure, which is similar to that proposed in Hain et al. (2010). Audio data was converted into a stream of 39-dimensional mel-frequency perceptual linear prediction (MF-PLP) feature vectors (Woodland et al., 1997). Cepstral means were subtracted on a per-utterance basis and subsequently cepstral variance normalisation was performed on a per-bulletin basis. Using the HTK tools (Young et al., 2009), state-clustered phonetic
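The two normalisation steps mentioned above (mean subtraction per utterance, then variance normalisation per bulletin) can be sketched in plain Python as follows. This is a minimal illustration and not the HTK pipeline used by the authors. The nested-list data layout, the function names, the variance floor and the two-dimensional toy features (instead of 39 dimensions) are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def subtract_utterance_mean(utterance):
    """Cepstral mean subtraction over one utterance (a list of feature frames)."""
    dims = range(len(utterance[0]))
    means = [fmean([frame[d] for frame in utterance]) for d in dims]
    return [[frame[d] - means[d] for d in dims] for frame in utterance]


def normalise_bulletin(bulletin, floor=1e-8):
    """Mean-subtract each utterance, then scale every dimension by the
    standard deviation computed over all frames of the bulletin."""
    centred = [subtract_utterance_mean(utt) for utt in bulletin]
    frames = [frame for utt in centred for frame in utt]
    dims = range(len(frames[0]))
    # The floor (an assumption) guards against division by zero.
    stds = [max(pstdev([frame[d] for frame in frames]), floor) for d in dims]
    return [[[frame[d] / stds[d] for d in dims] for frame in utt]
            for utt in centred]


# Toy bulletin: two utterances with 2-dimensional frames (made-up values).
bulletin = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[10.0, 0.0], [14.0, 4.0], [12.0, 2.0]],
]
normalised = normalise_bulletin(bulletin)
```

After this step every utterance has zero mean in each dimension, and each dimension has unit variance when measured over the whole bulletin.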
Substituting SA with US resources
In this section we investigate the effect of replacing SA language models, pronunciation dictionaries and acoustic models with their US counterparts and vice-versa. The aim is to assess the performance penalties involved when incorporating US system components into our SA system. The results will enable us to differentiate between speech resources that can feasibly be inherited from the US domain and speech resources which are best developed separately for the SA domain. For the experiments in
Augmenting SA with US resources
In Section 5 we considered the case in which an SA language, pronunciation or acoustic model was assumed to be unavailable and consequently a US counterpart was used instead. In this section we consider the case in which the US resources are available in addition to the SA training material. The aim is to determine whether performance improvements can be obtained by augmenting the available SA resources with US resources during system development.
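One common way to augment a small in-domain language model with a larger out-of-domain one is linear interpolation, with the weight chosen to minimise perplexity on held-out in-domain text. The snippet above does not say which augmentation method the authors used, so the sketch below is an assumption made for illustration only. It uses add-alpha smoothed unigram models, and the function names and the toy word lists are made up.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * out-of-domain."""
    return {w: lam * p_in[w] + (1.0 - lam) * p_out[w] for w in p_in}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_prob = sum(math.log(model[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def best_weight(p_in, p_out, dev_tokens, steps=20):
    """Grid search for the interpolation weight with lowest dev perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda lam: perplexity(interpolate(p_in, p_out, lam), dev_tokens))


# Made-up toy text standing in for SA (in-domain) and US (out-of-domain) data.
sa_tokens = "the rand fell against the dollar in johannesburg".split()
us_tokens = "the dollar rose in new york as the senate met in washington".split()
dev_tokens = "the rand rose against the dollar".split()
vocab = sorted(set(sa_tokens + us_tokens + dev_tokens))

p_sa = unigram_model(sa_tokens, vocab)
p_us = unigram_model(us_tokens, vocab)
lam = best_weight(p_sa, p_us, dev_tokens)
print("interpolation weight:", lam)
print("dev perplexity:", perplexity(interpolate(p_sa, p_us, lam), dev_tokens))
```

Because the candidate weights include 0 and 1, the selected mixture is never worse on the held-out text than either model on its own.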
Overall conclusions
We have presented an experimental evaluation of the use of North American (US) resources in the development of a South African (SA) large vocabulary speech recognition system.
Speech recognition results showed that a US recognition system in its unmodified form is not suitable for use within the South African domain. Directed experiments indicated that differences between the two domains are present in language modelling data, in pronunciations, as well as in acoustic modelling data. In a
Acknowledgements
This research was supported financially by the Royal Society and the South African National Research Foundation (NRF) (UID68470) under a South Africa – UK Science Network grant. Parts of this work were executed using the High Performance Computer (HPC) facility at Stellenbosch University. The authors would like to thank Alison Wileman, for her hard work on the South African English pronunciation dictionary and transcriptions, and Matt Gibson, for his helpful comments and suggestions.
References (39)
- et al. Multidialectal Spanish acoustic modeling for speech recognition. Speech Commun. (2009)
- et al. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. (1999)
- et al. Multi-accent acoustic modelling of South African English. Speech Commun. (2012)
- et al. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Commun. (2005)
- et al. Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Commun. (2001)
- et al. Subspace-GMM acoustic models for under-resourced languages: feasibility study
- et al. Retrieval of broadcast news documents with the THISL system
- et al. Language modeling for automatic Turkish broadcast news transcription
- et al. Bootstrap estimates for confidence intervals in ASR performance evaluation
White South African English: phonology
Segmentation, classification and clustering of an Italian broadcast news corpus
Accent-independent universal HMM-based speech recognizer for American, Australian and British English
Efficient harvesting of internet audio for resource-scarce ASR
Woefzela – an open-source platform for ASR data collection in the developing world
Modeling Northern and Southern varieties of Dutch for STT
Speaker-independent upfront dialect adaptation in a large vocabulary continuous speech recognizer
1997 English broadcast news speech (HUB4)
Progress in the CU-HTK broadcast news transcription system
IEEE Trans. Acoust. Speech Signal Process.
Where are we in transcribing French broadcast news?
☆ This paper has been recommended for acceptance by ‘Saraclar Murat’.