Elsevier

Computer Speech & Language

Volume 28, Issue 6, November 2014, Pages 1255-1268

Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system

https://doi.org/10.1016/j.csl.2014.04.005

Highlights

  • We try to develop or improve a South African (SA) speech recogniser with US data.

  • The SA domain is under-resourced while the US domain is very well-resourced.

  • US pronunciations and language models can feasibly replace SA counterparts.

  • US acoustic data used in an SA system results in a large performance penalty.

  • US acoustic and language model data slightly improve an SA system by adaptation.

Abstract

South African English is currently considered an under-resourced variety of English. Extensive speech resources are, however, available for North American (US) English. In this paper we consider the use of these US resources in the development of a South African large vocabulary speech recognition system. Specifically we consider two research questions. Firstly, we determine the performance penalties that are incurred when using US instead of South African language models, pronunciation dictionaries and acoustic models. Secondly, we determine whether US acoustic and language modelling data can be used in addition to the much more limited South African resources to improve speech recognition performance. In the first case we find that using a US pronunciation dictionary or a US language model in a South African system results in fairly small penalties. However, a substantial penalty is incurred when using a US acoustic model. In the second investigation we find that small but consistent improvements over a baseline South African system can be obtained by the additional use of US acoustic data. Larger improvements are obtained when complementing the South African language modelling data with US and/or UK material. We conclude that, when developing resources for an under-resourced variety of English, the compilation of acoustic data should be prioritised, language modelling data has a weaker effect on performance and the pronunciation dictionary the smallest.

Introduction

We describe the development of large vocabulary speech recognition systems for South African English (SAE), which is considered an under-resourced variety of English because very little annotated speech data is currently available (Davel et al., 2011; Kamper et al., 2012a). By contrast, other varieties of English, notably North American (US) English, have abundant resources for the development of speech technology. The primary aim of the research presented in this paper is to determine how best to capitalise on these existing and extensive language, pronunciation and acoustic modelling resources in the development of our South African (SA) speech transcription system. We consider the following two research scenarios and have structured the paper accordingly.

Firstly, we investigate the performance penalty incurred when a specific SA system component is absent and its US counterpart is used instead. To achieve this, we perform a balanced comparison in which SA and US systems are developed under equivalent model training conditions using speech corpora of similar size and character. We highlight language, pronunciation and acoustic differences through cross-domain experiments in which SA language models, pronunciation dictionaries and acoustic models are replaced by their US counterparts and vice-versa. By balancing the training conditions of the SA and US systems we try to minimise performance differences due to mismatches in training corpus nature and size. The results of this investigation identify the components upon which future SA resource collection and system development efforts should focus and the components which can feasibly be replaced by their US counterparts.

Secondly, we determine whether the extensive US acoustic and language modelling resources could be used to improve on the performance of an SA system. Here the US dataset is not limited artificially in size. Rather, the complete and much larger US dataset is considered and experiments are performed in order to determine the extent to which the US resources can be useful in the development of the SA system. These experiments reflect a typical under-resourced setting in which the in-domain data is limited but can be supplemented from extensive out-of-domain sources.

Section snippets

Background

Several studies have considered modelling approaches for different varieties of the same language. For example, Chengalvarayan (2001) dealt with the recognition of American, Australian and British varieties of English and showed that a single acoustic model obtained by pooling data outperformed a system employing separate models for each variety in parallel. Other authors have considered adaptation approaches in which a model trained on one variety is adapted using data from another variety.

South African acoustic data

The work presented here is based on a recently compiled corpus of SA broadcast news (Kamper et al., 2012a). The broadcast news domain is attractive for the development of large vocabulary speech recognition systems in under-resourced environments because it provides both a ready source of audio data and a variety of speech styles and quality. The SA corpus consists of approximately 20 h of audio recordings from one of the country's main radio news channels, SAFM. News bulletins were

General experimental procedure

All acoustic models (AMs) were developed following the same procedure, which is similar to that proposed in Hain et al. (2010). Audio data was converted into a stream of 39-dimensional mel-frequency perceptual linear prediction (MF-PLP) feature vectors (Woodland et al., 1997). Cepstral means were subtracted on a per-utterance basis and subsequently cepstral variance normalisation was performed on a per-bulletin basis. Using the HTK tools (Young et al., 2009), state-clustered phonetic
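The two normalisation steps named above (cepstral mean subtraction per utterance, then cepstral variance normalisation per bulletin) can be illustrated with a minimal sketch. It is not the HTK implementation used by the authors; the function names and the toy two-dimensional features are hypothetical, and the snippet does not say exactly how the variance was estimated, so pooling over all frames of a bulletin is an assumption.

```python
from math import sqrt


def subtract_mean(utterance):
    """Cepstral mean subtraction over a single utterance (a list of frames)."""
    dim = len(utterance[0])
    means = [sum(frame[d] for frame in utterance) / len(utterance) for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in utterance]


def normalise_bulletin(bulletin, eps=1e-8):
    """Per-utterance mean subtraction, then per-bulletin variance normalisation."""
    centred = [subtract_mean(utt) for utt in bulletin]
    frames = [frame for utt in centred for frame in utt]
    dim = len(frames[0])
    # Every utterance is already zero-mean, so the pooled variance of each
    # coefficient is simply the mean of its squared values over the bulletin.
    stds = [sqrt(sum(f[d] ** 2 for f in frames) / len(frames)) for d in range(dim)]
    # eps guards against division by zero for a constant coefficient.
    return [[[f[d] / max(stds[d], eps) for d in range(dim)] for f in utt] for utt in centred]


# Hypothetical toy bulletin: two utterances of two frames each, with
# 2-dimensional features (the paper uses 39-dimensional MF-PLP vectors).
bulletin = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[10.0, 0.0], [14.0, 4.0]],
]
normalised = normalise_bulletin(bulletin)
```

After this step each utterance has zero mean per coefficient, and each coefficient has unit variance when pooled over the whole bulletin.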

Substituting SA with US resources

In this section we investigate the effect of replacing SA language models, pronunciation dictionaries and acoustic models with their US counterparts and vice-versa. The aim is to assess the performance penalties involved when incorporating US system components into our SA system. The results will enable us to differentiate between speech resources that can feasibly be inherited from the US domain and speech resources which are best developed separately for the SA domain. For the experiments in

Augmenting SA with US resources

In Section 5 we considered the case in which an SA language, pronunciation or acoustic model was assumed to be unavailable and consequently a US counterpart was used instead. In this section we consider the case in which the US resources are available in addition to the SA training material. The aim is to determine whether performance improvements can be obtained by augmenting the available SA resources with US resources during system development.
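One common way to combine limited in-domain text with larger out-of-domain text is linear interpolation of language models, with the weight chosen to minimise perplexity on held-out in-domain data. The snippet above does not state which combination method the authors used, so the sketch below is an assumption for illustration only. It uses toy unigram models, and all names and probabilities are hypothetical.

```python
from math import exp, log


def interpolate(p_sa, p_us, lam):
    """Linear interpolation: lam * P_SA(w) + (1 - lam) * P_US(w)."""
    vocab = set(p_sa) | set(p_us)
    return {w: lam * p_sa.get(w, 0.0) + (1.0 - lam) * p_us.get(w, 0.0) for w in vocab}


def perplexity(model, words):
    """Unigram perplexity of a word sequence under the given model."""
    return exp(-sum(log(model[w]) for w in words) / len(words))


def best_weight(p_sa, p_us, dev_words, steps=20):
    """Grid search for the interpolation weight with lowest dev-set perplexity."""
    # Weights strictly between 0 and 1 keep every vocabulary word at non-zero probability.
    candidates = [i / steps for i in range(1, steps)]
    return min(candidates, key=lambda lam: perplexity(interpolate(p_sa, p_us, lam), dev_words))


# Hypothetical toy unigram models and a tiny held-out SA development set.
p_sa = {"rand": 0.5, "news": 0.3, "braai": 0.2}
p_us = {"dollar": 0.5, "news": 0.4, "rand": 0.1}
dev_words = ["rand", "news", "braai", "news"]

lam = best_weight(p_sa, p_us, dev_words)
mixed = interpolate(p_sa, p_us, lam)
```

A real system would interpolate n-gram models over a shared vocabulary and tune the weight on a proper development set; the grid search here only illustrates the principle.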

Overall conclusions

We have presented an experimental evaluation of the use of North American (US) resources in the development of a South African (SA) large vocabulary speech recognition system.

Speech recognition results showed that a US recognition system in its unmodified form is not suitable for use within the South African domain. Directed experiments indicated that differences between the two domains are present in language modelling data, in pronunciations, as well as in acoustic modelling data. In a

Acknowledgements

This research was supported financially by the Royal Society and the South African National Research Foundation (NRF) (UID68470) under a South Africa – UK Science Network grant. Parts of this work were executed using the High Performance Computer (HPC) facility at Stellenbosch University. The authors would like to thank Alison Wileman, for her hard work on the South African English pronunciation dictionary and transcriptions, and Matt Gibson, for his helpful comments and suggestions.

References (39)

  • M. Cettolo

    Segmentation, classification and clustering of an Italian broadcast news corpus

  • R. Chengalvarayan

    Accent-independent universal HMM-based speech recognizer for American, Australian and British English

  • M.H. Davel et al.

    Efficient harvesting of internet audio for resource-scarce ASR

  • N.J. De Vries et al.

    Woefzela – an open-source platform for ASR data collection in the developing world

  • J. Despres et al.

    Modeling Northern and Southern varieties of Dutch for STT

  • V. Fischer et al.

    Speaker-independent upfront dialect adaptation in a large vocabulary continuous speech recognizer

  • J. Fiscus et al.

    1997 English broadcast news speech (HUB4)

    (1998)

  • M.J.F. Gales et al.

    Progress in the CU-HTK broadcast news transcription system

    IEEE Trans. Acoust. Speech Signal Process.

    (2006)

  • J.L. Gauvain et al.

    Where are we in transcribing French broadcast news?

This paper has been recommended for acceptance by ‘Saraclar Murat’.
