
Pattern Recognition Letters

Volume 131, March 2020, Pages 213-218

Creating speaker independent ASR system through prosody modification based data augmentation

https://doi.org/10.1016/j.patrec.2019.12.019

Highlights

  • Developing automatic speech recognition (ASR) systems that are less affected by speaker-dependent variations.

  • Prosody-modification-based data augmentation is explored for creating a speaker-independent ASR system.

  • Prosody factors, namely pitch and duration, are explored to obtain pattern variations for training the speaker-independent ASR system.

  • Relative improvements of 11.5% and 27.0% are obtained on decoding the adults’ and children’s test sets, respectively.

Abstract

In this paper, the effect of prosody-modification-based data augmentation is explored in the context of automatic speech recognition (ASR). The primary motive is to develop ASR systems that are less affected by speaker-dependent acoustic variations. The two factors contributing towards inter-speaker variability focused on in this paper are pitch and speaking-rate variations. In order to simulate such an ASR task, we have trained an ASR system on adults’ speech and tested it using speech data from adult as well as child speakers. Compared to the adults’ test case, the recognition rates are severely degraded when the test speech is from child speakers. The observed degradation is primarily due to large differences in pitch and speaking-rate between adults’ and children’s speech. To overcome this problem, the pitch and speaking-rate of the training speech are modified to create new versions of the data. The original and the modified versions are then pooled together in order to capture greater acoustic variability. The ASR system trained on the augmented data is noted to be more robust towards speaker-dependent variations. Relative improvements of 11.5% and 27.0% over the baseline are obtained on decoding the adults’ and children’s speech test sets, respectively.

Introduction

Since the introduction of hidden Markov models (HMMs) to better capture temporal variations, research on automatic speech recognition (ASR) has witnessed tremendous growth. At the same time, several different approaches for modeling the observation densities of the HMM states have been proposed [31]. State-of-the-art ASR systems employ deep neural networks (DNNs) [5] for modeling these observation densities. In cases where the training data sufficiently represents the acoustic attributes expected in the test data, DNN-based ASR systems are able to robustly learn the internal variability and produce high recognition rates. However, when the acoustic properties differ significantly between the training and test datasets, the recognition performance is reported to degrade. To overcome this issue, researchers have resorted to multi-style training [13]. Since obtaining real multi-style training data is extremely difficult, creating additional data through simulation is the commonly employed alternative. For example, in [7], [10], [16], simulated data was added to the original training set in order to improve recognition rates under noisy and reverberant test conditions. Motivated by those studies, we explore prosody-modification-based data augmentation in this paper to enhance the speaker-independence of ASR systems. In other words, the primary objective of this paper is to enable ASR systems to deal with speech data from male/female speakers belonging to various age groups. To simulate such extreme differences, we have explored the task of transcribing speech data from a very heterogeneous group of speakers, including children. In general, speaker-independence is a highly desired attribute in the context of speech-based user applications [23].

Speaker-independence can be enhanced by pooling large amounts of speech data from a varied group of speakers. This helps in modeling the acoustic variations resulting from factors like age and gender differences. In this paper, we assume that the other factors contributing towards inter-speaker variability are already well represented and focus only on pitch and speaking-rate variations. When the available speech data is insufficient to represent a particular speaker-dependent acoustic attribute, severely degraded recognition performance is observed if that attribute is dominant in the test data. For example, consider the task of decoding speech from adult female speakers using an ASR system trained on data from adult male speakers. Speech data from adult males does not effectively capture the acoustic variability inherent in adult females’ speech. Consequently, the error rates are observed to be higher. Similarly, children’s speech is substantially different from adults’ in terms of both speaking-rate and pitch. The pitch of children’s speech is significantly higher than that of adult male/female speakers [3], [9], [11], [28]. Further, the speaking-rate of children is slower than that of adults [4], [11], [30]. Therefore, when children’s speech is transcribed using an ASR system trained on adults’ speech, the error rates increase significantly [12], [25], [27], [28].

In this study, in order to mitigate the ill effects of the acoustic mismatch resulting from increased inter-speaker variability, the pitch and duration of speech data collected from adult speakers are modified. The original and modified versions are then pooled together, and the statistical model parameters are trained on the combined data. A prosody modification technique based on anchoring of glottal closure instants (GCIs) is used to change the speaking-rate and pitch. The zero-frequency filtering (ZFF) approach is employed to compute the GCI locations [14]. The experimental evaluations presented in this paper show that prosody-modification-based data augmentation helps in enhancing the recognition performance. Significant improvements are noted not only for children’s speech but also for the task of transcribing adults’ speech. In other words, the proposed approach improves children’s ASR even when no domain-specific data is used, without hampering the recognition performance for adults’ speech. To the best of our knowledge, prosody-modification-based data augmentation has not previously been studied for improving children’s ASR. Moreover, the employed ZFF-GCI-based prosody modification approach is faster and more accurate than other existing methods [18]. At the same time, unlike our earlier work on children’s ASR [26], prosody modification is not applied to the test data. Avoiding this extra processing of the test data reduces the decoding time. Furthermore, since both pitch and speaking-rate are modified simultaneously, there is no need for vocal tract length perturbation [8].
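To make the GCI computation step concrete, the following is a minimal NumPy sketch of ZFF-based GCI detection in the spirit of [14]. It is an illustrative sketch rather than the authors’ implementation: the signal is passed through a cascade of two zero-frequency resonators (realized here as four successive cumulative sums), the resulting polynomial trend is removed by repeated local-mean subtraction, and the GCIs are taken as the negative-to-positive zero crossings of the filtered signal. The 10 ms window length and the three trend-removal passes are assumed values; the window should be roughly one to one-and-a-half times the average pitch period.

```python
import numpy as np

def zff_gci(signal, fs, win_ms=10.0):
    """Sketch of zero-frequency-filtering (ZFF) based GCI detection.

    win_ms is an assumed default (~1-1.5x the average pitch period).
    For long recordings, process in short blocks to limit the
    numerical growth of the integrated signal.
    """
    s = np.asarray(signal, dtype=np.float64)
    # Difference the signal to remove any slowly varying DC component.
    x = np.diff(s, prepend=s[:1])
    # Cascade of two zero-frequency resonators, i.e. four successive
    # integrations; each resonator is y[n] = 2*y[n-1] - y[n-2] + x[n].
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # The resonator output grows polynomially with n; remove the trend
    # by repeated subtraction of the local mean (three passes here).
    w = max(1, int(win_ms * 1e-3 * fs))
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")
    # GCIs correspond to the negative-to-positive zero crossings of
    # the trend-removed (zero-frequency filtered) signal.
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
```

The detected GCI locations serve as the anchor points at which pitch periods are repeated, deleted, or rescaled during prosody modification.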

The remainder of this paper is organized as follows: In Section 2, the proposed data augmentation approach based on prosody modification is outlined. In Section 3, the experimental evaluations are presented. Finally, the paper is concluded in Section 4.

Section snippets

Need for prosody-modification-based data augmentation

Due to the paucity of publicly available speech data from child speakers, developing well-optimized ASR systems for children’s speech is a very challenging problem. On the other hand, large amounts of speech data from adult speakers are freely available [15], [21]. Therefore, in order to deal with the data scarcity, one can transcribe children’s speech using an ASR system trained on adults’ data. But this approach leads to very high error rates [4], [22]. The observed poor performances are primarily due to

Experimental evaluations

In this section, the datasets used for the experimental evaluations are described first. This is followed by the description of the ASR systems employed for evaluation. Next, the experimental results demonstrating the effectiveness of prosody-modification-based data augmentation are presented.
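This preview does not reproduce the exact modification factors or tooling used to build the augmented training set. As an illustration only, the sketch below generates pitch- and speaking-rate-modified copies of each training utterance with librosa’s signal-processing effects, a generic stand-in for the ZFF-GCI-based prosody modification actually employed in the paper; the PITCH_STEPS and TEMPO_RATES grids are hypothetical values.

```python
import librosa
import soundfile as sf

# Hypothetical modification grids; the factors used in the paper are
# not given in this preview.
PITCH_STEPS = (-2, 2)     # semitone shifts applied to the copies
TEMPO_RATES = (0.9, 1.1)  # speaking-rate (time-stretch) factors

def augment_utterance(wav_path, out_prefix):
    """Write prosody-modified copies of one training utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    for n_steps in PITCH_STEPS:
        for rate in TEMPO_RATES:
            # Pitch and speaking-rate are modified jointly, mirroring
            # the simultaneous modification described in the paper.
            y_mod = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
            y_mod = librosa.effects.time_stretch(y_mod, rate=rate)
            sf.write(f"{out_prefix}_p{n_steps:+d}_r{rate}.wav", y_mod, sr)
```

Pooling the unmodified originals together with all modified copies yields the augmented training set on which the acoustic models are then trained.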

Conclusion

The effectiveness of prosody-modification-based data augmentation has been studied in this work in order to enhance the robustness of ASR systems towards speaker-dependent acoustic variations. We have focused mainly on two factors contributing towards inter-speaker variability, namely pitch and speaking-rate variations. To simulate such an ASR task, we trained an ASR system on adults’ speech and tested it using speech data from adult as well as child speakers. Due to the

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

References (31)

  • R. Smits et al.

    Determination of instants of significant excitation in speech using group delay function

    IEEE Trans. Speech Audio Process.

    (1995)
  • A. Batliner et al.

    The PF_STAR children’s speech corpus

    Proc. INTERSPEECH

    (2005)
  • K. Deepak et al.

    Speech and EGG polarity detection using Hilbert envelope

    Proc. TENCON

    (2015)
  • S. Eguchi et al.

    Development of speech sounds in children

    Acta Oto-Laryngol.

    (1969)
  • M. Gerosa et al.

    A review of ASR technologies for children’s speech

    Proc. Workshop on Child, Computer and Interaction

    (2009)
  • G.E. Hinton et al.

    Deep neural networks for acoustic modeling in speech recognition

    IEEE Signal Process. Mag.

    (2012)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • R. Hsiao et al.

    Robust speech recognition in unknown reverberant and noisy conditions

    Proc. ASRU

    (2015)
  • N. Jaitly et al.

    Vocal tract length perturbation (VTLP) improves speech recognition

    Proc. ICML Workshop on Deep Learning for Audio, Speech and Language

    (2013)
  • R.D. Kent

    Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies

    J. Speech Hearing Res.

    (1976)
  • T. Ko et al.

    A study on data augmentation of reverberant speech for robust speech recognition

    Proc. ICASSP

    (2017)
  • S. Lee et al.

    Acoustics of children’s speech: developmental changes of temporal and spectral parameters

    J. Acoust. Soc. Am.

    (1999)
  • H. Liao et al.

    Large vocabulary automatic speech recognition for children

    Proc. INTERSPEECH

    (2015)
  • R. Lippmann et al.

    Multi-style training for robust isolated-word speech recognition

    Proc. ICASSP

    (1987)
  • K.S.R. Murthy et al.

    Epoch extraction from speech signals

    IEEE Trans. Audio Speech Lang. Process.

    (2008)