Creating speaker independent ASR system through prosody modification based data augmentation
Introduction
Since the introduction of hidden Markov models (HMM) to better capture temporal variations, research on automatic speech recognition (ASR) has witnessed tremendous growth. At the same time, several different approaches to model the observation densities of the HMM states have been proposed [31]. The state-of-the-art ASR systems employ deep neural networks (DNN) [5] for modeling the observation densities of the HMM states. When the training data sufficiently represents the acoustic attributes expected in the test data, DNN-based ASR systems are able to robustly learn the internal variability and produce high recognition rates. However, when the acoustic properties differ significantly across the training and test datasets, the recognition performance is reported to degrade. To overcome this issue, researchers have resorted to multi-style training [13]. Since obtaining real multi-style training data is extremely difficult, creating more data by simulation is the commonly employed alternative. For example, in [7], [10], [16], simulated data was added to the original training set in order to improve the recognition rates under noisy and reverberant test conditions. Motivated by those studies, we have explored prosody-modification-based data augmentation in this paper to enhance the speaker-independence of ASR systems. In other words, the primary objective of this paper is to enable ASR systems to deal with speech data from male/female speakers belonging to various age groups. To simulate such extreme differences, we have explored the task of transcribing speech data from a very heterogeneous group of speakers including children. In general, speaker-independence is a highly desired attribute in the context of speech-based user applications [23].
Speaker-independence can be enhanced by pooling a large amount of speech data from a varied group of speakers. This helps in modeling the acoustic variations resulting from factors like age and gender differences. In this paper, we have assumed that the other factors contributing to inter-speaker variability are already well represented, and we focus only on pitch and speaking-rate variations. When the available speech data is insufficient in representing a particular speaker-dependent acoustic attribute, severely degraded recognition performance is observed if that attribute is dominant in the test data. For example, consider the task of decoding speech from adult female speakers using an ASR system trained on data from adult male speakers. Speech data from adult males does not effectively capture the acoustic variability inherent in adult females’ speech. Consequently, error rates are observed to be higher. Similarly, children’s speech differs substantially from adults’ in terms of both speaking-rate and pitch. The pitch of children’s speech is significantly higher than that of adult male/female speakers [3], [9], [11], [28]. Further, the speaking-rate of children is slower than that of adults [4], [11], [30]. Therefore, when children’s speech is transcribed using an ASR system trained on adults’ speech, error rates are noted to increase significantly [12], [25], [27], [28].
In this study, in order to mitigate the ill-effects of the acoustic mismatch resulting from increased inter-speaker variability, the pitch and duration of speech data collected from adult speakers are modified. The original and modified versions are then pooled together, and the statistical model parameters are trained on the mixed data. A prosody modification technique based on anchoring of glottal closure instants (GCIs) is used to change the speaking-rate and pitch. The approach based on zero-frequency filtering (ZFF) is employed to compute the GCI locations [14]. The experimental evaluations presented in this paper show that prosody-modification-based data augmentation helps in enhancing the recognition performance. Significant improvements are noted not only in the case of children’s speech but also for the task of transcribing adults’ speech. In other words, the proposed approach improves children’s ASR even when no domain-specific data is used, without hampering the recognition performance for adults’ speech. To the best of our knowledge, prosody-modification-based data augmentation has not been studied for improving children’s ASR. Moreover, the employed ZFF-GCI-based prosody modification approach is faster and more accurate than other existing methods [18]. At the same time, unlike our earlier work on children’s ASR [26], prosody modification is not applied to the test data. Avoiding this extra processing of the test data reduces the decoding time. Furthermore, since both pitch and speaking-rate are modified simultaneously, there is no need for vocal tract length perturbation [8].
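To make the pooling recipe concrete, the sketch below mimics the "original plus prosody-modified copies" training-set construction using simple linear-interpolation resampling, which scales pitch and speaking-rate by the same factor. This is only an illustrative stand-in under stated assumptions: the method actually used in this work anchors the modification to ZFF-derived GCIs and can alter pitch and duration independently; the function names and modification factors here are hypothetical.

```python
import numpy as np

def resample_speed(x, factor):
    """Crude prosody modification via linear-interpolation resampling.

    factor > 1 shortens the utterance and raises its pitch by the same
    proportion; factor < 1 does the opposite. Note: unlike the GCI-anchored
    method employed in the paper, this cannot change pitch and
    speaking-rate independently.
    """
    n_out = int(round(len(x) / factor))
    positions = np.arange(n_out) * factor  # fractional sample indices
    return np.interp(positions, np.arange(len(x)), x)

def augment_corpus(utterances, factors=(0.9, 1.1, 1.25)):
    """Pool the original utterances with prosody-modified copies,
    mirroring the 'original + modified versions' strategy."""
    pooled = list(utterances)  # keep the originals
    for f in factors:
        pooled.extend(resample_speed(u, f) for u in utterances)
    return pooled
```

With three modification factors, the pooled training set is four times the original size; the acoustic models are then trained on this mixed data, while the test data is left unmodified.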
The remainder of this paper is organized as follows: In Section 2, the proposed data augmentation approach based on prosody modification is outlined. In Section 3, the experimental evaluations are presented. Finally, the paper is concluded in Section 4.
Need for prosody-modification-based data augmentation
Due to the paucity of publicly available speech data from child speakers, developing well-optimized ASR systems for children’s speech is a very challenging problem. On the other hand, large amounts of speech data from adult speakers are freely available [15], [21]. Therefore, in order to deal with the data scarcity, one can transcribe children’s speech using an ASR system trained on adults’ data. However, this approach leads to very high error rates [4], [22]. The observed poor performance is primarily due to
Experimental evaluations
In this section, the datasets used for the experimental evaluations are described first. This is followed by the description of the ASR systems employed for evaluation. Next, the experimental results demonstrating the effectiveness of prosody-modification-based data augmentation are presented.
Conclusion
The effectiveness of prosody-modification-based data augmentation has been studied in this work in order to enhance the robustness of ASR systems to speaker-dependent acoustic variations. We have focused mainly on two factors contributing to inter-speaker variability: pitch and speaking-rate variations. To simulate such an ASR task, we have trained an ASR system on adults’ speech and tested it using speech data from adult as well as child speakers. Due to the
Declaration of Competing Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
References
- et al., Determination of instants of significant excitation in speech using group delay function, IEEE Trans. Speech Audio Process. (1995)
- et al., The PF_STAR children’s speech corpus, Proc. INTERSPEECH (2005)
- et al., Speech and EGG polarity detection using Hilbert envelope, Proc. TENCON (2015)
- et al., Development of speech sounds in children, Acta Oto-Laryngologica (1969)
- et al., A review of ASR technologies for children’s speech, Proc. Workshop on Child, Computer and Interaction (2009)
- et al., Deep neural networks for acoustic modeling in speech recognition, Signal Process. Mag. (2012)
- et al., Long short-term memory, Neural Comput. (1997)
- et al., Robust speech recognition in unknown reverberant and noisy conditions, Proc. ASRU (2015)
- et al., Vocal tract length perturbation (VTLP) improves speech recognition, Proc. ICML Workshop on Deep Learning for Audio, Speech and Language (2013)
- Anatomical and neuromuscular maturation of the speech mechanism: evidence from acoustic studies, J. Speech Hearing Res. (1976)