Abstract
The input variability caused by speaker changes is one of the most important factors limiting the performance of speech recognition systems. One solution is adaptation or normalization of the input, in which the parameters of the input representation are mapped to those of a single speaker, or the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods that compensate for some of the effects of speaker changes on the speech recognition process. In all three methods, a feed-forward neural network is first trained to map the input onto codes representing the phonetic classes and the speakers. Then, among the 71 speakers used in training, the one with the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted to the corresponding speech of this reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point in the input space of every decision region relating to each phone of each speaker, for all phones and all speakers. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm and keeping the reference speaker code as the desired speaker output, we correct all speech signal frames, in both the training and the test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation.
The phonetic output retrieved from the direct network, along with the reference speaker code, is given to the inverse network. Using this information, the inverse network yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the results of the final network with those of the un-adapted network, based on the highest confidence level, yields increases of 2.1, 2.6 and 3.0% in phone recognition accuracy on clean speech for the three methods, respectively.
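The core mechanism shared by the second method can be sketched in a few lines: a trained feed-forward network maps a frame to phone and speaker codes, and back-propagation is then run with the network weights frozen, pushing the error gradient all the way back to the *input* so that the frame drifts toward what the reference speaker would have produced. The sketch below is illustrative only, assuming a one-hidden-layer network with randomly initialized weights; the dimensions, learning rate, and function names are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 24-dim input frame, 30 phone classes, 71 speakers, one hidden layer.
D, H, P, S = 24, 32, 30, 71
W1 = rng.normal(0, 0.1, (H, D)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (P + S, H)); b2 = np.zeros(P + S)

def forward(x):
    """Map an input frame to joint phone + speaker output codes."""
    h = np.tanh(W1 @ x + b1)
    y = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # sigmoid outputs
    return h, y

def adapt_frame(x, phone_id, ref_speaker_id, steps=200, lr=0.5):
    """Back-propagate the output error to the input, holding the
    weights fixed, so the frame moves toward the reference speaker
    (a sketch of the idea behind the second method)."""
    t = np.zeros(P + S)
    t[phone_id] = 1.0            # preserve the phonetic content
    t[P + ref_speaker_id] = 1.0  # clamp the reference speaker code
    x = x.copy()
    for _ in range(steps):
        h, y = forward(x)
        d_out = (y - t) * y * (1.0 - y)          # sigmoid output delta
        d_hid = (W2.T @ d_out) * (1.0 - h ** 2)  # tanh hidden delta
        x -= lr * (W1.T @ d_hid)                 # gradient w.r.t. input
    return x

frame = rng.normal(0, 1, D)
adapted = adapt_frame(frame, phone_id=3, ref_speaker_id=0)
```

After adaptation, the output code for `adapted` lies closer to the target (correct phone, reference speaker) than that of the original frame, which is the sense in which the frame has been "normalized" to the reference speaker before recognition.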
Acknowledgments
This study was carried out with the support and cooperation of the Research Center of Intelligent Signal Processing (RCISP).
Cite this article
Nejadgholi, I., Seyyedsalehi, S.A. Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks. Neural Comput & Applic 18, 45–55 (2009). https://doi.org/10.1007/s00521-007-0151-5