Abstract
The input variability caused by speaker changes is one of the most important factors limiting the performance of speech recognition systems. One solution is adaptation or normalization of the input, in which the parameters of the input representation are mapped to those of a single speaker, or the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods that compensate for some of the effects of speaker changes on the speech recognition process. In all three methods, a feed-forward neural network is first trained to map the input onto codes representing the phonetic classes and the speakers. Then, among the 71 speakers used in training, the one with the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted to the corresponding speech of this reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point in the input space of every decision region relating to each phone of each speaker, for all phones and all speakers. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm and keeping the reference speaker code as the desired speaker output, we correct all speech signal frames, in both the training and the test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation.
The phonetic output retrieved from the direct network, along with the reference speaker code, is given to the inverse network. Using this information, the inverse network yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the results of the final network with those of the un-adapted network, based on the highest confidence level, yields increases of 2.1, 2.6 and 3.0% in phone recognition accuracy on clean speech for the three methods, respectively.
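The core mechanism shared by the second method can be sketched in a few lines: a trained feed-forward network maps a frame to phone and speaker codes, and back-propagation is then run with the network weights frozen, pushing the error gradient all the way back to the *input* so that the frame drifts toward what the reference speaker would have produced. The sketch below is illustrative only, assuming a one-hidden-layer network with randomly initialized weights; the dimensions, learning rate, and function names are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 24-dim input frame, 30 phone classes, 71 speakers, one hidden layer.
D, H, P, S = 24, 32, 30, 71
W1 = rng.normal(0, 0.1, (H, D)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (P + S, H)); b2 = np.zeros(P + S)

def forward(x):
    """Map an input frame to joint phone + speaker output codes."""
    h = np.tanh(W1 @ x + b1)
    y = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # sigmoid outputs
    return h, y

def adapt_frame(x, phone_id, ref_speaker_id, steps=200, lr=0.5):
    """Back-propagate the output error to the input, holding the
    weights fixed, so the frame moves toward the reference speaker
    (a sketch of the idea behind the second method)."""
    t = np.zeros(P + S)
    t[phone_id] = 1.0            # preserve the phonetic content
    t[P + ref_speaker_id] = 1.0  # clamp the reference speaker code
    x = x.copy()
    for _ in range(steps):
        h, y = forward(x)
        d_out = (y - t) * y * (1.0 - y)          # sigmoid output delta
        d_hid = (W2.T @ d_out) * (1.0 - h ** 2)  # tanh hidden delta
        x -= lr * (W1.T @ d_hid)                 # gradient w.r.t. input
    return x

frame = rng.normal(0, 1, D)
adapted = adapt_frame(frame, phone_id=3, ref_speaker_id=0)
```

After adaptation, the output code for `adapted` lies closer to the target (correct phone, reference speaker) than that of the original frame, which is the sense in which the frame has been "normalized" to the reference speaker before recognition.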
Acknowledgments
This study was carried out with the support and cooperation of the Research Center of Intelligent Signal Processing (RCISP).
Cite this article
Nejadgholi, I., Seyyedsalehi, S.A. Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks. Neural Comput & Applic 18, 45–55 (2009). https://doi.org/10.1007/s00521-007-0151-5