
Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks

  • Original Article
Neural Computing and Applications

Abstract

Input variability caused by speaker changes is one of the most important factors limiting the accuracy of speech recognition systems. One solution is adaptation or normalization of the input, in which the parameters of the input representation are adapted to those of a single speaker, so that the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods, each compensating for some of the effects of speaker changes on the recognition process. In all three methods, a feed-forward neural network is first trained to map the input onto codes representing the phonetic classes and the speakers. Among the 71 speakers used in training, the one showing the highest phone recognition accuracy is then selected as the reference speaker, and the representation parameters of the other speakers are converted to those of the corresponding speech uttered by the reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point of every decision region in the input space, for every phone of every speaker. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, the error back-propagation algorithm is applied with the reference speaker's data held as the desired output, and all speech signal frames, in both the training and test datasets, are corrected so that they coincide with the corresponding speech of the reference speaker. In the third method, a second feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation. The phonetic output retrieved from the direct network, together with the reference speaker's code, is fed to this inverse network, which yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the results of the final network with those of the unadapted network on the basis of the highest confidence level yields increases of 2.1, 2.6, and 3.0% in phone recognition accuracy on clean speech for the three methods, respectively.
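The second method described above rests on a general technique: with a trained network's weights frozen, the error between its output and the reference speaker's target code is back-propagated all the way to the input frame, which is then shifted along the negative gradient. The following is a minimal sketch of that idea on a toy single-layer sigmoid network; the network size, weights, learning rate, and data below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the trained phonetic/speaker mapping network:
# 4 input features -> 3 sigmoid output units, weights kept frozen.
W = rng.normal(size=(4, 3))
b = np.zeros(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    return sigmoid(W.T @ x + b)

def adapt_frame(x, target, lr=0.2, steps=200):
    """Shift the input frame x (not the weights) so that the frozen
    network's output approaches the reference-speaker target code."""
    x = x.copy()
    for _ in range(steps):
        y = forward(x)
        # Gradient of the squared error w.r.t. the input, obtained by
        # back-propagating through the fixed layer.
        delta = (y - target) * y * (1 - y)
        grad_x = W @ delta
        x -= lr * grad_x
    return x

x = rng.normal(size=4)                 # a speech frame from some speaker
target = np.array([1.0, 0.0, 0.0])     # hypothetical reference output code

x_adapted = adapt_frame(x, target)
err_before = np.sum((forward(x) - target) ** 2)
err_after = np.sum((forward(x_adapted) - target) ** 2)
print(f"squared error before: {err_before:.3f}, after: {err_after:.3f}")
```

In the paper's setting the same update would be applied to every frame of the training and test sets, producing a speaker-normalized corpus on which the final recognizer is trained and evaluated.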



Acknowledgments

This study was carried out with the support and cooperation of the Research Center of Intelligent Signal Processing (RCISP).

Author information

Corresponding author

Correspondence to Isar Nejadgholi.


Cite this article

Nejadgholi, I., Seyyedsalehi, S.A. Nonlinear normalization of input patterns to speaker variability in speech recognition neural networks. Neural Comput & Applic 18, 45–55 (2009). https://doi.org/10.1007/s00521-007-0151-5

