Abstract
This paper describes an efficient constructive training algorithm for a Multi-Layer Perceptron (MLP) neural network dedicated to Isolated Word Recognition (IWR) systems. An incremental training procedure was employed, based on a novel approach to recruiting hidden neurons in a single hidden layer. During the Neural Network (NN) training phase, the number of pronunciation samples drawn from the Training Data (TD) was increased sequentially. The proposed MLP constructive training algorithm yielded an optimal structure for the NN classifier together with an optimized TD size.
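The interplay between neuron recruiting and training-set growth can be sketched as a simple search loop. This is only an illustrative sketch, not the authors' algorithm: the stopping rule, the `min_gain` threshold, and the `evaluate` callback (assumed to train an MLP with a given number of hidden neurons on a given number of samples and return a validation error) are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def constructive_search(
    evaluate: Callable[[int, int], float],
    sample_steps: Sequence[int],
    max_hidden: int,
    target_error: float,
    min_gain: float = 1e-3,
) -> Tuple[int, int, float]:
    """Grow hidden layer and training set until the error target is met.

    evaluate(n_hidden, n_samples) is a hypothetical callback that trains a
    single-hidden-layer MLP and returns its validation error.
    """
    if not sample_steps:
        raise ValueError("sample_steps must not be empty")
    n_hidden = 1
    for n_samples in sample_steps:
        err = evaluate(n_hidden, n_samples)
        # Recruit one hidden neuron at a time while it still helps.
        while err > target_error and n_hidden < max_hidden:
            trial = evaluate(n_hidden + 1, n_samples)
            if err - trial < min_gain:
                break  # extra neuron no longer pays off; try more data
            n_hidden, err = n_hidden + 1, trial
        if err <= target_error:
            break
    return n_hidden, n_samples, err


if __name__ == "__main__":
    # Toy stand-in for real training: error falls with neurons and data.
    toy = lambda h, s: 1.0 / h + 10.0 / s
    print(constructive_search(toy, [20, 40, 80], 10, 0.3, 0.01))
```

In this sketch the network never shrinks: neurons recruited at a smaller training-set size are kept when more samples are added, which mirrors the constructive (grow-only) idea described above.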
An isolated word recognition system based on the MLP neural network was then constructed and tested on ten words extracted from the TIMIT database. Mel Frequency Cepstral Coefficient (MFCC) features were extracted, including energy and first- and second-derivative coefficients.
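As an illustration of how first- and second-derivative coefficients can be appended to static features, the sketch below uses the common regression formula for delta coefficients. The window width of 2 and the edge handling (repeating the first and last frame) are assumptions; the exact settings used in the paper are not stated here.

```python
from typing import List

Frames = List[List[float]]


def delta(frames: Frames, window: int = 2) -> Frames:
    """Regression-based time derivative of each feature, frame by frame."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                # Clamp indices so edge frames reuse the first/last frame.
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(frames: Frames, window: int = 2) -> Frames:
    """Concatenate static, first-derivative and second-derivative features."""
    d1 = delta(frames, window)
    d2 = delta(d1, window)
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

Each input frame is assumed to already hold the static MFCC values plus the energy term, so the output frames are three times as long as the input frames.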
A proposed Frame-by-Frame Neural Network (FFNN) classification method was explored and compared with the Conventional Neural Network (CNN) classification approach. The Principal Component Analysis (PCA) technique was also investigated to reduce both the TD size and the complexity of the recognition system.
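One plausible reading of frame-by-frame classification is that the network labels each frame separately and the word decision aggregates those labels. The sketch below uses a majority vote; this aggregation rule is an assumption made for illustration and is not confirmed as the authors' FFNN decision rule. `frame_classifier` is a hypothetical callable standing in for a trained network.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def classify_word_by_frames(
    frames: Sequence[List[float]],
    frame_classifier: Callable[[List[float]], Hashable],
) -> Hashable:
    """Label every frame, then return the most frequent label.

    Ties go to the label that was seen first among the tied ones.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    votes = Counter(frame_classifier(frame) for frame in frames)
    label, _count = votes.most_common(1)[0]
    return label


if __name__ == "__main__":
    # Toy classifier: thresholds the first feature of each frame.
    toy = lambda frame: "yes" if frame[0] > 0.5 else "no"
    print(classify_word_by_frames([[0.1], [0.9], [0.8]], toy))
```

A conventional classifier would instead map one fixed-size representation of the whole utterance to a word label in a single pass; the voting scheme trades that for many small per-frame decisions.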
Experimental results showed that the proposed FFNN classifier outperformed its CNN counterpart, with a significant improvement in recognition rate.
Cite this article
Masmoudi, S., Frikha, M., Chtourou, M. et al. Efficient MLP constructive training algorithm using a neuron recruiting approach for isolated word recognition system. Int J Speech Technol 14, 1–10 (2011). https://doi.org/10.1007/s10772-010-9082-0
DOI: https://doi.org/10.1007/s10772-010-9082-0