Front end analysis of speech recognition: a review

International Journal of Speech Technology

Abstract

Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. Despite these advances, machines cannot match the performance of their human counterparts in accuracy and speed, especially in speaker-independent speech recognition. A significant portion of current speech recognition research therefore focuses on the speaker-independent problem, motivated by its wide range of applications and by the limitations of available recognition techniques. Before recognition, the speech signal must be processed to obtain its feature vectors, so front end analysis plays an important role. In this review we briefly discuss the different aspects of front end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal. We also discuss the advantages and disadvantages of each feature extraction technique, along with the suitability of each method for particular applications.


Author information

Corresponding author

Correspondence to S. K. Katti.

About this article

Cite this article

Anusuya, M.A., Katti, S.K. Front end analysis of speech recognition: a review. Int J Speech Technol 14, 99–145 (2011). https://doi.org/10.1007/s10772-010-9088-7

  • DOI: https://doi.org/10.1007/s10772-010-9088-7
