Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing

International Journal of Speech Technology

Abstract

A growing body of recent work documents the potential benefits of sub-band processing over wideband processing in automatic speech recognition and, less commonly, in speaker recognition. The sub-band approach often, but not always, improves performance, especially in the presence of noise. This raises the question of precisely when and how sub-band processing might be advantageous, which is difficult to answer because there is as yet only a rudimentary theoretical framework guiding this work. We describe a simple sub-band speaker recognition system designed to facilitate experimentation aimed at increasing understanding of the approach. The system splits the time-domain speech signal into 16 sub-bands using a bank of second-order filters spaced on the psychophysical mel scale. Each sub-band has its own separate cepstral-based recognition system, and the outputs of these systems are combined using the sum rule to produce a final decision. We find that sub-band processing leads to worthwhile reductions in both the verification and identification error rates relative to the wideband system, reducing the identification error rate from 3.33% to 0.56% and the equal error rate for verification by approximately 50% for clean speech. We advance the hypothesis that, unlike the wideband system, sub-band processing effectively constrains the free parameters of the speaker models to be more uniformly deployed across frequency: as such, it offers a practical solution to the bias/variance dilemma of data modeling. Much remains to be done to explore fully the new paradigm of sub-band processing. Accordingly, several avenues for future work are identified. In particular, we aim to explore in more depth the hypothesis that sub-band processing offers a practical solution to the bias/variance dilemma.
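
The front end and the fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mel formula, the 100 Hz to 4000 Hz band limits, the 8 kHz sampling rate, the Q value, and all function names are assumptions chosen for illustration, and the per-band cepstral recognisers are replaced by hypothetical per-speaker scores that are assumed to be comparable across bands.

```python
import math

def hz_to_mel(f_hz):
    # A common mel approximation (assumption; the abstract does not give the formula).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centre_frequencies(n_bands=16, f_low=100.0, f_high=4000.0):
    """Centre frequencies equally spaced on the mel scale (band limits are assumed)."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_bands + 1)
    return [mel_to_hz(m_low + (k + 1) * step) for k in range(n_bands)]

def bandpass_biquad(fc, fs, q):
    """Second-order band-pass coefficients with unity gain at fc (standard biquad design)."""
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def filter_signal(b, a, x):
    """Apply the second-order difference equation to a list of samples."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def split_into_sub_bands(x, fs=8000.0, n_bands=16, q=4.0):
    """Return one filtered time-domain signal per sub-band (fs and q are assumed)."""
    bands = []
    for fc in mel_centre_frequencies(n_bands, 100.0, fs / 2.0):
        b, a = bandpass_biquad(fc, fs, q)
        bands.append(filter_signal(b, a, x))
    return bands

def sum_rule_identify(band_scores):
    """Sum rule: add each speaker's score over all sub-band recognisers
    and pick the speaker with the largest total (higher score = better match)."""
    totals = {}
    for scores in band_scores:
        for speaker, score in scores.items():
            totals[speaker] = totals.get(speaker, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical scores from three sub-band recognisers for two enrolled speakers.
example_scores = [
    {"spk_a": 0.6, "spk_b": 0.4},
    {"spk_a": 0.3, "spk_b": 0.7},
    {"spk_a": 0.7, "spk_b": 0.3},
]
winner, totals = sum_rule_identify(example_scores)
```

In this toy example the second recogniser prefers `spk_b`, but the summed evidence still favours `spk_a`, which is the kind of behaviour the sum rule is meant to give: no single sub-band decides the outcome on its own.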

About this article

Cite this article

Finan, R., Damper, R. & Sapeluk, A. Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing. International Journal of Speech Technology 4, 45–62 (2001). https://doi.org/10.1023/A:1009652732313

  • DOI: https://doi.org/10.1023/A:1009652732313
