Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition

  • Chapter
Speech Separation by Humans and Machines

Summary

It is believed that computational auditory approaches are potentially extremely useful in ameliorating some of the most difficult speech recognition problems, specifically the recognition of speech presented at low SNRs, speech masked by other speech, speech masked by music, and speech in highly reverberant environments. Solving these problems with CASA techniques is likely to depend on developing several key elements of signal processing: the reliable detection of fundamental frequency for isolated speech and for multiple simultaneously presented speech sounds, the reliable detection of modulations of amplitude and frequency in very narrowband channels, and across-frequency correlation approaches that can identify frequency bands with coherent microactivity as they evolve over time. I am extremely optimistic that effective solutions to these problems are within reach in the near future.
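
As a rough illustration of the first element named above, detecting the fundamental frequency of isolated speech, the following is a minimal sketch of a generic textbook autocorrelation estimator. It is not the chapter's method. The function names, the 50 to 400 Hz search range, and the synthetic three-harmonic test signal are all assumptions made for this example, and the sketch handles only a single voice, not the multiple simultaneous voices the summary also mentions.

```python
import math


def synth_voiced(f0_hz, sample_rate, num_samples):
    """Hypothetical test signal: three harmonics of f0 with decaying amplitudes."""
    amps = (1.0, 0.5, 0.25)
    return [
        sum(a * math.sin(2.0 * math.pi * (k + 1) * f0_hz * n / sample_rate)
            for k, a in enumerate(amps))
        for n in range(num_samples)
    ]


def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 in Hz as sample_rate / lag of the largest autocorrelation value.

    The search range (f0_min, f0_max) is an assumed range for speech,
    not a value taken from the chapter.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Usage: a 100 ms frame of a synthetic 200 Hz "voice" sampled at 8 kHz.
frame = synth_voiced(200.0, 8000, 800)
print(estimate_f0(frame, 8000))
```

An estimator this simple is prone to octave errors and degrades quickly in noise or with competing talkers, which is one reason the summary singles out reliable F0 detection as an open problem.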

Copyright information

© 2005 Springer Science + Business Media, Inc.

About this chapter

Cite this chapter

Stern, R.M. (2005). Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_9

  • DOI: https://doi.org/10.1007/0-387-22794-6_9

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-8001-2

  • Online ISBN: 978-0-387-22794-8

  • eBook Packages: Engineering, Engineering (R0)
