Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition

Stern, Richard M.

doi:10.1007/0-387-22794-6_9


Summary

Computational auditory approaches are potentially extremely useful in ameliorating some of the most difficult speech recognition problems, specifically the recognition of speech presented at low signal-to-noise ratios (SNRs), speech masked by other speech, speech masked by music, and speech in highly reverberant environments. Solving these problems with computational auditory scene analysis (CASA) techniques is likely to depend on developing several key elements of signal processing: the reliable detection of fundamental frequency, both for isolated speech and for multiple simultaneously presented speech sounds; the reliable detection of modulations of amplitude and frequency in very narrowband channels; and across-frequency correlation approaches that can identify frequency bands with coherent microactivity as they evolve over time. I am extremely optimistic that effective solutions for these problems are within reach in the near future.
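
As a rough illustration of the first element named above, fundamental frequency detection for isolated speech, the following is a minimal sketch of a textbook autocorrelation peak-picking estimator. It is not the chapter's method. The function name `estimate_f0`, the search range of 60 to 400 Hz, and the synthetic two-harmonic test tone are all assumptions chosen for illustration. The sketch handles only a single clean periodic source and would not cope with the multi-talker case the summary describes.

```python
import math

def estimate_f0(signal, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) as sample_rate / lag, where lag maximizes the
    raw (biased) autocorrelation within the allowed pitch range.
    Hypothetical helper for illustration only."""
    min_lag = int(sample_rate / f0_max)   # shortest period considered
    max_lag = int(sample_rate / f0_min)   # longest period considered
    n = len(signal)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Raw sum over the overlapping samples; because fewer terms are
        # summed at longer lags, multiples of the true period score lower.
        value = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame: 200 Hz fundamental plus a weaker
# second harmonic, 100 ms at an assumed 8 kHz sampling rate.
sample_rate = 8000
true_f0 = 200.0
tone = [
    math.sin(2 * math.pi * true_f0 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2 * true_f0 * t / sample_rate)
    for t in range(800)
]
estimate = estimate_f0(tone, sample_rate)
print(f"estimated F0: {estimate:.1f} Hz")
```

The unnormalized sum is a deliberate choice here: with a perfectly periodic input, a length-normalized autocorrelation has equal peaks at every multiple of the period, so the raw sum is used to favor the shortest one.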



Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Richard M. Stern


Editor information

Editors and Affiliations

East Bay Institute for Research and Education, USA
Pierre Divenyi

About this chapter

Cite this chapter

Stern, R.M. (2005). Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_9

DOI: https://doi.org/10.1007/0-387-22794-6_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)
