Summary
Computational auditory approaches are potentially extremely useful in ameliorating some of the most difficult speech recognition problems: the recognition of speech presented at low signal-to-noise ratios (SNRs), speech masked by other speech, speech masked by music, and speech in highly reverberant environments. Solving these problems with computational auditory scene analysis (CASA) techniques is likely to depend on developing several key elements of signal processing: reliable detection of fundamental frequency, both for isolated speech and for multiple simultaneously presented speech sounds; reliable detection of amplitude and frequency modulation in very narrowband channels; and across-frequency correlation approaches that can identify frequency bands with coherent microactivity as they evolve over time. I am extremely optimistic that effective solutions to these problems are within reach in the near future.
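As a rough illustration of the first of these elements, fundamental frequency detection for isolated speech, the following minimal sketch estimates F0 for a single frame from the peak of its autocorrelation. It is not the method of this chapter or of any cited work; the function name, the search range of 60 to 400 Hz, and the synthetic test signal are assumptions chosen for illustration, and the much harder case of several simultaneous voices is not addressed.

```python
import numpy as np


def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the F0 (in Hz) of one frame from its autocorrelation peak.

    Hypothetical helper for illustration only. The search range
    (f0_min, f0_max) is an assumed range for voiced speech.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()  # remove any DC offset

    # Autocorrelation, keeping non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    # Convert the F0 search range into a range of lags (in samples).
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)

    # The lag with the largest autocorrelation is taken as the period.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag


if __name__ == "__main__":
    # Synthetic "voiced" frame: three harmonics of 200 Hz, 0.1 s at 16 kHz.
    sample_rate = 16000
    t = np.arange(1600) / sample_rate
    frame = (np.sin(2 * np.pi * 200 * t)
             + 0.5 * np.sin(2 * np.pi * 400 * t)
             + 0.25 * np.sin(2 * np.pi * 600 * t))
    print(estimate_f0_autocorr(frame, sample_rate))
```

For this clean synthetic frame the estimate should come out close to 200 Hz. On real speech at low SNR, or with competing talkers, a plain autocorrelation peak of this kind is unreliable, which is the difficulty the summary points to.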
Copyright information
© 2005 Springer Science + Business Media, Inc.
About this chapter
Cite this chapter
Stern, R.M. (2005). Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_9
DOI: https://doi.org/10.1007/0-387-22794-6_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)