Conclusions
The purely bottom-up approach to auditory perception is inconsistent with the wealth of evidence suggesting that the neural topology involved in sound understanding is more convoluted. One can build a system that separates sounds based on their cochleagram or correlogram representations, but such a system appears inconsistent with the functional connections of the auditory system. Instead, our brains seem to abstract sounds and to solve the auditory scene analysis problem using high-level representations of each sound object.
There has been work that addresses some of these problems, but it solves an engineering problem (how do we separate sounds?) rather than building a model of human perception. One such solution, proposed by Barker and his colleagues (2001), combines a low-level perceptual model with a top-down statistical language model. This is a promising direction for the engineering problem (how do we improve speech recognition in the face of noise?), but nobody has evaluated how suitable a hidden Markov model is for modeling human language perception.
A bigger problem is understanding at which stage acoustic restoration is performed. It seems unlikely that the brain reconstructs the full acoustic waveform before performing sound recognition. It seems more likely that sound understanding and sound separation occur in concert, and that the brain understands only the concepts. Later, upon introspection, the full word can be imagined.
Much remains to be done to understand how humans perform sound separation, and to understand where CASA researchers should go next. But systems that combine low-level and high-level cues are clearly important.
References
Assmann, P.F. and Summerfield, Q., 1990, Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies, J. Acoust. Soc. Am. 88, pp. 680–697.
Barker, J., Cooke, M., and Ellis, D.P.W., 2001, Integrating bottom-up and top-down constraints to achieve robust ASR: The multisource decoder. Presented at the CRAC workshop, Aalborg, Denmark.
Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge, MA.
Cole, R.A., Mariani, J., Uszkoreit, H., Zaenen, A., Zue, V. (eds.), 1996, Survey of the State of the Art in Human Language Technology, http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html.
Cooke, M. and Ellis, D.P.W., 2001, The auditory organization of speech and other sources in listeners and computational models, Speech Comm., vol. 35, no. 3–4, pp. 141–177.
Grossberg, S., Govindarajan, K.K., Wyse, L.L., and Cohen, M.A., 2003, ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks.
Ladefoged, P., 1989, A note on ‘Information conveyed by vowels,’ J. Acoust. Soc. Am. 85, pp. 2223–2224.
Lee, T.-W., Bell, A., Lambert, R.H., 1997, Blind separation of delayed and convolved sources. In: Advances in Neural Information Processing Systems, vol. 9. Cambridge, MA, pp. 758–764.
Licklider, J.C.R., 1951, A duplex theory of pitch perception, Experientia 7, pp. 128–134.
Marr, D., 1982, Vision, W. H. Freeman and Co.
Meddis, R. and Hewitt, M.J., 1991, Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification, J. Acoust. Soc. Am., vol. 89, no. 6, pp. 2866–2882.
Pardo, B. and Birmingham, W., 2002, Improved Score Following for Acoustic Performances, International Computer Music Conference 2002, Gothenburg, Sweden.
Remez, R.E., Rubin, P.E., Pisoni, D.B. and Carrell, T.D., 1981, Speech perception without traditional speech cues, Science, 212, pp. 947–950.
Quatieri, T.F., 2002, Discrete-Time Speech Signal Processing: Principles and Practice. Prentice-Hall.
Roweis, S.T., 2003, Factorial Models and Refiltering for Speech Separation and Denoising, Proceedings of Eurospeech03 (Geneva, Switzerland), pp. 1009–1012.
Slaney, M. and Lyon, R.F., 1990, A perceptual pitch detector. Proceedings of the International Conference on Acoustics, Speech and Signal Processing.
Slaney, M., 1996, Pattern Playback in the ’90s, Advances in Neural Information Processing Systems 7, Gerald Tesauro, David Touretzky, and Todd Leen (eds.), MIT Press, Cambridge, MA.
Slaney, M., 1998, A critique of pure audition, Computational Auditory Scene Analysis, edited by David Rosenthal and Hiroshi G. Okuno, Erlbaum.
Warren, R.M., 1970, Perceptual restoration of missing speech sounds, Science, 167, pp. 393–395.
Weintraub, M., 1986, A computational model for separating two simultaneous talkers, Proc. of ICASSP ’86, Vol. 11, pp. 81–84.
Copyright information
© 2005 Springer Science + Business Media, Inc.
About this chapter
Cite this chapter
Slaney, M. (2005). The History and Future of CASA. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_13
DOI: https://doi.org/10.1007/0-387-22794-6_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)