The History and Future of CASA

Slaney, Malcolm

doi:10.1007/0-387-22794-6_13

Conclusions

The purely bottom-up approach to auditory perception is clearly inconsistent with the wealth of evidence suggesting that the neural topology involved in sound understanding is more convoluted. One can build a system that separates sounds based on their cochleagram or correlogram representations, but such a system appears inconsistent with the brain's functional connections. Instead, our brains seem to abstract sounds and solve the auditory scene analysis problem using high-level representations of each sound object.

There has been work that addresses some of these problems, but it solves an engineering problem (how do we separate sounds) rather than building a model of human perception. One such solution, proposed by Barker and his colleagues (2001), combines a low-level perceptual model with a top-down statistical language model. This is a promising direction for solving the engineering problem (how do we improve speech recognition in the face of noise), but nobody has evaluated the suitability of modeling human-language perception with a hidden-Markov model.

A bigger problem is understanding at which stage acoustic restoration is performed. It seems unlikely that the brain reconstructs the full acoustic waveform before performing sound recognition. Instead, it seems more likely that sound understanding and sound separation occur in concert and that the brain only understands the concepts. Later, upon introspection, the full word can be imagined.

Much remains to be done to understand how humans perform sound separation and to understand where CASA researchers should go. But clearly, systems that combine low-level and high-level cues are important.

References

Assmann, P.F. and Summerfield, Q., 1990, Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies, J. Acoust. Soc. Am. 88, pp. 680–697.
Barker, J., Cooke, M., and Ellis, D.P.W., 2001, Integrating bottom-up and top-down constraints to achieve robust ASR: The multisource decoder. Presented at the CRAC workshop, Aalborg, Denmark.
Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge, MA.
Cole, R.A., Mariani, J., Uszkoreit, H., Zaenen, A., Zue, V. (eds.), 1996, Survey of the State of the Art in Human Language Technology, http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html.
Cooke, M. and Ellis, D.P.W., 2001, The auditory organization of speech and other sources in listeners and computational models, Speech Comm., vol. 35, no. 3–4, pp. 141–177.
Grossberg, S., Govindarajan, K.K., Wyse, L.L., and Cohen, M.A., 2003, ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks.
Ladefoged, P., 1989, A note on ‘Information conveyed by vowels,’ J. Acoust. Soc. Am. 85, pp. 2223–2224.
Lee, T.-W., Bell, A., Lambert, R.H., 1997, Blind separation of delayed and convolved sources. In: Advances in Neural Information Processing Systems, vol. 9. Cambridge, MA, pp. 758–764.
Licklider, J.C.R., 1951, A duplex theory of pitch perception, Experientia 7, pp. 128–134.
Marr, D., 1982, Vision, W. H. Freeman and Co.
Meddis, R. and Hewitt, M.J., 1991, Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification, J. Acoust. Soc. Am., vol. 89, no. 6, pp. 2866–2882.
Pardo, B. and Birmingham, W., 2002, Improved Score Following for Acoustic Performances, International Computer Music Conference 2002, Gothenburg, Sweden.
Remez, R.E., Rubin, P.E., Pisoni, D.B. and Carrell, T.D., 1981, Speech perception without traditional speech cues, Science, 212, pp. 947–950.
Quatieri, T.F., 2002, Discrete-Time Speech Signal Processing: Principles and Practice. Prentice-Hall.
Roweis, S.T., 2003, Factorial Models and Refiltering for Speech Separation and Denoising, Proceedings of Eurospeech03 (Geneva, Switzerland), pp. 1009–1012.
Slaney, M. and Lyon, R.F., 1990, A perceptual pitch detector. Proceedings of the International Conference on Acoustics, Speech and Signal Processing.
Slaney, M., 1996, Pattern Playback in the ’90s, Advances in Neural Information Processing Systems 7, Gerald Tesauro, David Touretzky, and Todd Leen (eds.), MIT Press, Cambridge, MA.
Slaney, M., 1998, A critique of pure audition, Computational Auditory Scene Analysis, edited by David Rosenthal and Hiroshi G. Okuno, Erlbaum.
Warren, R.M., 1970, Perceptual restoration of missing speech sounds, Science, 167, pp. 393–395.
Weintraub, M., 1986, A computational model for separating two simultaneous talkers. Proc. of ICASSP ’86, Vol. 11, pp. 81–84.

Author information

Authors and Affiliations

IBM Almaden Research Center, USA
Malcolm Slaney

Editor information

Editors and Affiliations

East Bay Institute for Research and Education, USA
Pierre Divenyi

About this chapter

Cite this chapter

Slaney, M. (2005). The History and Future of CASA. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_13

DOI: https://doi.org/10.1007/0-387-22794-6_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-8001-2
Online ISBN: 978-0-387-22794-8
eBook Packages: Engineering, Engineering (R0)