Conclusions

A purely bottom-up approach to auditory perception is inconsistent with the wealth of evidence suggesting that the neural topology involved in sound understanding is more convoluted. One can build a system that separates sounds based on their cochleagram or correlogram representations, but such a system appears inconsistent with the brain's functional connections. Instead, our brains seem to abstract sounds and solve the auditory scene analysis problem using high-level representations of each sound object.

Some work addresses these problems, but it solves an engineering problem (how do we separate sounds?) rather than building a model of human perception. One such solution, proposed by Barker and his colleagues (2001), combines a low-level perceptual model with a top-down statistical language model. This is a promising direction for solving the engineering problem (how do we improve speech recognition in the face of noise?), but nobody has evaluated the suitability of a hidden Markov model for modeling human language perception.

A bigger problem is understanding at which stage acoustic restoration is performed. It seems unlikely that the brain reconstructs the full acoustic waveform before performing sound recognition. Instead, it seems more likely that sound understanding and sound separation occur in concert and that the brain understands only the concepts. Later, upon introspection, the full word can be imagined.

Much remains to be done to understand how humans perform sound separation and where CASA researchers should go next. But clearly, systems that combine low-level and high-level cues are important.

References

  • Assmann, P.F. and Summerfield, Q., 1990, Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies, J. Acoust. Soc. Am. 88, pp. 680–697.

  • Barker, J., Cooke, M., and Ellis, D.P.W., 2001, Integrating bottom-up and top-down constraints to achieve robust ASR: The multisource decoder. Presented at the CRAC workshop, Aalborg, Denmark.

  • Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge, MA.

  • Cole, R.A., Mariani, J., Uszkoreit, H., Zaenen, A., Zue, V. (eds.), 1996, Survey of the State of the Art in Human Language Technology, http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html.

  • Cooke, M. and Ellis, D.P.W., 2001, The auditory organization of speech and other sources in listeners and computational models, Speech Comm., vol. 35, no. 3–4, pp. 141–177.

  • Grossberg, S., Govindarajan, K.K., Wyse, L.L., and Cohen, M.A., 2003, ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks.

  • Ladefoged, P., 1989, A note on ‘Information conveyed by vowels,’ J. Acoust. Soc. Am. 85, pp. 2223–2224.

  • Lee, T.-W., Bell, A., Lambert, R.H., 1997, Blind separation of delayed and convolved sources. In: Advances in Neural Information Processing Systems, vol. 9. Cambridge, MA, pp. 758–764.

  • Licklider, J.C.R., 1951, A duplex theory of pitch perception, Experientia 7, pp. 128–134.

  • Marr, D., 1982, Vision, W. H. Freeman and Co.

  • Meddis, R. and Hewitt, M.J., 1991, Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification, J. Acoust. Soc. Am., vol. 89, no. 6, pp. 2866–2882.

  • Pardo, B. and Birmingham, W., 2002, Improved Score Following for Acoustic Performances, International Computer Music Conference 2002, Gothenburg, Sweden.

  • Remez, R.E., Rubin, P.E., Pisoni, D.B. and Carrell, T.D., 1981, Speech perception without traditional speech cues, Science, 212, pp. 947–950.

  • Quatieri, T.F., 2002, Discrete-Time Speech Signal Processing: Principles and Practice. Prentice-Hall.

  • Roweis, S.T., 2003, Factorial Models and Refiltering for Speech Separation and Denoising, Proceedings of Eurospeech03 (Geneva, Switzerland), pp. 1009–1012.

  • Slaney, M. and Lyon, R.F., 1990, A perceptual pitch detector. Proceedings of the International Conference on Acoustics, Speech and Signal Processing.

  • Slaney, M., 1996, Pattern Playback in the ’90s, Advances in Neural Information Processing Systems 7, Gerald Tesauro, David Touretzky, and Todd Leen (eds.), MIT Press, Cambridge, MA.

  • Slaney, M., 1998, A critique of pure audition, Computational Auditory Scene Analysis, edited by David Rosenthal and Hiroshi G. Okuno, Erlbaum.

  • Warren, R.M., 1970, Perceptual restoration of missing speech sounds, Science, 167, pp. 393–395.

  • Weintraub, M., 1986, A computational model for separating two simultaneous talkers. Proc. of ICASSP ’86, vol. 11, pp. 81–84.

Copyright information

© 2005 Springer Science + Business Media, Inc.

About this chapter

Cite this chapter

Slaney, M. (2005). The History and Future of CASA. In: Divenyi, P. (eds) Speech Separation by Humans and Machines. Springer, Boston, MA. https://doi.org/10.1007/0-387-22794-6_13

  • DOI: https://doi.org/10.1007/0-387-22794-6_13

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-8001-2

  • Online ISBN: 978-0-387-22794-8

  • eBook Packages: Engineering, Engineering (R0)
