To Separate Speech

A System for Recognizing Simultaneous Speech

  • Conference paper
  • Machine Learning for Multimodal Interaction (MLMI 2007)

Abstract

The PASCAL Speech Separation Challenge (SSC) is based on a corpus of sentences from the Wall Street Journal task, read by two speakers simultaneously and captured with two circular eight-channel microphone arrays. This work describes our system for recognizing such simultaneous speech. The system has four principal components. First, a person tracker returns the locations of both active speakers, as well as segmentation information for the two utterances, which are often of unequal length. Second, two beamformers in generalized sidelobe canceller (GSC) configuration separate the simultaneous speech by setting their active weight vectors according to a minimum mutual information (MMI) criterion. Third, a postfilter and a binary mask operating on the outputs of the beamformers further enhance the separated speech. Finally, an automatic speech recognition (ASR) engine based on a weighted finite-state transducer (WFST) returns the most likely word hypotheses for the separated streams. In addition to optimizing each of these components, we investigated the effect of the filter bank design used to perform subband analysis and synthesis during beamforming. On the SSC development data, our system achieved a word error rate of 39.6%.
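
As an illustration of the binary-mask step mentioned in the abstract, the sketch below shows one common form of binary masking: for each subband sample, the beamformer output with the larger magnitude is kept and the other is set to zero. The abstract does not specify the masking rule, so this winner-take-all comparison, the function name `binary_mask`, and the frame layout (a list of frames, each a list of complex subband samples) are assumptions made for illustration, not the authors' implementation.

```python
def binary_mask(stream_a, stream_b):
    """Winner-take-all binary mask over two subband streams (illustrative).

    stream_a, stream_b: lists of time frames; each frame is a list of
    complex subband samples. This layout is a hypothetical choice.
    Returns two masked streams with the same shape as the inputs.
    """
    masked_a, masked_b = [], []
    for frame_a, frame_b in zip(stream_a, stream_b):
        out_a, out_b = [], []
        for x_a, x_b in zip(frame_a, frame_b):
            # Assumed rule: keep the sample with the larger magnitude,
            # zero the other. Ties go to stream A.
            if abs(x_a) >= abs(x_b):
                out_a.append(x_a)
                out_b.append(0j)
            else:
                out_a.append(0j)
                out_b.append(x_b)
        masked_a.append(out_a)
        masked_b.append(out_b)
    return masked_a, masked_b


if __name__ == "__main__":
    # Two frames with two subbands each, standing in for beamformer outputs.
    a = [[1 + 0j, 0.1j], [0.5 + 0j, 2 + 0j]]
    b = [[0.2 + 0j, 3 + 0j], [0.5j, 1 + 0j]]
    masked_a, masked_b = binary_mask(a, b)
    print(masked_a)
    print(masked_b)
```

A hard mask like this assumes that each subband sample is dominated by one speaker; a postfilter, as named in the abstract, would be a separate stage and is not modeled here.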

Editor information

Andrei Popescu-Belis, Steve Renals, Hervé Bourlard

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

McDonough, J. et al. (2008). To Separate Speech. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_25

  • DOI: https://doi.org/10.1007/978-3-540-78155-4_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78154-7

  • Online ISBN: 978-3-540-78155-4

  • eBook Packages: Computer Science, Computer Science (R0)
