To Separate Speech

McDonough, John; Kumatani, Kenichi; Gehrig, Tobias; Stoimenov, Emilian; Mayer, Uwe; Schacht, Stefan; Wölfel, Matthias; Klakow, Dietrich

doi:10.1007/978-3-540-78155-4_25

John McDonough¹,³,
Kenichi Kumatani²,³,
Tobias Gehrig⁴,
Emilian Stoimenov⁴,
Uwe Mayer⁴,
Stefan Schacht¹,
Matthias Wölfel⁴ &
Dietrich Klakow¹

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4892)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

The PASCAL Speech Separation Challenge (SSC) is based on a corpus of sentences from the Wall Street Journal task, read by two speakers simultaneously and captured with two circular eight-channel microphone arrays. This work describes our system for the recognition of such simultaneous speech. Our system has four principal components: a person tracker returns the locations of both active speakers, as well as segmentation information for each utterance, since the two utterances are often of unequal length; two beamformers in generalized sidelobe canceller (GSC) configuration separate the simultaneous speech by setting their active weight vectors according to a minimum mutual information (MMI) criterion; a postfilter and binary mask operating on the outputs of the beamformers further enhance the separated speech; and finally an automatic speech recognition (ASR) engine based on a weighted finite-state transducer (WFST) returns the most likely word hypotheses for the separated streams. In addition to optimizing each of these components, we investigated the effect of the filter bank design used to perform subband analysis and synthesis during beamforming. On the SSC development data, our system achieved a word error rate of 39.6%.
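
The binary mask step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the masking rule, so the version below rests on an assumption: for each subband, the beamformer output with the larger magnitude is kept and the other is attenuated. The function name `binary_mask` and the `floor` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Spectrum = List[complex]  # one frame of subband samples from one beamformer


def binary_mask(y1: Spectrum, y2: Spectrum,
                floor: float = 0.0) -> Tuple[Spectrum, Spectrum]:
    """Per subband, keep the larger-magnitude output and attenuate the other.

    Assumed rule, not confirmed by the abstract. `floor` is a hypothetical
    attenuation factor; 0.0 gives a hard (zeroing) mask.
    """
    if len(y1) != len(y2):
        raise ValueError("both outputs need the same number of subbands")
    out1: Spectrum = []
    out2: Spectrum = []
    for a, b in zip(y1, y2):
        if abs(a) >= abs(b):       # ties are assigned to the first stream
            out1.append(a)
            out2.append(floor * b)
        else:
            out1.append(floor * a)
            out2.append(b)
    return out1, out2


# Toy frame with three subbands per beamformer output.
frame1 = [1.0 + 0.0j, 0.1 + 0.1j, 0.5j]
frame2 = [0.2 + 0.0j, 2.0 + 0.0j, 0.5 + 0.0j]
masked1, masked2 = binary_mask(frame1, frame2)
```

In this toy frame the first and third subbands stay in the first stream (the third is a tie) and the second subband stays in the second stream. A nonzero `floor` would give a softer mask that attenuates rather than zeroes the weaker output.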

Author information

Authors and Affiliations

Spoken Language Systems, Saarland University, Saarbrücken, Germany
John McDonough, Stefan Schacht & Dietrich Klakow
IDIAP Research Institute, Martigny, Switzerland
Kenichi Kumatani
Institute for Intelligent Sensor-Actuator Systems, University of Karlsruhe, Germany
John McDonough & Kenichi Kumatani
Institute for Theoretical Computer Science, University of Karlsruhe, Germany
Tobias Gehrig, Emilian Stoimenov, Uwe Mayer & Matthias Wölfel

Editor information

Andrei Popescu-Belis, Steve Renals, Hervé Bourlard

Cite this paper

McDonough, J. et al. (2008). To Separate Speech. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_25

DOI: https://doi.org/10.1007/978-3-540-78155-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4
eBook Packages: Computer Science, Computer Science (R0)
