
Sparsification for Monaural Source Separation

Chapter in the book Blind Speech Separation

Part of the book series: Signals and Communication Technology (SCT)

We explore the use of sparse representations for separation of a monaural mixture signal, where by a sparse representation we mean one where the number of nonzero elements is smaller than might be expected. This is a surprisingly powerful idea, as the ability to express a signal sparsely in some known, and potentially overcomplete, basis constitutes a strong model, while also lending itself to efficient algorithms. In the framework we explore, the representation of the signal is linear in a vector of coefficients. However, because many coefficient values could represent the same signal, the mapping from signal to coefficients is nonlinear, with the coefficients being chosen to simultaneously represent the signal and maximize a measure of sparsity. This conversion of the signal into the coefficients using L1 optimization is viewed not as a preprocessing step performed before the data reaches the heart of the algorithm, but rather as itself the heart of the algorithm: after the coefficients have been found, only trivial processing remains to be done. We show how, by suitable choice of overcomplete basis, this framework can use a variety of cues (e.g., speaker identity, differential filtering, differential attenuation) to accomplish monaural separation. We also discuss two radically different algorithms for finding the required overcomplete dictionaries: one based on nonnegative matrix factorization of isolated sources, and the other based on end-to-end optimization using automatic differentiation.
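To make the decomposition step concrete, the following is a minimal sketch of the idea rather than the chapter's own implementation. It assumes toy nonnegative dictionaries D1 and D2 (standing in for per-speaker dictionaries that might be learned, e.g., by NMF on isolated recordings), concatenates them into one overcomplete dictionary, recovers nonnegative coefficients for a single mixture spectral frame by L1 (linear-programming) minimization via scipy.optimize.linprog, and re-synthesizes each source from its own block of coefficients. All names, sizes, and data here are hypothetical.

```python
# Sketch of separation by sparsification: decompose one mixture spectral frame
# over a concatenated (overcomplete) dictionary and read off per-source parts.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

F = 64   # frequency bins per spectral frame (toy value)
K = 40   # atoms per hypothetical speaker dictionary

# Hypothetical nonnegative per-speaker dictionaries (e.g., from NMF on
# isolated sources), stacked into an overcomplete dictionary of 2K atoms.
D1 = rng.random((F, K))
D2 = rng.random((F, K))
D = np.hstack([D1, D2])

# Synthesize a mixture frame from a few atoms of each speaker (ground truth).
c_true = np.zeros(2 * K)
c_true[rng.choice(K, 3, replace=False)] = rng.random(3)        # speaker 1
c_true[K + rng.choice(K, 3, replace=False)] = rng.random(3)    # speaker 2
x = D @ c_true

# Sparse decomposition: minimize sum(c) subject to D c = x, c >= 0.
# With nonnegative c, sum(c) equals the L1 norm, so this is an ordinary LP.
res = linprog(c=np.ones(2 * K), A_eq=D, b_eq=x,
              bounds=(0, None), method="highs")
c_hat = res.x

# The "trivial processing" after the coefficients are found: each source
# estimate is the reconstruction from its own block of the dictionary.
s1_hat = D1 @ c_hat[:K]
s2_hat = D2 @ c_hat[K:]

print("mixture residual:", np.linalg.norm(D @ c_hat - x))
print("source 1 error:  ", np.linalg.norm(s1_hat - D1 @ c_true[:K]))
print("source 2 error:  ", np.linalg.norm(s2_hat - D2 @ c_true[K:]))
```

Because the coefficients are constrained to be nonnegative, the L1 objective reduces to a plain linear cost, so the decomposition is an ordinary linear program; once the coefficients have been found, the remaining per-source reconstruction is, as the abstract says, trivial.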




Copyright information

© 2007 Springer

About this chapter

Cite this chapter

Asari, H., Olsson, R.K., Pearlmutter, B.A., Zador, A.M. (2007). Sparsification for Monaural Source Separation. In: Makino, S., Sawada, H., Lee, T.-W. (eds.) Blind Speech Separation. Signals and Communication Technology. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6479-1_14


  • DOI: https://doi.org/10.1007/978-1-4020-6479-1_14

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-6478-4

  • Online ISBN: 978-1-4020-6479-1

