ABSTRACT
In this paper, we explore an application of basis pursuit to audio scene analysis. The goal of our work is to detect when certain sounds are present in a mixed audio signal. We focus on the regime where out of a large number of possible sources, a small but unknown number combine and overlap to yield the observed signal. To infer which sounds are present, we decompose the observed signal as a linear combination of a small number of active sources. We cast the inference as a regularized form of linear regression whose sparse solutions yield decompositions with few active sources. We characterize the acoustic variability of individual sources by autoregressive models of their time-domain waveforms. When we do not have prior knowledge of the individual sources, the coefficients of these autoregressive models must be learned from audio examples. We analyze the dynamical stability of these models and show how to estimate stable models by substituting a simple convex optimization for a difficult eigenvalue problem. We demonstrate our approach by learning dictionaries of musical notes and using these dictionaries to analyze polyphonic recordings of piano, cello, and violin.
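The sparse-regression idea in the abstract can be illustrated with a minimal sketch. It is not the paper's method: fixed unit-norm sinusoids stand in for the learned autoregressive source models, and a plain iterative soft-thresholding (ISTA) loop stands in for whatever solver the authors use. The function names, the dictionary, the mixture weights, and the penalty `lam` are all assumptions made for illustration.

```python
import math


def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def make_dictionary(num_atoms, num_samples):
    """Hypothetical dictionary: one unit-norm sinusoid per candidate source."""
    atoms = []
    for k in range(1, num_atoms + 1):
        atom = [math.sin(2 * math.pi * k * n / num_samples)
                for n in range(num_samples)]
        norm = math.sqrt(sum(a * a for a in atom))
        atoms.append([a / norm for a in atom])
    return atoms


def sparse_decompose(x, atoms, lam, num_iters=300):
    """Minimize 0.5 * ||x - sum_j w[j] * atoms[j]||^2 + lam * sum_j |w[j]|."""
    num_atoms, num_samples = len(atoms), len(x)
    # Conservative step size: the squared Frobenius norm of the dictionary
    # upper-bounds the Lipschitz constant of the least-squares gradient.
    step = 1.0 / sum(a * a for atom in atoms for a in atom)
    w = [0.0] * num_atoms
    for _ in range(num_iters):
        recon = [sum(w[j] * atoms[j][n] for j in range(num_atoms))
                 for n in range(num_samples)]
        resid = [x[n] - recon[n] for n in range(num_samples)]
        # Gradient step on the squared error, then shrink toward zero.
        w = [soft_threshold(
                 w[j] + step * sum(atoms[j][n] * resid[n]
                                   for n in range(num_samples)),
                 step * lam)
             for j in range(num_atoms)]
    return w


# Eight candidate sources, of which only two are present in the mixture.
atoms = make_dictionary(num_atoms=8, num_samples=64)
x = [1.0 * atoms[2][n] + 0.5 * atoms[5][n] for n in range(64)]

w = sparse_decompose(x, atoms, lam=0.1)
active = [j for j, wj in enumerate(w) if abs(wj) > 1e-6]
print(active)
```

Because these toy atoms are mutually orthogonal, the L1 penalty sets the weights of the absent sources exactly to zero and shrinks the two active weights by `lam`, so the script reports sources 2 and 5 as active. Real source models overlap in time and frequency, which is where the choice of regularizer and dictionary matters.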
Index Terms
- Learning dictionaries of stable autoregressive models for audio scene analysis