Abstract
We present a multiple-pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles (cells) in \({\mathbb R}^d\), for any d > 0. Our learning model is as follows: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random access memory space. Our algorithm makes 2ℓ + 1 passes, for any ℓ > 0, and requires memory at most \(\tilde O(\epsilon^{-2/\ell}k^2d^4+(2k)^d)\). This exhibits a strong pass-space tradeoff: a few more passes significantly lower the memory requirements, thus trading one of the two most important resources in streaming computation for the other. Chang and Kannan [1] first considered this problem for d = 1, 2.
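To make the tradeoff concrete, the following minimal sketch evaluates the two dominant terms of the stated memory bound for a few values of ℓ. It is only an illustration: the constants and polylogarithmic factors hidden by the \(\tilde O\) notation are set to 1, which is an assumption, and the function names `memory_bound` and `passes` are hypothetical.

```python
import math


def passes(ell: int) -> int:
    """Number of passes stated in the abstract: 2*ell + 1."""
    return 2 * ell + 1


def memory_bound(eps: float, k: int, d: int, ell: int) -> float:
    """Dominant terms eps^(-2/ell) * k^2 * d^4 + (2k)^d of the stated bound.

    Assumption: hidden constants and polylog factors are taken to be 1,
    so the result is an order-of-magnitude illustration only.
    """
    if ell < 1:
        raise ValueError("ell must be a positive integer")
    if not 0 < eps < 1:
        raise ValueError("eps is assumed to lie in (0, 1)")
    return eps ** (-2.0 / ell) * k**2 * d**4 + (2 * k) ** d


if __name__ == "__main__":
    eps, k, d = 0.01, 3, 2  # illustrative values, not taken from the paper
    for ell in (1, 2, 4):
        bound = memory_bound(eps, k, d, ell)
        print(f"passes={passes(ell)}  memory ~ {bound:,.0f}")
```

With these illustrative values, going from 3 passes to 5 passes reduces the first term by a factor of about 1/ε, while the \((2k)^d\) term is unaffected by ℓ.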
Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but practical computation with such large inputs requires very restricted models of computation.
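The restricted access model described in the abstract can be sketched as follows. This is not the algorithm of the paper: it only simulates the input model (samples from a mixture of uniform distributions over axis-aligned rectangles, placed in arbitrary order) and a stream that allows a fixed number of sequential passes. The names `sample_mixture`, `PassLimitedStream`, and `fraction_in_rect` are hypothetical, and the in-memory list merely stands in for a stream that would be too large to store in practice.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]
Rect = Sequence[Tuple[float, float]]  # one (low, high) interval per coordinate


def sample_mixture(rects: Sequence[Rect], weights: Sequence[float],
                   n: int, rng: random.Random) -> List[Point]:
    """Draw n samples from a mixture of uniform distributions over rectangles."""
    samples = []
    for _ in range(n):
        rect = rng.choices(rects, weights=weights, k=1)[0]
        samples.append(tuple(rng.uniform(lo, hi) for lo, hi in rect))
    rng.shuffle(samples)  # the model allows an arbitrary order
    return samples


class PassLimitedStream:
    """Sequential-only access to the samples, with a budget of passes."""

    def __init__(self, data: Sequence[Point], max_passes: int) -> None:
        self._data = data
        self._max_passes = max_passes
        self.passes_used = 0

    def scan(self) -> Iterator[Point]:
        """Start one pass; raises once the pass budget is exhausted."""
        if self.passes_used >= self._max_passes:
            raise RuntimeError("pass budget exhausted")
        self.passes_used += 1
        return iter(self._data)


def fraction_in_rect(stream: PassLimitedStream, rect: Rect) -> float:
    """One pass, two counters: fraction of samples falling inside rect."""
    inside = total = 0
    for point in stream.scan():
        total += 1
        if all(lo <= x <= hi for x, (lo, hi) in zip(point, rect)):
            inside += 1
    return inside / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    rects = [[(0.0, 1.0), (0.0, 1.0)], [(2.0, 3.0), (2.0, 3.0)]]
    data = sample_mixture(rects, [0.5, 0.5], 1000, rng)
    stream = PassLimitedStream(data, max_passes=3)
    print(fraction_in_rect(stream, rects[0]), stream.passes_used)
```

The single-pass counter uses memory independent of the stream length, which is the kind of constraint the model imposes; an actual learner would have to discover the rectangles rather than be given them.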
References
Chang, K., Kannan, R.: The space complexity of pass-efficient algorithms for clustering. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1157–1166 (2006)
Munro, J.I., Paterson, M.: Selection and sorting with limited storage. Theoretical Computer Science 12, 315–323 (1980)
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 137–147 (1999)
Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the Association for Computing Machinery 53, 307–323 (2006)
Arora, S., Kannan, R.: Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability 15, 69–92 (2005)
Dasgupta, S.: Learning mixtures of Gaussians. In: Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pp. 634–644. IEEE Computer Society Press, Los Alamitos (1999)
Kannan, R., Salmasian, H., Vempala, S.: The spectral method for general mixture models. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 444–457. Springer, Heidelberg (2005)
Vempala, S., Wang, G.: A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences 68, 841–860 (2004)
Dasgupta, A., Hopcroft, J.E., Kleinberg, J.M., Sandler, M.: On learning mixtures of heavy-tailed distributions. In: Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pp. 491–500. IEEE Computer Society Press, Los Alamitos (2005)
Gilbert, A.C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Fast, small-space algorithms for approximate histogram maintenance. In: Proceedings of the 34th Annual ACM Symposium on the Theory of Computing, pp. 389–398. ACM Press, New York (2002)
Thaper, N., Guha, S., Indyk, P., Koudas, N.: Dynamic multidimensional histograms. In: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pp. 428–439. ACM Press, New York, NY, USA (2002)
Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 733–742 (2006)
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, K.L. (2007). Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in \({\mathbb R}^d\). In: Hutter, M., Servedio, R.A., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science, vol 4754. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75225-7_19
DOI: https://doi.org/10.1007/978-3-540-75225-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75224-0
Online ISBN: 978-3-540-75225-7
eBook Packages: Computer Science (R0)