Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in ${\mathbb R}^d$

Chang, Kevin L.

doi:10.1007/978-3-540-75225-7_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4754)

Included in the following conference series:

International Conference on Algorithmic Learning Theory

Abstract

We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles (cells) in ${\mathbb R}^d$, for any d > 0. Our learning model is as follows: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random access memory space. Our algorithm makes 2ℓ + 1 passes, for any ℓ > 0, and requires memory at most $\tilde O(\epsilon^{-2/\ell}k^2d^4+(2k)^d)$. This exhibits a strong pass-space tradeoff: a few more passes significantly lower the memory requirement, thus trading one of the two most important resources in streaming computation for the other. Chang and Kannan [1] first considered this problem for d = 1, 2.

Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but practical computation with such large inputs requires very restricted models of computation.
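To make the tradeoff in the memory bound concrete, the following minimal sketch evaluates the expression $\epsilon^{-2/\ell}k^2d^4+(2k)^d$ for a few values of ℓ. It is only an illustration of the formula, not the algorithm itself: the function names `memory_bound` and `num_passes` and the sample parameter values are assumptions made here, and the polylogarithmic factors hidden by the $\tilde O$ notation are ignored.

```python
def num_passes(ell: int) -> int:
    """Number of passes stated in the abstract: 2*ell + 1."""
    return 2 * ell + 1


def memory_bound(eps: float, k: int, d: int, ell: int) -> float:
    """Evaluate eps^(-2/ell) * k^2 * d^4 + (2k)^d.

    Polylog factors hidden by the tilde-O are ignored, so the result
    only indicates how the bound scales, not an actual memory size.
    """
    return eps ** (-2.0 / ell) * k**2 * d**4 + (2 * k) ** d


if __name__ == "__main__":
    # Hypothetical parameters chosen purely for illustration.
    eps, k, d = 0.01, 3, 2
    for ell in (1, 2, 4):
        print(f"passes={num_passes(ell)}  bound={memory_bound(eps, k, d, ell):.0f}")
```

With ε = 0.01, the ε-dependent factor falls from 10^4 at ℓ = 1 (3 passes) to 10^2 at ℓ = 2 (5 passes), while the $(2k)^d$ term is unaffected by the number of passes.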

References

1. Chang, K., Kannan, R.: The space complexity of pass-efficient algorithms for clustering. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1157–1166 (2006)
2. Munro, J.I., Paterson, M.: Selection and sorting with limited storage. Theoretical Computer Science 12, 315–323 (1980)
3. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. Journal of Computer and System Sciences 58, 137–147 (1999)
4. Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the Association for Computing Machinery 53, 307–323 (2006)
5. Arora, S., Kannan, R.: Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability 15, 69–92 (2005)
6. Dasgupta, S.: Learning mixtures of Gaussians. In: Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pp. 634–644. IEEE Computer Society Press, Los Alamitos (1999)
7. Kannan, R., Salmasian, H., Vempala, S.: The spectral method for general mixture models. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 444–457. Springer, Heidelberg (2005)
8. Vempala, S., Wang, G.: A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences 68, 841–860 (2004)
9. Dasgupta, A., Hopcroft, J.E., Kleinberg, J.M., Sandler, M.: On learning mixtures of heavy-tailed distributions. In: Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pp. 491–500. IEEE Computer Society Press, Los Alamitos (2005)
10. Gilbert, A.C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Fast, small-space algorithms for approximate histogram maintenance. In: Proceedings of the 34th Annual ACM Symposium on the Theory of Computing, pp. 389–398. ACM Press, New York (2002)
11. Thaper, N., Guha, S., Indyk, P., Koudas, N.: Dynamic multidimensional histograms. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 428–439. ACM Press, New York (2002)
12. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 733–742 (2006)

Author information

Authors and Affiliations

Max Planck Institute for Computer Science, Saarbrücken, Germany
Kevin L. Chang

Editor information

Editors and Affiliations

RSISE @ ANU and SML @ NICTA, Canberra, ACT, 0200, Australia
Marcus Hutter
Columbia University, NY, P.O. Box, New York, USA
Rocco A. Servedio
Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
Eiji Takimoto

Cite this paper

Chang, K.L. (2007). Multiple Pass Streaming Algorithms for Learning Mixtures of Distributions in ${\mathbb R}^d$. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science, vol 4754. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75225-7_19

DOI: https://doi.org/10.1007/978-3-540-75225-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75224-0
Online ISBN: 978-3-540-75225-7
eBook Packages: Computer Science, Computer Science (R0)