DOI: 10.1145/2803140.2803143
research-article

Gaussian Mixture Models Use-Case: In-Memory Analysis with Myria

Published: 31 August 2015

ABSTRACT

In our work with scientists, we find that Gaussian Mixture Modeling is a common type of analysis applied to increasingly large datasets. We implement this algorithm in the Myria shared-nothing relational data management system, which performs the computation in memory. We study the resulting memory utilization challenges and implement several optimizations that yield an efficient and scalable solution. Empirical evaluations on large astronomy and oceanography datasets confirm that our Myria approach scales well and runs up to an order of magnitude faster than Hadoop.


  • Published in

    ACM Other conferences
    IMDM '15: Proceedings of the 3rd VLDB Workshop on In-Memory Data Management and Analytics
    August 2015
    63 pages
    ISBN:9781450337137
    DOI:10.1145/2803140

    Copyright © 2015 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited
