DOI: 10.1145/2463676.2465273
research-article

Cumulon: optimizing statistical data analysis in the cloud

Published: 22 June 2013

Abstract

We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding limitations of MapReduce, and demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud according to time/budget constraints, Cumulon goes beyond database-style optimization to make choices automatically on not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.
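The idea of searching a space of deployment plans under a time or budget constraint can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the machine types, prices, the `estimate_hours` time model, and whole-hour billing are hypothetical and are not taken from Cumulon, whose models are built from benchmarking and simulation. The sketch only shows the general shape of the problem: enumerate candidate plans, predict completion time, discard plans that miss the deadline, and keep the cheapest of the rest.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Optional


@dataclass(frozen=True)
class MachineType:
    name: str
    price_per_hour: float  # hypothetical price in dollars per node-hour
    speed: float           # hypothetical work units per hour per node


@dataclass(frozen=True)
class Plan:
    machine: MachineType
    nodes: int
    hours: float  # predicted completion time
    cost: float   # predicted monetary cost


def estimate_hours(work: float, machine: MachineType, nodes: int,
                   overhead_per_node: float = 0.01) -> float:
    """Toy time model (an assumption, not Cumulon's): perfectly parallel
    work plus a coordination overhead that grows with cluster size."""
    return work / (machine.speed * nodes) + overhead_per_node * nodes


def cheapest_plan(work: float, machines: Iterable[MachineType],
                  node_counts: Iterable[int],
                  deadline_hours: float) -> Optional[Plan]:
    """Exhaustively search (machine type, cluster size) pairs and return
    the cheapest plan that meets the deadline, or None if none does."""
    best: Optional[Plan] = None
    for machine, nodes in product(list(machines), list(node_counts)):
        hours = estimate_hours(work, machine, nodes)
        if hours > deadline_hours:
            continue  # plan violates the time constraint
        # Assumed billing rule: each node is charged per started hour.
        cost = math.ceil(hours) * nodes * machine.price_per_hour
        if best is None or cost < best.cost:
            best = Plan(machine, nodes, hours, cost)
    return best


if __name__ == "__main__":
    machines = [
        MachineType("small", price_per_hour=0.1, speed=1.0),
        MachineType("large", price_per_hour=0.5, speed=4.0),
    ]
    plan = cheapest_plan(work=100.0, machines=machines,
                         node_counts=range(1, 51), deadline_hours=5.0)
    print(plan)
```

Exhaustive enumeration is only workable here because the toy space is tiny. The abstract describes a much richer space that also covers physical operators, their parameters, and configuration settings.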

Cited By

  • (2025) LCP: Enhancing Scientific Data Management with Lossy Compression for Particles. Proceedings of the ACM on Management of Data 3:1 (1-27). DOI: 10.1145/3709700. Online publication date: 11-Feb-2025
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data (53-59). DOI: 10.1145/3555041.3589407. Online publication date: 4-Jun-2023
  • (2023) SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42:1 (73-102). DOI: 10.1007/s10619-023-07436-y. Online publication date: 22-Jun-2023

Information & Contributors

Information

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN:9781450320375
DOI:10.1145/2463676

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud
  2. data parallelism
  3. linear algebra
  4. statistical computing

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'13

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
