DOI: 10.1145/2694344.2694351

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Published: 14 March 2015

Abstract

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and accepting and running a precise and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theories to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.
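As a rough illustration of how input data sampling can be paired with a statistical error bound, the sketch below estimates a sum from a simple random sample and reports an approximate 95% confidence margin. It is not the estimator used by ApproxHadoop; the function name `approximate_sum`, its parameters, and the choice of simple random sampling with a normal approximation are all assumptions made for this example.

```python
import math
import random
import statistics


def approximate_sum(values, sample_fraction, z=1.96, seed=0):
    """Estimate sum(values) from a simple random sample without replacement.

    Returns (estimate, margin): the point estimate and the half-width of an
    approximate confidence interval (z=1.96 corresponds to roughly 95%).
    Assumes `values` is a list with at least two elements.
    """
    population_size = len(values)
    # Keep at least two items so the sample variance is defined.
    sample_size = max(2, round(population_size * sample_fraction))
    sample_size = min(sample_size, population_size)
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)

    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)  # sample variance (n - 1 denominator)

    # Finite population correction: the error shrinks to zero when all data is read.
    fpc = 1.0 - sample_size / population_size
    std_error = population_size * math.sqrt(fpc * variance / sample_size)
    return population_size * mean, z * std_error


# Hypothetical usage: read 1% of the input and report the estimate with its margin.
data = [float(i % 100) for i in range(100_000)]
estimate, margin = approximate_sum(data, sample_fraction=0.01)
print(f"{estimate:.0f} +/- {margin:.0f} (exact: {sum(data):.0f})")
```

In this toy setting, a user who can tolerate a larger margin can lower `sample_fraction` and process less input, which is the kind of accuracy-for-time trade-off the abstract describes.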


Information & Contributors

Information

Published In

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN:9781450328357
DOI:10.1145/2694344
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MapReduce
  2. approximation
  3. extreme value theory
  4. multi-stage sampling

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

ASPLOS '15

Acceptance Rates

ASPLOS '15 Paper Acceptance Rate: 48 of 287 submissions, 17%
Overall Acceptance Rate: 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 2
Download counts reflect activity up to 20 Feb 2025.

Cited By

  • (2025) Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications. ACM Computing Surveys 57:7 (1-36). DOI: 10.1145/3711683. Online publication date: 20-Feb-2025
  • (2025) Strainer: Windowing-Based Advanced Sampling in Stream Processing Systems. Economics of Grids, Clouds, Systems, and Services (275-285). DOI: 10.1007/978-3-031-81226-2_24. Online publication date: 6-Feb-2025
  • (2024) Approximate caching for efficiently serving text-to-image diffusion models. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (1173-1189). DOI: 10.5555/3691825.3691890. Online publication date: 16-Apr-2024
  • (2024) A Survey on Design Space Exploration Approaches for Approximate Computing Systems. Electronics 13:22 (4442). DOI: 10.3390/electronics13224442. Online publication date: 13-Nov-2024
  • (2024) DiApprox: Differential Privacy-based Online Range Queries Approximation for Multidimensional Data. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (337-344). DOI: 10.1145/3605098.3636070. Online publication date: 8-Apr-2024
  • (2024) Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration. IEEE Transactions on Knowledge and Data Engineering 36:11 (6532-6546). DOI: 10.1109/TKDE.2023.3341451. Online publication date: Nov-2024
  • (2024) HPAC-ML: A Programming Model for Embedding ML Surrogates in Scientific Applications. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-16). DOI: 10.1109/SC41406.2024.00078. Online publication date: 17-Nov-2024
  • (2024) Exact and Approximate Tasks Computation in IoT Networks. IEEE Internet of Things Journal 11:5 (7974-7988). DOI: 10.1109/JIOT.2023.3316699. Online publication date: 1-Mar-2024
  • (2023) Polygon Simplification for the Efficient Approximate Analytics of Georeferenced Big Data. Sensors 23:19 (8178). DOI: 10.3390/s23198178. Online publication date: 29-Sep-2023
  • (2023) Auto-HPCnet: An Automatic Framework to Build Neural Network-based Surrogate for High-Performance Computing Applications. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (31-44). DOI: 10.1145/3588195.3592985. Online publication date: 7-Aug-2023
