research-article

ApproxHadoop: Bringing Approximations to MapReduce Frameworks

Authors:

Íñigo Goiri,

Ricardo Bianchini,

Santosh Nagarakatte,

Thu D. Nguyen

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 383 - 397

https://doi.org/10.1145/2694344.2694351

Published: 14 March 2015

Abstract

We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and accepting and running a precise and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theories to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.
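
To make the idea of input data sampling with an error bound concrete, the following is a minimal sketch that estimates a sum from a sample and reports a margin of error at a chosen confidence level. It assumes simple random sampling without replacement and a normal approximation, which is not necessarily the estimator the paper derives; the function name `estimate_total` and its parameters are hypothetical and are not part of ApproxHadoop.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def estimate_total(sample, population_size, confidence=0.95):
    """Estimate a population total from a simple random sample.

    Returns (estimate, margin) so that the approximate confidence
    interval is estimate +/- margin. Assumes sampling without
    replacement and a normal approximation (an illustrative choice,
    not the paper's derivation).
    """
    n = len(sample)
    if not 2 <= n <= population_size:
        raise ValueError("need 2 <= len(sample) <= population_size")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Finite population correction: the error shrinks to zero
    # as the sample approaches the whole input.
    fpc = 1 - n / population_size
    std_err = population_size * math.sqrt(fpc * variance(sample) / n)
    return population_size * fmean(sample), z * std_err


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for repeatability
    data = [rng.uniform(0, 100) for _ in range(100_000)]
    sample = rng.sample(data, 1_000)  # process only 1% of the input
    est, margin = estimate_total(sample, len(data))
    print(f"exact={sum(data):.0f} estimate={est:.0f} +/- {margin:.0f}")
```

In this sketch, a user-specified error target would translate into choosing the smallest sample size whose margin falls under the target, which is the kind of trade-off between runtime and accuracy the abstract describes.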

Published In

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

March 2015

720 pages

ISBN: 9781450328357

DOI: 10.1145/2694344

General Chairs: Ozcan Ozturk (Bilkent University, Turkey), Kemal Ebcioglu (Global Supercomputing, USA)

Program Chair: Sandhya Dwarkadas (University of Rochester, USA)

ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15)
April 2015
676 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775054
Editor: Andy Gill (University of Kansas, Lawrence, KS)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 1 (ASPLOS '15)
March 2015
676 pages
ISSN: 0163-5964
DOI: 10.1145/2786763
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF

Conference

ASPLOS '15: Architectural Support for Programming Languages and Operating Systems

March 14 - 18, 2015

Istanbul, Turkey

Acceptance Rates

ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics

Article Metrics

Total Citations: 144
Total Downloads: 896
Downloads (Last 12 months): 27
Downloads (Last 6 weeks): 2

Reflects downloads up to 20 Feb 2025

Cited By

Leon V, Hanif M, Armeniakos G, Jiao X, Shafique M, Pekmestzi K, Soudris D (2025). Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications. ACM Computing Surveys, 57:7 (1-36). Online publication date: 20-Feb-2025
https://dl.acm.org/doi/10.1145/3711683
Koevski N, Esteves S, Veiga L (2025). Strainer: Windowing-Based Advanced Sampling in Stream Processing Systems. Economics of Grids, Clouds, Systems, and Services (275-285). Online publication date: 6-Feb-2025
https://doi.org/10.1007/978-3-031-81226-2_24
Agarwal S, Mitra S, Chakraborty S, Karanam S, Mukherjee K, Saini S, Vanbever L, Zhang I (2024). Approximate caching for efficiently serving text-to-image diffusion models. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (1173-1189). Online publication date: 16-Apr-2024
https://dl.acm.org/doi/10.5555/3691825.3691890
Saeedi S, Piri A, Deveautour B, O’Connor I, Bosio A, Savino A, Di Carlo S (2024). A Survey on Design Space Exploration Approaches for Approximate Computing Systems. Electronics, 13:22 (4442). Online publication date: 13-Nov-2024
https://doi.org/10.3390/electronics13224442
Laouir A, Imine A, Hong J, Park J (2024). DiApprox: Differential Privacy-based Online Range Queries Approximation for Multidimensional Data. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (337-344). Online publication date: 8-Apr-2024
https://dl.acm.org/doi/10.1145/3605098.3636070
Zhang H, Jing Y, He Z, Zhang K, Wang X (2024). Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data Exploration. IEEE Transactions on Knowledge and Data Engineering, 36:11 (6532-6546). Online publication date: Nov-2024
https://doi.org/10.1109/TKDE.2023.3341451
Fink Z, Parasyris K, Rathi P, Georgakoudis G, Menon H, Bremer P (2024). HPAC-ML: A Programming Model for Embedding ML Surrogates in Scientific Applications. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-16). Online publication date: 17-Nov-2024
https://doi.org/10.1109/SC41406.2024.00078
Cui Y, Chin K, Soh S, Ros M (2024). Exact and Approximate Tasks Computation in IoT Networks. IEEE Internet of Things Journal, 11:5 (7974-7988). Online publication date: 1-Mar-2024
https://doi.org/10.1109/JIOT.2023.3316699
Al Jawarneh I, Foschini L, Bellavista P (2023). Polygon Simplification for the Efficient Approximate Analytics of Georeferenced Big Data. Sensors, 23:19 (8178). Online publication date: 29-Sep-2023
https://doi.org/10.3390/s23198178
Dong W, Kestor G, Li D, Butt A, Mi N, Chard K (2023). Auto-HPCnet: An Automatic Framework to Build Neural Network-based Surrogate for High-Performance Computing Applications. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (31-44). Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3588195.3592985
