Optimization of Massively Parallel Data Flows

Hueske, Fabian; Markl, Volker

doi:10.1007/978-1-4614-9242-9_2

Chapter
First Online: 28 November 2013

Abstract

Massively parallel data analysis is an emerging research topic that is motivated by the continuous growth of data sets and the rising complexity of data analysis tasks. To facilitate the analysis of big data, several parallel data processing frameworks, such as MapReduce and parallel data flow processors, have emerged. However, implementing and tuning parallel data analysis tasks requires expert knowledge and is very time-consuming and costly. Higher-level abstraction frameworks have been designed to ease the definition of analysis tasks. Optimizers can automatically generate efficient parallel execution plans from higher-level task definitions. Therefore, optimization is a crucial technology for massively parallel data analysis. This chapter presents the state of the art in optimization of parallel data flows. It covers higher-level languages for MapReduce, approaches to optimize plain MapReduce jobs, and optimization for parallel data flow systems. The optimization capabilities of those approaches are discussed and compared with each other. The chapter concludes with directions for future research on parallel data flow optimization.

Notes

1. The optimal join order depends on the choice of the physical operators. Therefore, join ordering is done as part of the physical optimization, although it is a logical rewrite.
2. Hadoop’s implementation varies from the original paper by performing partial sorts already within the Map task. Subsequently, the Reduce task merges the sorted buckets.
3. Due to lack of space, we do not explain the execution of the optional Combiner. Instead, we refer the reader to the original paper [26].
4. Process is equivalent to Map.
5. This is true for the pure programming model, not necessarily for its implementations, such as Hadoop.
6. User-defined functions (UDFs) incorporate semantics a query optimizer cannot reason about.

References

1. Abhirama, M., Bhaumik, S., Dey, A., Shrimal, H., Haritsa, J.R.: On the stability of plan costs and the costs of plan stability. PVLDB 3(1), 1137–1148 (2010)
2. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT, Lausanne, pp. 99–110 (2010)
3. Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in SQL databases. In: VLDB, Cairo, pp. 496–505 (2000)
4. Agrawal, P., Kifer, D., Olston, C.: Scheduling shared scans of large data files. PVLDB 1(1), 958–969 (2008)
5. Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Massively parallel data analysis with pacts on nephele. PVLDB 3(2), 1625–1628 (2010)
6. Alexandrov, A., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Mapreduce and pact – comparing data parallel programming models. In: BTW, Kaiserslautern, pp. 25–44 (2011)
7. Apache Hadoop: http://hadoop.apache.org
8. Apache Hive: http://hive.apache.org
9. Apache Mahout: http://mahout.apache.org
10. Apache PIG: http://pig.apache.org
11. Asterix: A highly scalable parallel platform for semi-structured data management and analysis. http://asterix.ics.uci.edu
12. Babcock, B., Chaudhuri, S.: Towards a robust query optimizer: a principled and practical approach. In: SIGMOD conference, Baltimore, pp. 119–130 (2005)
13. Babu, S.: Towards automatic optimization of mapreduce programs. In: SoCC, Indianapolis, pp. 137–142 (2010)
14. Babu, S., Bizarro, P., DeWitt, D.J.: Proactive re-optimization with rio. In: SIGMOD conference, Baltimore, pp. 936–938 (2005)
15. Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: SoCC’10: Proceedings of the ACM Symposium on Cloud Computing, Indianapolis, pp. 119–130. ACM, New York (2010)
16. Behm, A., Borkar, V.R., Carey, M.J., Grover, R., Li, C., Onose, N., Vernica, R., Deutsch, A., Papakonstantinou, Y., Tsotras, V.J.: Asterix: towards a scalable, semistructured data platform for evolving-world models. Distrib. Parallel Databases 29(3), 185–216 (2011)
17. Bernstein, P.A., Goodman, N., Wong, E., Reeve, C.L., Rothnie, J.B., Jr.: Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6(4), 602–625 (1981)
18. Beyer, K., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.C., Ozcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4, 1272–1283 (2011)
19. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: SIGMOD conference, Indianapolis, pp. 975–986 (2010)
20. Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: ICDE, Hannover, pp. 1151–1162 (2011)
21. Bryant, R.E.: Data-intensive supercomputing: the case for disc. Tech. Rep. CMU-CS-07-128, School of Computer Science, Carnegie Mellon University (2007)
22. Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB, Indianapolis (2010)
23. Cascading: http://www.cascading.org/
24. Chaiken, R., Jenkins, B., Larson, P.Å., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)
25. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: Mad skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)
26. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, San Francisco, pp. 137–150 (2004)
27. DeWitt, D.J., Gray, J.: Parallel database systems: the future of high performance database systems. Commun. ACM 35(6), 85–98 (1992)
28. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
29. Dryad – Microsoft Research: http://research.microsoft.com/projects/Dryad
30. DryadLINQ – Microsoft Research: http://research.microsoft.com/projects/DryadLINQ
31. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: Cohadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011)
32. Fender, P., Moerkotte, G.: A new, highly efficient, and easy to implement top-down join enumeration algorithm. In: ICDE, Hannover, pp. 864–875 (2011)
33. Floratou, A., Patel, J.M., Shekita, E.J., Tata, S.: Column-oriented storage techniques for mapreduce. PVLDB 4(7), 419–429 (2011)
34. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of mapreduce: the pig experience. PVLDB 2(2), 1414–1425 (2009)
35. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP, Bolton Landing, New York, pp. 29–43 (2003)
36. Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)
37. Graefe, G.: A generalized join algorithm. In: BTW, Kaiserslautern, pp. 267–286 (2011)
38. Graefe, G., Ward, K.: Dynamic query evaluation plans. In: Proceedings of the 1989 ACM SIGMOD International conference on Management of Data, SIGMOD ’89, Portland, pp. 358–366. ACM, New York (1989)
39. Gupta, A., Sudarshan, S., Viswanathan, S.: Query scheduling in multi query optimization. In: IDEAS, Grenoble, pp. 11–19 (2001)
40. Haas, L.M., Freytag, J.C., Lohman, G.M., Pirahesh, H.: Extensible query processing in starburst. In: SIGMOD conference, Portland, pp. 377–388 (1989)
41. Herodotou, H.: Hadoop performance models. Tech. rep., Duke Computer Science (2010). http://www.cs.duke.edu/~hero/files/hadoop-models.pdf
42. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB 4, 1111–1122 (2011)
43. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: CIDR, Asilomar, pp. 261–272 (2011)
44. Isard, M., Yu, Y.: Distributed data-parallel computing using a high-level programming language. In: SIGMOD conference, Providence, pp. 987–994 (2009)
45. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, Lisbon, pp. 59–72 (2007)
46. Jahani, E., Cafarella, M.J., Ré, C.: Automatic optimization for mapreduce programs. PVLDB 4(6), 385–396 (2011)
47. Kossmann, D.: The state of the art in distributed query processing. ACM Comput. Surv. 32(4), 422–469 (2000)
48. Lin, Y., Agrawal, D., Chen, C., Ooi, B.C., Wu, S.: Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In: SIGMOD conference, Athens, pp. 961–972 (2011)
49. Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H.: Robust query processing through progressive optimization. In: SIGMOD conference, Paris, pp. 659–670 (2004)
50. Mehta, M., DeWitt, D.J.: Data placement in shared-nothing parallel database systems. VLDB J. 6(1), 53–72 (1997)
51. Moerkotte, G., Neumann, T.: Dynamic programming strikes back. In: SIGMOD conference, Vancouver, pp. 539–552 (2008)
52. Nippl, C., Mitschang, B.: Topaz: a cost-based, rule-driven, multi-phase parallelizer. In: VLDB, New York City, pp. 251–262 (1998)
53. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: Mrshare: sharing across multiple queries in mapreduce. PVLDB 3(1), 494–505 (2010)
54. Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, Boston, pp. 267–273 (2008)
55. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD conference, Vancouver, pp. 1099–1110 (2008)
56. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD conference, Providence, pp. 165–178 (2009)
57. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)
58. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD conference, Boston, pp. 23–34 (1979)
59. Sellis, T.K.: Multiple-query optimization. ACM Trans. Database Syst. 13(1), 23–52 (1988)
60. Szalay, A., Gray, J.: Science in an exponential world. Nature 440(23), 413–414 (2006)
61. The Stratosphere Project: http://stratosphere.eu
62. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive – a warehousing solution over a map-reduce framework. PVLDB 2(2), 1626–1629 (2009)
63. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive – a petabyte scale data warehouse using hadoop. In: ICDE, Long Beach, pp. 996–1005 (2010)
64. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: SC-MTAGS, Portland (2009)
65. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, San Diego, pp. 1–14 (2008)
66. Zhou, J., Larson, P.Å., Chaiken, R.: Incorporating partitioning and parallel plans into the scope optimizer. In: ICDE, Long Beach, pp. 1060–1071 (2010)

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Fabian Hueske & Volker Markl

Corresponding author

Correspondence to Fabian Hueske.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Ireland
Aris Gkoulalas-Divanis
IBM Research - Zurich, Rüschlikon, Switzerland
Abderrahim Labbi

About this chapter

Cite this chapter

Hueske, F., Markl, V. (2014). Optimization of Massively Parallel Data Flows. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_2

DOI: https://doi.org/10.1007/978-1-4614-9242-9_2
Published: 28 November 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9241-2
Online ISBN: 978-1-4614-9242-9
eBook Packages: Computer Science (R0)
