survey

Handling Iterations in Distributed Dataflow Systems

Authors:

Gábor E. Gévay,

Volker Markl

ACM Computing Surveys (CSUR), Volume 54, Issue 9

Article No.: 199, Pages 1 - 38

https://doi.org/10.1145/3477602

Published: 08 October 2021

Abstract

Over the past decade, distributed dataflow systems (DDS) have become a standard technology. In these systems, users write programs in restricted dataflow programming models, such as MapReduce, which enable them to scale out program execution to a shared-nothing cluster of machines. Yet there is no established consensus on how to extend these programming models to support iterative algorithms. In this survey, we review the research literature and identify how DDS handle control flow, such as iteration, from both the programming model and execution level perspectives. This survey will be of interest to both users and designers of DDS.

References

[1]

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 265–283.

Digital Library

[2]

Divy Agrawal, Mouhamadou Lamine Ba, Laure Berti-Équille, Sanjay Chawla, Ahmed K. Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Mohammed J. Zaki. 2016. Rheem: Enabling multi-platform task execution. In Proceedings of the SIGMOD Conference, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). 2069–2072.

[3]

Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed K. Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling cross-platform data processing - may the big data be with you!PVLDB 11, 11 (2018), 1414–1427.

[4]

Divy Agrawal, Sanjay Chawla, Ahmed K. Elmagarmid, Zoi Kaoudi, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Mohammed J. Zaki. 2016. Road to freedom in big data analytics. In Proceedings of the International Conference on Extending Database Technology. 479–484.

[5]

Rakesh Agrawal. 1988. Alpha: An extension of relational algebra to express a class of recursive queries. IEEE Trans. Softw. Eng. 14, 7 (1988), 879–885.

Digital Library

[6]

Alfred V. Aho and Jeffrey D. Ullman. 1979. Universality of data retrieval languages. In Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POP’79). ACM, 110–119.

[7]

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, and Alexander Belopolsky. 2016. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

[8]

Alexander Alexandrov. 2019. Representations and Optimizations for Embedded Parallel Dataflow Languages. Ph.D. Dissertation. Technische Universität Berlin.

[9]

Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939–964.

Digital Library

[10]

Alexander Alexandrov, Georgi Krastev, and Volker Markl. 2019. Representations and optimizations for embedded parallel dataflow languages. ACM Trans. Datab. Syst. 44, 1 (2019), 1–44.

Digital Library

[11]

Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, and Volker Markl. 2015. Implicit parallelism through deep language embedding. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 47–61.

Digital Library

[12]

Tiago A. O. Alves, Leandro A. J. Marzulo, Felipe M. G. França, and Vítor Santos Costa. 2011. Trebuchet: Exploring TLP with dataflow virtualisation. Int. J. High Perf. Syst. Archit. 3, 2–3 (2011), 137–148.

[13]

Gabriel Aranda, Susana Nieva, Fernando Sáenz-Pérez, and Jaime Sánchez-Hernández. 2013. Formalizing a broader recursion coverage in SQL. In Proceedings of the International Symposium on Practical Aspects of Declarative Languages. Springer, 93–108.

Digital Library

[14]

Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1383–1394.

Digital Library

[15]

Arvind, Rishiyur S. Nikhil. 1990. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput. 39, 3 (1990), 300–318.

Digital Library

[16]

Francois Bancilhon and Raghu Ramakrishnan. 1989. An amateur’s introduction to recursive query processing strategies. In Readings in Artificial Intelligence and Databases. Elsevier, 376–430.

[17]

Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, S.-M.-R. Beheshti, Ahmed Barnawi, and Sherif Sakr. 2015. Large scale graph processing systems: Survey and an experimental evaluation. Clust. Comput. 18, 3 (2015).

[18]

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2017. Julia: A fresh approach to numerical computing. SIAM Rev. 59, 1 (2017), 65–98.

Digital Library

[19]

Matthias Boehm, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter, Robert Ginthör, Kevin Innerebner, Florijan Klezin, Stefanie Lindstaedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqi, and Sebastian Benjamin Wrede. 2020. SystemDS: A declarative machine learning system for the end-to-end data science lifecycle. In Proceedings of the 10th Conference on Innovative Data Systems Research (CIDR’20).

[20]

Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative machine learning on Spark. Proc. VLDB Endow. 9, 13 (2016), 1425–1436.

Digital Library

[21]

Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the IEEE 27th International Conference on Data Engineering.

Digital Library

[22]

Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Data Eng. Bull. 35, 2 (2012).

[23]

Yingyi Bu, Vinayak Borkar, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Scaling datalog for machine learning on big data. arXiv preprint arXiv:1203.0160 (2012).

[24]

Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1–2 (2010), 285–296.

Digital Library

[25]

Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2012. The HaLoop approach to large-scale iterative data analysis. VLDB J. 21, 2 (2012), 169–190.

Digital Library

[26]

Eugene Burmako. 2013. Scala macros: Let our powers combine!: on how rich syntax and static types work with metaprogramming. In Proceedings of the 4th Workshop on Scala. ACM.

Digital Library

[27]

Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Technic. Commit. Data Eng. 36, 4 (2015).

[28]

Stefano Ceri, Georg Gottlob, and Letizia Tanca. 1989. What you always wanted to know about Datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1, 1 (1989), 146–166.

Digital Library

[29]

Hassan Chafi, Zach DeVito, Adriaan Moors, Tiark Rompf, Arvind K. Sujeeth, Pat Hanrahan, Martin Odersky, and Kunle Olukotun. 2010. Language virtualization for heterogeneous parallel computing. ACM SIGPLAN Not. 45, 10 (2010), 835–847.

Digital Library

[30]

Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Notices, Vol. 45. ACM, 363–375.

Digital Library

[31]

Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. In Proceedings of the International Conference on Learning Representations Workshop Track.

[32]

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Proceedings of LearningSys (2015). http://learningsys.org/papers/LearningSys_2015_paper_1.pdf.

[33]

Sarah Chlyah, Nils Gesbert, Pierre Genevès, and Nabil Layaïda. 2020. On the optimization of iterative programming with distributed data collections. (2020).

[34]

Zaheer Chothia, John Liagouris, Frank McSherry, and Timothy Roscoe. 2016. Explaining outputs in modern data analytics. Proc. VLDB Endow. 9, 12 (2016). https://hal.inria.fr/hal-02066649v5.

Digital Library

[35]

Ankur Dave. 2014. IndexedRDD. Retrieved from: https://github.com/amplab/spark-indexedrdd.

[36]

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. OSDI (2004), 137–150.

Digital Library

[37]

James M. Decker, Dan Moldovan, Guannan Wei, Vritant Bhardwaj, Gregory Essertel, Fei Wang, Alexander B. Wiltschko, and Tiark Rompf. 2018. The 800 Pound Python in the Machine Learning Room. Retrieved from: https://www.cs.purdue.edu/homes/rompf/papers/decker-preprint201811.pdf.

[38]

Christos Doulkeridis and Kjetil NØrvåg. 2014. A survey of large-scale analytical query processing in MapReduce. VLDB J. 23, 3 (2014), 355–380.

Digital Library

[39]

Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios Katsifodimos, Stephan Ewen, Kostas Tzoumas, and Volker Markl. 2015. Optimistic recovery for iterative dataflows in action. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1439–1443.

Digital Library

[40]

Christian Duta, Denis Hirn, and Torsten Grust. 2020. Compiling PL/SQL Away. In Proceedings of the 10th Conference on Innovative Data Systems Research (CIDR’20).

[41]

Andrew Eisenberg. 1996. New standard for stored procedures in SQL. ACM SIGMOD Rec. 25, 4 (1996), 81–88.

Digital Library

[42]

Andrew Eisenberg and Jim Melton. 1999. SQL: 1999, formerly known as SQL3. ACM SIGMOD Rec. 28, 1 (1999).

[43]

Jaliya Ekanayake. 2010. Architecture and performance of runtime environments for data intensive scalable computing. School Inform. Comput. Bloomington, Indiana University.

[44]

Jaliya Ekanayake, Thilina Gunarathne, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, and Roger Barga. 2009. DryadLINQ for scientific analyses. In Proceedings of the 5th IEEE International Conference on e-Science. IEEE.

Digital Library

[45]

Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 810–818.

Digital Library

[46]

Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox. 2008. MapReduce for data intensive scientific analyses. In Proceedings of the IEEE 4th International Conference on eScience. IEEE, 277–284.

Digital Library

[47]

Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, and Berthold Reinwald. 2016. Compressed linear algebra for large-scale machine learning. Proc. VLDB Endow. 9, 12 (2016), 960–971.

Digital Library

[48]

Eslam Elnikety, Tamer Elsayed, and Hany E. Ramadan. 2011. iHadoop: Asynchronous iterations for MapReduce. In Proceedings of the IEEE 3rd International Conference on Cloud Computing Technology and Science. IEEE, 81–90.

[49]

Stephan Ewen, Sebastian Schelter, Kostas Tzoumas, Daniel Warneke, and Volker Markl. 2013. Iterative parallel data processing with Stratosphere: An inside look. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1053–1056.

Digital Library

[50]

Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (2012), 1268–1279.

Digital Library

[51]

Leonidas Fegaras. 2017. An algebra for distributed big data analytics. J. Funct. Prog. 27 (2017).

[52]

Leonidas Fegaras and Md Hasanuzzaman Noor. 2018. Compile-time code generation for embedded data-intensive query languages. In Proceedings of the IEEE International Congress on Big Data (BigData Congress’18). IEEE, 1–8.

[53]

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. 2014. Making state explicit for imperative big data processing. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’14). 49–60.

[54]

Steven Feuerstein and Bill Pribyl. 2005. Oracle PL/SQL Programming. O’Reilly Media, Inc.

[55]

Martin Fowler. 2010. Domain-specific Languages. Pearson Education.

[56]

Gábor E. Gévay, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. The power of nested parallelism in big data processing—hitting three flies with one slap. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 605–618.

Digital Library

[57]

Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Loránd Madai-Tahy, and Volker Markl. 2018. Labyrinth: Compiling imperative control flow to parallel dataflows. arXiv preprint arXiv:1809.06845 (2018).

[58]

Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Loránd Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. Efficient control flow in dataflow systems: When ease-of-use meets high performance. In Proceedings of the IEEE 37th International Conference on Data Engineering (ICDE’21).

[59]

Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative machine learning on MapReduce. In Proceedings of the IEEE 27th International Conference on Data Engineering. IEEE, 231–242.

Digital Library

[60]

Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. 2015. Musketeer: All for one, one for all in data processing systems. In Proceedings of the 10th European Conference on Computer Systems. 1–16.

Digital Library

[61]

Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 17–30.

Digital Library

[62]

Jiaqi Gu, Yugo H. Watanabe, William A. Mazza, Alexander Shkapsky, Mohan Yang, Ling Ding, and Carlo Zaniolo. 2019. RaSQL: Greater power and performance for big data analytics with recursive-aggregate-SQL on Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 467–484.

Digital Library

[63]

Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, and Andrew Whitaker. 2014. Demonstration of the Myria big data management service. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.

Digital Library

[64]

Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014. An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endow. 7, 12 (2014).

Digital Library

[65]

Safiollah Heidari, Yogesh Simmhan, Rodrigo N. Calheiros, and Rajkumar Buyya. 2018. Scalable graph processing frameworks: A taxonomy and open challenges. ACM Comput. Surv. 51, 3 (2018), 60.

Digital Library

[66]

Joseph M. Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib analytics library: Or MAD skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700–1711.

Digital Library

[67]

Denis Hirn and Torsten Grust. 2020. PL/SQL without the PL. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2677–2680.

Digital Library

[68]

Denis Hirn and Torsten Grust. 2021. One with recursive is worth many GOTOs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 723–735.

Digital Library

[69]

Muhammad Imran, Gábor E. Gévay, and Volker Markl. 2020. Distributed graph analytics with datalog queries in Flink. In Proceedings of the International Workshop on Large Scale Graph Data Analytics.

[70]

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 59–72.

Digital Library

[71]

Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2020. Declarative recursive computation on an RDBMS: Or, Why you should use a database for distributed machine learning. ACM SIGMOD Rec. 49, 1 (2020), 43–50.

Digital Library

[72]

Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, and Byung-Gon Chun. 2019. JANUS: Fast and flexible deep learning via symbolic graph execution of imperative programs. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). USENIX Association, 453–468.

[73]

Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, Taebum Kim, and Byung-Gon Chun. 2019. Speculative symbolic graph execution of imperative deep learning programs. ACM SIGOPS Oper. Syst. Rev. 53, 1 (2019), 26–33.

Digital Library

[74]

Neil D. Jones. 1996. An introduction to partial evaluation. ACM Comput. Surv. 28, 3 (1996), 480–503.

Digital Library

[75]

Martin Junghanns, André Petermann, Martin Neumann, and Erhard Rahm. 2017. Management and analysis of big graph data: Current systems and open challenges. In Handbook of Big Data Technologies. Springer, 457–505.

[76]

Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Saravanan Thirumuruganathan, Sanjay Chawla, and Divy Agrawal. 2017. A cost-based optimizer for gradient descent optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 977–992.

Digital Library

[77]

Qifa Ke, Michael Isard, and Yuan Yu. 2013. Optimus: A dynamic rewriting framework for data-parallel execution plans. In Proceedings of the 8th ACM European Conference on Computer Systems. 15–28.

Digital Library

[78]

Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (1999).

Digital Library

[79]

Sebastian Kruse, Zoi Kaoudi, Bertty Contreras-Rojas, Sanjay Chawla, Felix Naumann, and Jorge-Arnulfo Quiané-Ruiz. 2020. RHEEMix in the data jungle: A cost-based optimizer for cross-platform systems. VLDB J. 29 (2020), 1287–1310.

Digital Library

[80]

Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978).

[81]

Haejoon Lee, Minseo Kang, Sun-Bum Youn, Jae-Gil Lee, and YongChul Kwon. 2016. An experimental comparison of iterative MapReduce frameworks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2089–2094.

Digital Library

[82]

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, 583–598.

Digital Library

[83]

Zhenguo Li, Yixiang Fang, Qin Liu, Jiefeng Cheng, Reynold Cheng, and John Lui. 2015. Walking in the cloud: Parallel SimRank at scale. Proc. VLDB Endow. 9, 1 (2015), 24–35.

Digital Library

[84]

Leonid Libkin. 2003. Expressive power of SQL. Theor. Comput. Sci. 296, 3 (2003), 379–404.

Digital Library

[85]

David Lion, Adrian Chiu, Hailong Sun, Xin Zhuang, Nikola Grcevski, and Ding Yuan. 2016. Don’t get caught in the cold, warm-up your JVM: Understand and eliminate JVM warm-up overhead in data-parallel systems. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 383–400.

[86]

Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8 (2012), 716–727.

Digital Library

[87]

James MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 281–297.

[88]

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 135–146.

Digital Library

[89]

Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning. Association for Computational Linguistics, 1–7.

Digital Library

[90]

Omid Mashayekhi, Hang Qu, Chinmayee Shah, and Philip Levis. 2017. Execution templates: Caching control plane decisions for strong scaling of data analytics. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’17). 513–526.

[91]

Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. 48, 2 (2015), 25.

Digital Library

[92]

Frank McSherry, Rebecca Isaacs, Michael Isard, and Derek G. Murray. 2012. Composable incremental and iterative data-parallel computation with Naiad. Microsoft Research. Technical Report. MSR-TR-2012-105. https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/naiad.pdf.

[93]

Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! But at what COST? In Proceedings of the Workshop on Hot Topics in Operating Systems. USENIX.

[94]

Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential dataflow. In Proceedings of the Conference on Innovative Data Systems Research. Retrieved from https://www.microsoft.com/en-us/research/publication/differential-dataflow/.

[95]

Erik Meijer, Brian Beckman, and Gavin Bierman. 2006. LINQ: Reconciling object, relations and XML in the .NET framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 706–706.

Digital Library

[96]

Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, delta-based data-centric computation. Proc. VLDB Endow. 5, 11 (2012), 1280–1291.

Digital Library

[97]

Dan Moldovan, James Decker, Fei Wang, Andrew Johnson, Brian Lee, Zachary Nado, D. Sculley, Tiark Rompf, and Alexander B. Wiltschko. 2019. AutoGraph: Imperative-style coding with graph-based performance. Proc. Mach. Learn. Syst. 1 (2019), 389–405.

[98]

Adriaan Moors, Tiark Rompf, Philipp Haller, and Martin Odersky. 2012. Scala-Virtualized. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. 117–120.

[99]

Derek Gordon Murray and Steven Hand. 2010. Scripting the cloud with skywriting.HotCloud 10 (2010), 12–12.

[100]

Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM.

Digital Library

[101]

Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th ACM/USENIX Symposium on Networked Systems Design and Implementation. USENIX Association, 113–126.

[102]

Shayan Najd, Sam Lindley, Josef Svenningsson, and Philip Wadler. 2016. Everything old is new again: Quoted domain-specific languages. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. ACM, 25–36.

Digital Library

[103]

Balazs Nemeth, Tom Haber, Jori Liesenborgs, and Wim Lamotte. 2020. Automatic parallelization of probabilistic models with varying load imbalance. In Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID’20). IEEE, 752–759.

[104]

Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: Distributed, low latency scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 69–84.

Digital Library

[105]

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web.Technical Report. Stanford InfoLab.

[106]

Shrideep Pallickara, Hasan Bulut, and Geoffrey Fox. 2007. Fault-tolerant reliable delivery of messages in distributed publish/subscribe systems. In Proceedings of the 4th International Conference on Autonomic Computing (ICAC’07). IEEE, 19–19.

Digital Library

[107]

Linnea Passing, Manuel Then, Nina Hubig, Harald Lang, Michael Schreier, Stephan Günnemann, Alfons Kemper, and Thomas Neumann. 2017. SQL-and Operator-centric Data Analytics in Relational Main-Memory Databases. In Proceedings of the International Conference on Extending Database Technology.

[108]

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017). In Proceedings of the Autodiff Workshop (NIPS’17). https://openreview.net/pdf?id=BJJsrmfCZ.

[109]

Oleksandr Pochayevets. 2006. BMDFM: A Hybrid Dataflow Runtime Parallelization Environment for Shared Memory Multiprocessors. Ph.D. Dissertation. Technische Universität München.

[110]

Piotr Przymus, Aleksandra Boniewicz, Marta Burzańska, and Krzysztof Stencel. 2010. Recursive query facilities in relational databases: A survey. In Database Theory and Application, Bio-Science and Bio-Technology. Springer, 89–99.

[111]

Fabrice Rastello. 2016. SSA-based Compiler Design. Springer Publishing Company, Incorporated.

[112]

Till Rohrmann, Sebastian Schelter, Tilmann Rabl, and Volker Markl. 2017. Gilbert: Declarative sparse linear algebra on massively parallel dataflow systems. In Proceedings of the Datenbanksysteme für Business, Technologie und Web (BTW’17). Gesellschaft für Informatik, Bonn, 269–288.

[113]

Tiark Rompf, Nada Amin, Adriaan Moors, Philipp Haller, and Martin Odersky. 2012. Scala-Virtualized: Linguistic reuse for deep embeddings. High.-ord. Symbol. Comput. 25, 1 (2012), 165–207.

Digital Library

[114]

Tiark Rompf and Martin Odersky. 2010. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. In Proceedings of the 9th International Conference on Generative Programming and Component Engineering. 127–136.

Digital Library

[115]

Tiark Rompf and Martin Odersky. 2012. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. Commun. ACM 55, 6 (2012), 121–130.

Digital Library

[116]

Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A compiler and runtime for heterogeneous systems. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 49–68.

Digital Library

[117]

Leonid Ryzhyk and Mihai Budiu. 2019. Differential datalog.Datalog 2 (2019), 4–5.

[118]

Afaf G. Bin Saadon and Hoda M. O. Mokhtar. 2019. Survey on iterative and incremental approaches in distributed computing environment. Int. J. Data Sci. 4, 1 (2019), 18–30.

[119]

Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. 2013. The family of MapReduce and large-scale data processing systems. ACM Comput. Surv. 46, 1 (2013), 11.

Digital Library

[120]

Semih Salihoglu and Jennifer Widom. 2014. Optimizing graph algorithms on Pregel-like systems. Proc. VLDB Endow. 7, 7 (2014), 577–588.

Digital Library

[121]

Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, and Volker Markl. 2013. All roads lead to Rome: Optimistic recovery for distributed iterative data processing. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1919–1928.

Digital Library

[122]

Jiwon Seo, Stephen Guo, and Monica S. Lam. 2013. SociaLite: Datalog extensions for efficient social network analysis. In Proceedings of the IEEE 29th International Conference on Data Engineering (ICDE’13). IEEE, 278–289.

[123]

Prateek Sharma, Tian Guo, Xin He, David Irwin, and Prashant Shenoy. 2016. Flint: Batch-interactive data-intensive processing on transient servers. In Proceedings of the 11th European Conference on Computer Systems. ACM, 6.

Digital Library

[124]

Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing large-scale Semi-Naïve datalog evaluation in Hadoop. In Proceedings of the International Datalog 2.0 Workshop. Springer, 165–176.

Digital Library

[125]

Avraham Shinnar, David Cunningham, Vijay Saraswat, and Benjamin Herta. 2012. M3R: increased performance for in-memory Hadoop jobs. Proc. VLDB Endow. 5, 12 (2012), 1736–1747.

Digital Library

[126]

Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with Datalog queries on Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[127]

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST’10). IEEE, 1–10.

Digital Library

[128]

Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1–2 (2010), 703–710.

Digital Library

[129]

Marc Snir, William Gropp, Steve Otto, Steven Huss-Lederman, Jack Dongarra, and David Walker. 1998. MPI–the Complete Reference: The MPI Core. Vol. 1. The MIT Press.

Digital Library

[130]

Emad Soroush, Magdalena Balazinska, Simon Krughoff, and Andrew Connolly. 2015. Efficient iterative processing in the SciDB parallel array engine. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management. ACM, 39.

Digital Library

[131]

Quoc-Cuong To, Juan Soto, and Volker Markl. 2018. A survey of state management in big data processing systems. VLDB J. 27, 6 (2018), 847–872.

Digital Library

[132]

Laurence Tratt. 2008. Domain specific language implementation via compile-time meta-programming. ACM Trans. Prog. Lang. Syst. 30, 6 (2008), 1–40.

Digital Library

[133]

Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (1990), 103–111.

Digital Library

[134]

Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria big data management and analytics system and cloud services. In Proceedings of the Conference on Innovative Data Systems Research.

[135]

Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and fault-tolerant recursive Datalog evaluation in shared-nothing engines. Proc. VLDB Endow. 8, 12 (2015), 1542–1553.

[136]

Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, and Ge Yu. 2020. Automating incremental and asynchronous evaluation for recursive aggregate data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2439–2454.

[137]

Haijiang Wu, Jie Liu, Tao Wang, Dan Ye, Jun Wei, and Hua Zhong. 2016. Parallel materialization of Datalog programs with Spark for scalable reasoning. In Proceedings of the International Conference on Web Information Systems Engineering. Springer.

[138]

Chenning Xie, Rong Chen, Haibing Guan, Binyu Zang, and Haibo Chen. 2015. Sync or async: Time to fuse for distributed graph-parallel computation. In ACM SIGPLAN Notices, Vol. 50. ACM, 194–204.

[139]

Chen Xu, Markus Holzemer, Manohar Kaul, Juan Soto, and Volker Markl. 2017. On fault tolerance for distributed iterative dataflow processing. IEEE Trans. Knowl. Data Eng. 29, 8 (2017), 1709–1722.

[140]

Da Yan, Yingyi Bu, Yuanyuan Tian, and Amol Deshpande. 2017. Big graph analytics platforms. Found. Trends® Datab. 7, 1–2 (2017), 1–195.

[141]

Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, and Xiaoqiang Zheng. 2018. Dynamic control flow in large-scale machine learning. In Proceedings of the 13th EuroSys Conference. ACM, 18.

[142]

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, Vol. 8. 1–14.

[143]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.

[144]

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10 (2010).

[145]

Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.

[146]

Qizhen Zhang, Akash Acharya, Hongzhi Chen, Simran Arora, Ang Chen, Vincent Liu, and Boon Thau Loo. 2019. Optimizing declarative graph queries at large scale. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1411–1428.

[147]

Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. 2011. PrIter: A distributed framework for prioritized iterative computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing. 1–14.

[148]

Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. 2012. iMapReduce: A distributed computing framework for iterative computation. J. Grid Comput. 10, 1 (2012), 47–68.

[149]

Kangfei Zhao and Jeffrey Xu Yu. 2017. All-in-one: Graph processing in RDBMSs revisited. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1165–1180.

Published In

ACM Computing Surveys Volume 54, Issue 9

December 2022

800 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3485140

Editor:
Albert Zomaya
University of Sydney, Australia

Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2021

Accepted: 01 July 2021

Revised: 01 June 2021

Received: 01 November 2020

Published in CSUR Volume 54, Issue 9

Qualifiers

Survey
Refereed

Funding Sources

German Federal Ministry of Education and Research as BIFOLD – Berlin Institute for the Foundations of Learning and Data
