
Handling Iterations in Distributed Dataflow Systems

Published: 08 October 2021

Abstract

Over the past decade, distributed dataflow systems (DDS) have become a standard technology. In these systems, users write programs in restricted dataflow programming models, such as MapReduce, which enable them to scale out program execution to a shared-nothing cluster of machines. Yet there is no established consensus on how to extend these programming models to support iterative algorithms. In this survey, we review the research literature and identify how DDS handle control flow, such as iteration, from both the programming-model and execution-level perspectives. This survey will be of interest to both users and designers of DDS.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 265–283.
[2]
Divy Agrawal, Mouhamadou Lamine Ba, Laure Berti-Équille, Sanjay Chawla, Ahmed K. Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Mohammed J. Zaki. 2016. Rheem: Enabling multi-platform task execution. In Proceedings of the SIGMOD Conference, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). 2069–2072.
[3]
Divy Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed K. Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling cross-platform data processing - may the big data be with you!PVLDB 11, 11 (2018), 1414–1427.
[4]
Divy Agrawal, Sanjay Chawla, Ahmed K. Elmagarmid, Zoi Kaoudi, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, and Mohammed J. Zaki. 2016. Road to freedom in big data analytics. In Proceedings of the International Conference on Extending Database Technology. 479–484.
[5]
Rakesh Agrawal. 1988. Alpha: An extension of relational algebra to express a class of recursive queries. IEEE Trans. Softw. Eng. 14, 7 (1988), 879–885.
[6]
Alfred V. Aho and Jeffrey D. Ullman. 1979. Universality of data retrieval languages. In Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POP’79). ACM, 110–119.
[7]
Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, and Alexander Belopolsky. 2016. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.
[8]
Alexander Alexandrov. 2019. Representations and Optimizations for Embedded Parallel Dataflow Languages. Ph.D. Dissertation. Technische Universität Berlin.
[9]
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939–964.
[10]
Alexander Alexandrov, Georgi Krastev, and Volker Markl. 2019. Representations and optimizations for embedded parallel dataflow languages. ACM Trans. Datab. Syst. 44, 1 (2019), 1–44.
[11]
Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, and Volker Markl. 2015. Implicit parallelism through deep language embedding. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 47–61.
[12]
Tiago A. O. Alves, Leandro A. J. Marzulo, Felipe M. G. França, and Vítor Santos Costa. 2011. Trebuchet: Exploring TLP with dataflow virtualisation. Int. J. High Perf. Syst. Archit. 3, 2–3 (2011), 137–148.
[13]
Gabriel Aranda, Susana Nieva, Fernando Sáenz-Pérez, and Jaime Sánchez-Hernández. 2013. Formalizing a broader recursion coverage in SQL. In Proceedings of the International Symposium on Practical Aspects of Declarative Languages. Springer, 93–108.
[14]
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1383–1394.
[15]
Arvind, Rishiyur S. Nikhil. 1990. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput. 39, 3 (1990), 300–318.
[16]
Francois Bancilhon and Raghu Ramakrishnan. 1989. An amateur’s introduction to recursive query processing strategies. In Readings in Artificial Intelligence and Databases. Elsevier, 376–430.
[17]
Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, S.-M.-R. Beheshti, Ahmed Barnawi, and Sherif Sakr. 2015. Large scale graph processing systems: Survey and an experimental evaluation. Clust. Comput. 18, 3 (2015).
[18]
Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2017. Julia: A fresh approach to numerical computing. SIAM Rev. 59, 1 (2017), 65–98.
[19]
Matthias Boehm, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter, Robert Ginthör, Kevin Innerebner, Florijan Klezin, Stefanie Lindstaedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqi, and Sebastian Benjamin Wrede. 2020. SystemDS: A declarative machine learning system for the end-to-end data science lifecycle. In Proceedings of the 10th Conference on Innovative Data Systems Research (CIDR’20).
[20]
Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative machine learning on Spark. Proc. VLDB Endow. 9, 13 (2016), 1425–1436.
[21]
Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the IEEE 27th International Conference on Data Engineering.
[22]
Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Data Eng. Bull. 35, 2 (2012).
[23]
Yingyi Bu, Vinayak Borkar, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Scaling datalog for machine learning on big data. arXiv preprint arXiv:1203.0160 (2012).
[24]
Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2010. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1–2 (2010), 285–296.
[25]
Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. 2012. The HaLoop approach to large-scale iterative data analysis. VLDB J. 21, 2 (2012), 169–190.
[26]
Eugene Burmako. 2013. Scala macros: Let our powers combine!: on how rich syntax and static types work with metaprogramming. In Proceedings of the 4th Workshop on Scala. ACM.
[27]
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Technic. Commit. Data Eng. 36, 4 (2015).
[28]
Stefano Ceri, Georg Gottlob, and Letizia Tanca. 1989. What you always wanted to know about Datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1, 1 (1989), 146–166.
[29]
Hassan Chafi, Zach DeVito, Adriaan Moors, Tiark Rompf, Arvind K. Sujeeth, Pat Hanrahan, Martin Odersky, and Kunle Olukotun. 2010. Language virtualization for heterogeneous parallel computing. ACM SIGPLAN Not. 45, 10 (2010), 835–847.
[30]
Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Notices, Vol. 45. ACM, 363–375.
[31]
Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. In Proceedings of the International Conference on Learning Representations Workshop Track.
[32]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. Proceedings of LearningSys (2015). http://learningsys.org/papers/LearningSys_2015_paper_1.pdf.
[33]
Sarah Chlyah, Nils Gesbert, Pierre Genevès, and Nabil Layaïda. 2020. On the optimization of iterative programming with distributed data collections. (2020).
[34]
Zaheer Chothia, John Liagouris, Frank McSherry, and Timothy Roscoe. 2016. Explaining outputs in modern data analytics. Proc. VLDB Endow. 9, 12 (2016). https://hal.inria.fr/hal-02066649v5.
[35]
Ankur Dave. 2014. IndexedRDD. Retrieved from: https://github.com/amplab/spark-indexedrdd.
[36]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. OSDI (2004), 137–150.
[37]
James M. Decker, Dan Moldovan, Guannan Wei, Vritant Bhardwaj, Gregory Essertel, Fei Wang, Alexander B. Wiltschko, and Tiark Rompf. 2018. The 800 Pound Python in the Machine Learning Room. Retrieved from: https://www.cs.purdue.edu/homes/rompf/papers/decker-preprint201811.pdf.
[38]
Christos Doulkeridis and Kjetil NØrvåg. 2014. A survey of large-scale analytical query processing in MapReduce. VLDB J. 23, 3 (2014), 355–380.
[39]
Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios Katsifodimos, Stephan Ewen, Kostas Tzoumas, and Volker Markl. 2015. Optimistic recovery for iterative dataflows in action. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1439–1443.
[40]
Christian Duta, Denis Hirn, and Torsten Grust. 2020. Compiling PL/SQL Away. In Proceedings of the 10th Conference on Innovative Data Systems Research (CIDR’20).
[41]
Andrew Eisenberg. 1996. New standard for stored procedures in SQL. ACM SIGMOD Rec. 25, 4 (1996), 81–88.
[42]
Andrew Eisenberg and Jim Melton. 1999. SQL: 1999, formerly known as SQL3. ACM SIGMOD Rec. 28, 1 (1999).
[43]
Jaliya Ekanayake. 2010. Architecture and performance of runtime environments for data intensive scalable computing. School Inform. Comput. Bloomington, Indiana University.
[44]
Jaliya Ekanayake, Thilina Gunarathne, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, and Roger Barga. 2009. DryadLINQ for scientific analyses. In Proceedings of the 5th IEEE International Conference on e-Science. IEEE.
[45]
Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 810–818.
[46]
Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox. 2008. MapReduce for data intensive scientific analyses. In Proceedings of the IEEE 4th International Conference on eScience. IEEE, 277–284.
[47]
Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, and Berthold Reinwald. 2016. Compressed linear algebra for large-scale machine learning. Proc. VLDB Endow. 9, 12 (2016), 960–971.
[48]
Eslam Elnikety, Tamer Elsayed, and Hany E. Ramadan. 2011. iHadoop: Asynchronous iterations for MapReduce. In Proceedings of the IEEE 3rd International Conference on Cloud Computing Technology and Science. IEEE, 81–90.
[49]
Stephan Ewen, Sebastian Schelter, Kostas Tzoumas, Daniel Warneke, and Volker Markl. 2013. Iterative parallel data processing with Stratosphere: An inside look. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1053–1056.
[50]
Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (2012), 1268–1279.
[51]
Leonidas Fegaras. 2017. An algebra for distributed big data analytics. J. Funct. Prog. 27 (2017).
[52]
Leonidas Fegaras and Md Hasanuzzaman Noor. 2018. Compile-time code generation for embedded data-intensive query languages. In Proceedings of the IEEE International Congress on Big Data (BigData Congress’18). IEEE, 1–8.
[53]
Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. 2014. Making state explicit for imperative big data processing. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’14). 49–60.
[54]
Steven Feuerstein and Bill Pribyl. 2005. Oracle PL/SQL Programming. O’Reilly Media, Inc.
[55]
Martin Fowler. 2010. Domain-specific Languages. Pearson Education.
[56]
Gábor E. Gévay, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. The power of nested parallelism in big data processing—hitting three flies with one slap. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 605–618.
[57]
Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Loránd Madai-Tahy, and Volker Markl. 2018. Labyrinth: Compiling imperative control flow to parallel dataflows. arXiv preprint arXiv:1809.06845 (2018).
[58]
Gábor E. Gévay, Tilmann Rabl, Sebastian Breß, Loránd Madai-Tahy, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2021. Efficient control flow in dataflow systems: When ease-of-use meets high performance. In Proceedings of the IEEE 37th International Conference on Data Engineering (ICDE’21).
[59]
Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative machine learning on MapReduce. In Proceedings of the IEEE 27th International Conference on Data Engineering. IEEE, 231–242.
[60]
Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. 2015. Musketeer: All for one, one for all in data processing systems. In Proceedings of the 10th European Conference on Computer Systems. 1–16.
[61]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 17–30.
[62]
Jiaqi Gu, Yugo H. Watanabe, William A. Mazza, Alexander Shkapsky, Mohan Yang, Ling Ding, and Carlo Zaniolo. 2019. RaSQL: Greater power and performance for big data analytics with recursive-aggregate-SQL on Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 467–484.
[63]
Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, and Andrew Whitaker. 2014. Demonstration of the Myria big data management service. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM.
[64]
Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. 2014. An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endow. 7, 12 (2014).
[65]
Safiollah Heidari, Yogesh Simmhan, Rodrigo N. Calheiros, and Rajkumar Buyya. 2018. Scalable graph processing frameworks: A taxonomy and open challenges. ACM Comput. Surv. 51, 3 (2018), 60.
[66]
Joseph M. Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib analytics library: Or MAD skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700–1711.
[67]
Denis Hirn and Torsten Grust. 2020. PL/SQL without the PL. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2677–2680.
[68]
Denis Hirn and Torsten Grust. 2021. One with recursive is worth many GOTOs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 723–735.
[69]
Muhammad Imran, Gábor E. Gévay, and Volker Markl. 2020. Distributed graph analytics with datalog queries in Flink. In Proceedings of the International Workshop on Large Scale Graph Data Analytics.
[70]
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 59–72.
[71]
Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2020. Declarative recursive computation on an RDBMS: Or, Why you should use a database for distributed machine learning. ACM SIGMOD Rec. 49, 1 (2020), 43–50.
[72]
Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, and Byung-Gon Chun. 2019. JANUS: Fast and flexible deep learning via symbolic graph execution of imperative programs. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). USENIX Association, 453–468.
[73]
Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, Taebum Kim, and Byung-Gon Chun. 2019. Speculative symbolic graph execution of imperative deep learning programs. ACM SIGOPS Oper. Syst. Rev. 53, 1 (2019), 26–33.
[74]
Neil D. Jones. 1996. An introduction to partial evaluation. ACM Comput. Surv. 28, 3 (1996), 480–503.
[75]
Martin Junghanns, André Petermann, Martin Neumann, and Erhard Rahm. 2017. Management and analysis of big graph data: Current systems and open challenges. In Handbook of Big Data Technologies. Springer, 457–505.
[76]
Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Saravanan Thirumuruganathan, Sanjay Chawla, and Divy Agrawal. 2017. A cost-based optimizer for gradient descent optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 977–992.
[77]
Qifa Ke, Michael Isard, and Yuan Yu. 2013. Optimus: A dynamic rewriting framework for data-parallel execution plans. In Proceedings of the 8th ACM European Conference on Computer Systems. 15–28.
[78]
Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (1999).
[79]
Sebastian Kruse, Zoi Kaoudi, Bertty Contreras-Rojas, Sanjay Chawla, Felix Naumann, and Jorge-Arnulfo Quiané-Ruiz. 2020. RHEEMix in the data jungle: A cost-based optimizer for cross-platform systems. VLDB J. 29 (2020), 1287–1310.
[80]
Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978).
[81]
Haejoon Lee, Minseo Kang, Sun-Bum Youn, Jae-Gil Lee, and YongChul Kwon. 2016. An experimental comparison of iterative MapReduce frameworks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2089–2094.
[82]
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, 583–598.
[83]
Zhenguo Li, Yixiang Fang, Qin Liu, Jiefeng Cheng, Reynold Cheng, and John Lui. 2015. Walking in the cloud: Parallel SimRank at scale. Proc. VLDB Endow. 9, 1 (2015), 24–35.
[84]
Leonid Libkin. 2003. Expressive power of SQL. Theor. Comput. Sci. 296, 3 (2003), 379–404.
[85]
David Lion, Adrian Chiu, Hailong Sun, Xin Zhuang, Nikola Grcevski, and Ding Yuan. 2016. Don’t get caught in the cold, warm-up your JVM: Understand and eliminate JVM warm-up overhead in data-parallel systems. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 383–400.
[86]
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8 (2012), 716–727.
[87]
James MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 281–297.
[88]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 135–146.
[89]
Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning. Association for Computational Linguistics, 1–7.
[90]
Omid Mashayekhi, Hang Qu, Chinmayee Shah, and Philip Levis. 2017. Execution templates: Caching control plane decisions for strong scaling of data analytics. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’17). 513–526.
[91]
Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. 48, 2 (2015), 25.
[92]
Frank McSherry, Rebecca Isaacs, Michael Isard, and Derek G. Murray. 2012. Composable incremental and iterative data-parallel computation with Naiad. Microsoft Research. Technical Report. MSR-TR-2012-105. https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/naiad.pdf.
[93]
Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! But at what COST? In Proceedings of the Workshop on Hot Topics in Operating Systems. USENIX.
[94]
Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential dataflow. In Proceedings of the Conference on Innovative Data Systems Research. Retrieved from https://www.microsoft.com/en-us/research/publication/differential-dataflow/.
[95]
Erik Meijer, Brian Beckman, and Gavin Bierman. 2006. LINQ: Reconciling object, relations and XML in the .NET framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 706–706.
[96]
Svilen R. Mihaylov, Zachary G. Ives, and Sudipto Guha. 2012. REX: Recursive, delta-based data-centric computation. Proc. VLDB Endow. 5, 11 (2012), 1280–1291.
[97]
Dan Moldovan, James Decker, Fei Wang, Andrew Johnson, Brian Lee, Zachary Nado, D. Sculley, Tiark Rompf, and Alexander B. Wiltschko. 2019. AutoGraph: Imperative-style coding with graph-based performance. Proc. Mach. Learn. Syst. 1 (2019), 389–405.
[98]
Adriaan Moors, Tiark Rompf, Philipp Haller, and Martin Odersky. 2012. Scala-Virtualized. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. 117–120.
[99]
Derek Gordon Murray and Steven Hand. 2010. Scripting the cloud with skywriting.HotCloud 10 (2010), 12–12.
[100]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM.
[101]
Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th ACM/USENIX Symposium on Networked Systems Design and Implementation. USENIX Association, 113–126.
[102]
Shayan Najd, Sam Lindley, Josef Svenningsson, and Philip Wadler. 2016. Everything old is new again: Quoted domain-specific languages. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. ACM, 25–36.
[103]
Balazs Nemeth, Tom Haber, Jori Liesenborgs, and Wim Lamotte. 2020. Automatic parallelization of probabilistic models with varying load imbalance. In Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID’20). IEEE, 752–759.
[104]
Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: Distributed, low latency scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 69–84.
[105]
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web.Technical Report. Stanford InfoLab.
[106]
Shrideep Pallickara, Hasan Bulut, and Geoffrey Fox. 2007. Fault-tolerant reliable delivery of messages in distributed publish/subscribe systems. In Proceedings of the 4th International Conference on Autonomic Computing (ICAC’07). IEEE, 19–19.
[107]
Linnea Passing, Manuel Then, Nina Hubig, Harald Lang, Michael Schreier, Stephan Günnemann, Alfons Kemper, and Thomas Neumann. 2017. SQL-and Operator-centric Data Analytics in Relational Main-Memory Databases. In Proceedings of the International Conference on Extending Database Technology.
[108]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017). In Proceedings of the Autodiff Workshop (NIPS’17). https://openreview.net/pdf?id=BJJsrmfCZ.
[109]
Oleksandr Pochayevets. 2006. BMDFM: A Hybrid Dataflow Runtime Parallelization Environment for Shared Memory Multiprocessors. Ph.D. Dissertation. Technische Universität München.
[110]
Piotr Przymus, Aleksandra Boniewicz, Marta Burzańska, and Krzysztof Stencel. 2010. Recursive query facilities in relational databases: A survey. In Database Theory and Application, Bio-Science and Bio-Technology. Springer, 89–99.
[111]
Fabrice Rastello. 2016. SSA-based Compiler Design. Springer Publishing Company, Incorporated.
[112]
Till Rohrmann, Sebastian Schelter, Tilmann Rabl, and Volker Markl. 2017. Gilbert: Declarative sparse linear algebra on massively parallel dataflow systems. In Proceedings of the Datenbanksysteme für Business, Technologie und Web (BTW’17). Gesellschaft für Informatik, Bonn, 269–288.
[113]
Tiark Rompf, Nada Amin, Adriaan Moors, Philipp Haller, and Martin Odersky. 2012. Scala-Virtualized: Linguistic reuse for deep embeddings. High.-ord. Symbol. Comput. 25, 1 (2012), 165–207.
[114]
Tiark Rompf and Martin Odersky. 2010. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. In Proceedings of the 9th International Conference on Generative Programming and Component Engineering. 127–136.
[115]
Tiark Rompf and Martin Odersky. 2012. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. Commun. ACM 55, 6 (2012), 121–130.
[116]
Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A compiler and runtime for heterogeneous systems. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, 49–68.
[117]
Leonid Ryzhyk and Mihai Budiu. 2019. Differential datalog.Datalog 2 (2019), 4–5.
[118]
Afaf G. Bin Saadon and Hoda M. O. Mokhtar. 2019. Survey on iterative and incremental approaches in distributed computing environment. Int. J. Data Sci. 4, 1 (2019), 18–30.
[119]
Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. 2013. The family of MapReduce and large-scale data processing systems. ACM Comput. Surv. 46, 1 (2013), 11.
[120]
Semih Salihoglu and Jennifer Widom. 2014. Optimizing graph algorithms on Pregel-like systems. Proc. VLDB Endow. 7, 7 (2014), 577–588.
[121]
Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, and Volker Markl. 2013. All roads lead to Rome: Optimistic recovery for distributed iterative data processing. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1919–1928.
[122]
Jiwon Seo, Stephen Guo, and Monica S. Lam. 2013. SociaLite: Datalog extensions for efficient social network analysis. In Proceedings of the IEEE 29th International Conference on Data Engineering (ICDE’13). IEEE, 278–289.
[123]
Prateek Sharma, Tian Guo, Xin He, David Irwin, and Prashant Shenoy. 2016. Flint: Batch-interactive data-intensive processing on transient servers. In Proceedings of the 11th European Conference on Computer Systems. ACM, 6.
[124]
Marianne Shaw, Paraschos Koutris, Bill Howe, and Dan Suciu. 2012. Optimizing large-scale Semi-Naïve datalog evaluation in Hadoop. In Proceedings of the International Datalog 2.0 Workshop. Springer, 165–176.
[125]
Avraham Shinnar, David Cunningham, Vijay Saraswat, and Benjamin Herta. 2012. M3R: increased performance for in-memory Hadoop jobs. Proc. VLDB Endow. 5, 12 (2012), 1736–1747.
[126]
Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with Datalog queries on Spark. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[127]
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST’10). IEEE, 1–10.
[128]
Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1–2 (2010), 703–710.
[129]
Marc Snir, William Gropp, Steve Otto, Steven Huss-Lederman, Jack Dongarra, and David Walker. 1998. MPI–the Complete Reference: The MPI Core. Vol. 1. The MIT Press.
[130]
Emad Soroush, Magdalena Balazinska, Simon Krughoff, and Andrew Connolly. 2015. Efficient iterative processing in the SciDB parallel array engine. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management. ACM, 39.
[131]
Quoc-Cuong To, Juan Soto, and Volker Markl. 2018. A survey of state management in big data processing systems. VLDB J. 27, 6 (2018), 847–872.
[132]
Laurence Tratt. 2008. Domain specific language implementation via compile-time meta-programming. ACM Trans. Prog. Lang. Syst. 30, 6 (2008), 1–40.
[133]
Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (1990), 103–111.
[134]
Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria big data management and analytics system and cloud services. In Proceedings of the Conference on Innovative Data Systems Research.
[135]
Jingjing Wang, Magdalena Balazinska, and Daniel Halperin. 2015. Asynchronous and fault-tolerant recursive Datalog evaluation in shared-nothing engines. Proc. VLDB Endow. 8, 12 (2015), 1542–1553.
[136]
Qiange Wang, Yanfeng Zhang, Hao Wang, Liang Geng, Rubao Lee, Xiaodong Zhang, and Ge Yu. 2020. Automating incremental and asynchronous evaluation for recursive aggregate data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 2439–2454.
[137]
Haijiang Wu, Jie Liu, Tao Wang, Dan Ye, Jun Wei, and Hua Zhong. 2016. Parallel materialization of Datalog programs with Spark for scalable reasoning. In Proceedings of the International Conference on Web Information Systems Engineering. Springer.
[138]
Chenning Xie, Rong Chen, Haibing Guan, Binyu Zang, and Haibo Chen. 2015. Sync or async: Time to fuse for distributed graph-parallel computation. In ACM SIGPLAN Notices, Vol. 50. ACM, 194–204.
[139]
Chen Xu, Markus Holzemer, Manohar Kaul, Juan Soto, and Volker Markl. 2017. On fault tolerance for distributed iterative dataflow processing. IEEE Trans. Knowl. Data Eng. 29, 8 (2017), 1709–1722.
[140]
Da Yan, Yingyi Bu, Yuanyuan Tian, and Amol Deshpande. 2017. Big graph analytics platforms. Found. Trends® Datab. 7, 1–2 (2017), 1–195.
[141]
Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, and Xiaoqiang Zheng. 2018. Dynamic control flow in large-scale machine learning. In Proceedings of the 13th EuroSys Conference. ACM, 18.
[142]
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, Vol. 8. 1–14.
[143]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
[144]
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10 (2010).
[145]
Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[146]
Qizhen Zhang, Akash Acharya, Hongzhi Chen, Simran Arora, Ang Chen, Vincent Liu, and Boon Thau Loo. 2019. Optimizing declarative graph queries at large scale. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1411–1428.
[147]
Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. 2011. PrIter: A distributed framework for prioritized iterative computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing. 1–14.
[148]
Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. 2012. iMapReduce: A distributed computing framework for iterative computation. J. Grid Comput. 10, 1 (2012), 47–68.
[149]
Kangfei Zhao and Jeffrey Xu Yu. 2017. All-in-one: Graph processing in RDBMSs revisited. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 1165–1180.

Published In

ACM Computing Surveys  Volume 54, Issue 9
December 2022
800 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3485140
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 October 2021
Accepted: 01 July 2021
Revised: 01 June 2021
Received: 01 November 2020
Published in CSUR Volume 54, Issue 9

Author Tags

  1. Control flow
  2. Iteration
  3. Distributed dataflows
  4. Programming models
  5. Higher-order functions

Qualifiers

  • Survey
  • Refereed

Funding Sources

  • German Federal Ministry of Education and Research as BIFOLD – Berlin Institute for the Foundations of Learning and Data

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 75
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 16 Feb 2025

Citations

Cited By

  • (2025) Efficient Iterative Programs with Distributed Data Collections. Journal of Logical and Algebraic Methods in Programming (101047). DOI: 10.1016/j.jlamp.2025.101047. Online publication date: Feb-2025
  • (2024) On the Semantic Overlap of Operators in Stream Processing Engines. Proceedings of the 25th International Middleware Conference (8-21). DOI: 10.1145/3652892.3654790. Online publication date: 2-Dec-2024
  • (2024) Optimizing Nested Recursive Queries. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639271. Online publication date: 26-Mar-2024
  • (2024) Analyzing the impact of various parameters on job scheduling in the Google cluster dataset. Cluster Computing 27:6 (7673-7687). DOI: 10.1007/s10586-024-04377-8. Online publication date: 1-Sep-2024
  • (2023) A Model and Survey of Distributed Data-Intensive Systems. ACM Computing Surveys 56:1 (1-69). DOI: 10.1145/3604801. Online publication date: 26-Aug-2023
  • (2022) The Emerging Role of the knowledge Driven Applications of Wireless Networks for Next Generation online Stream Processing. 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE) (172-176). DOI: 10.1109/ICACITE53722.2022.9823654. Online publication date: 28-Apr-2022
