Abstract
The cluster-computing environment typified by Hadoop, the open-source implementation of map-reduce, is receiving serious attention as the way to execute queries and other operations on very large-scale data. Datalog execution presents several unusual issues for this environment. We discuss the best way to execute a round of seminaive evaluation on a computing cluster using map-reduce. Using transitive closure as an example, we examine the cost of executing recursions in several different ways. Recursive processes, such as the evaluation of a recursive Datalog program, do not fit the key map-reduce assumption that tasks deliver output only when they are completed. As a result, the resilience under compute-node failure that is a key element of the map-reduce framework is not supported for recursive programs. We discuss extensions to this framework that are suitable for executing recursive Datalog programs on very large-scale data in a way that allows progress to continue after node failures, without restarting the entire job.
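For readers unfamiliar with seminaive evaluation, the following is a minimal single-machine sketch of the idea applied to transitive closure. It is not the chapter's cluster algorithm and does not use map-reduce; the function name `transitive_closure_seminaive` and the linear rule shown in the comments are assumptions chosen for illustration. The point it shows is that each round joins only the facts discovered in the previous round (the delta) with the edge relation, rather than rejoining every fact derived so far.

```python
def transitive_closure_seminaive(edges):
    """Return the transitive closure of a set of (src, dst) pairs.

    Illustrative rules (an assumption, not taken from the chapter):
        path(x, y) :- edge(x, y).
        path(x, z) :- path(x, y), edge(y, z).
    """
    edges = set(edges)

    # Index edges by source so each join step is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    closure = set(edges)  # every path fact derived so far
    delta = set(edges)    # facts that were new in the previous round

    while delta:
        # One round: join only the new facts with the edge relation.
        derived = {
            (x, z)
            for (x, y) in delta
            for z in successors.get(y, ())
        }
        # Keep only facts not seen before; they form the next delta.
        delta = derived - closure
        closure |= delta

    return closure


if __name__ == "__main__":
    chain = {(1, 2), (2, 3), (3, 4)}
    print(sorted(transitive_closure_seminaive(chain)))
```

The loop stops when a round produces no new facts. In a cluster setting, each iteration of the loop would correspond to one distributed round, which is where the cost and failure-recovery questions raised in the abstract come in.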
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Afrati, F.N., Borkar, V., Carey, M., Polyzotis, N., Ullman, J.D. (2011). Cluster Computing, Recursion and Datalog. In: de Moor, O., Gottlob, G., Furche, T., Sellers, A. (eds) Datalog Reloaded. Datalog 2.0 2010. Lecture Notes in Computer Science, vol 6702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24206-9_8
DOI: https://doi.org/10.1007/978-3-642-24206-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24205-2
Online ISBN: 978-3-642-24206-9
eBook Packages: Computer Science (R0)