Skip to main content

Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

  • Conference paper
Book cover Datalog in Academia and Industry (Datalog 2.0 2012)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7494))

Included in the following conference series:

Abstract

We explore the design and implementation of a scalable Datalog system using Hadoop as the underlying runtime system. Observing that several successful projects provide a relational algebra-based programming interface to Hadoop, we argue that a natural extension is to add recursion to support scalable social network analysis, internet traffic analysis, and general graph query. We implement semi-naive evaluation in Hadoop, then apply a series of optimizations spanning fundamental changes to the Hadoop infrastructure to basic configuration guidelines that collectively offer a 10x improvement in our experiments. This work lays the foundation for a more comprehensive cost-based algebraic optimization framework for parallel recursive Datalog queries.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 49.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)

    Google Scholar 

  2. Afrati, F.N., Borkar, V., Carey, M., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT 2011, pp. 1–8. ACM, New York (2011)

    Chapter  Google Scholar 

  3. Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J.M., Sears, R.: Boom analytics: exploring data-centric, declarative programming for the cloud. In: Morin, C., Muller, G. (eds.) EuroSys, pp. 223–236. ACM (2010)

    Google Scholar 

  4. Balduccini, M., Pontelli, E., Elkhatib, O., Le, H.: Issues in parallel execution of non-monotonic reasoning systems. Parallel Comput. 31(6), 608–647 (2005)

    Article  Google Scholar 

  5. Barga, R., Ekanayake, J., Jackson, J., Lu, W.: Daytona: Iterative mapreduce on windows azure, http://research.microsoft.com/en-us/projects/daytona/

  6. Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K.-L. (eds.) ICDE, pp. 1151–1162. IEEE Computer Society (2011)

    Google Scholar 

  7. Bu, Y., Howe, B., Balazinska, M., Ernst, M.: Haloop: Efficient iterative data processing on large clusters. In: Proc. of International Conf. on Very Large Databases, VLDB (2010)

    Google Scholar 

  8. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)

    Google Scholar 

  9. Eisner, J., Filardo, N.W.: Dyna: Extending Datalog for Modern AI. In: de Moor, O., Gottlob, G., Furche, T., Sellers, A. (eds.) Datalog 2010. LNCS, vol. 6702, pp. 181–220. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  10. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: HPDC, pp. 810–818 (2010)

    Google Scholar 

  11. Hadoop, http://hadoop.apache.org/

  12. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Ferreira, P., Gross, T.R., Veiga, L. (eds.) EuroSys, pp. 59–72. ACM (2007)

    Google Scholar 

  13. Joslyn, C., Adolf, R., Al Saffar, S., Feo, J., Goodman, E., Haglin, D., Mackey, G., Mizell, D.: High performance semantic factoring of giga-scale semantic graph databases. In: Semantic Web Challenge Billion Triple Challenge (2010)

    Google Scholar 

  14. Loo, B.T., Gill, H., Liu, C., Mao, Y., Marczak, W.R., Sherr, M., Wang, A., Zhou, W.: Recent Advances in Declarative Networking. In: Russo, C., Zhou, N.-F. (eds.) PADL 2012. LNCS, vol. 7149, pp. 1–16. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  15. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Elmagarmid, A.K., Agrawal, D. (eds.) SIGMOD Conference, pp. 135–146. ACM (2010)

    Google Scholar 

  16. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Wang, J.T.-L. (ed.) SIGMOD Conference, pp. 1099–1110. ACM (2008)

    Google Scholar 

  17. Perri, S., Ricca, F., Sirianni, M.: A parallel asp instantiator based on dlv. In: Proceedings of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming, DAMP 2010, pp. 73–82. ACM, New York (2010)

    Chapter  Google Scholar 

  18. Shaw, M., Detwiler, L., Noy, N., Brinkley, J.: S.D.: vsparql: A view definition language for the semantic web. Journal of Biomedical Informatics (2010), doi:10.1016/j.jbi, 08.008

    Google Scholar 

  19. Zaharia, M., Chowdhury, N.M.M., Franklin, M., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley (May 2010)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shaw, M., Koutris, P., Howe, B., Suciu, D. (2012). Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In: Barceló, P., Pichler, R. (eds) Datalog in Academia and Industry. Datalog 2.0 2012. Lecture Notes in Computer Science, vol 7494. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32925-8_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-32925-8_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32924-1

  • Online ISBN: 978-3-642-32925-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics