Abstract
Hadoop-based data processing platforms translate join-intensive queries into multiple “jobs” (MapReduce cycles). Such multi-job workflows move a significant amount of data through the disk, network, and memory fabric of a Hadoop cluster, which can hurt performance and scalability. Consequently, techniques that minimize the size of intermediate results are useful in this context. In this paper, we present an information passing technique (HIP) that can minimize the size of intermediate data on Hadoop-based data processing platforms.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hong, S., Anyanwu, K. (2012). HIP: Information Passing for Optimizing Join-Intensive Data Processing Workloads on Hadoop. In: Liddle, S.W., Schewe, KD., Tjoa, A.M., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2012. Lecture Notes in Computer Science, vol 7447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32597-7_33
DOI: https://doi.org/10.1007/978-3-642-32597-7_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32596-0
Online ISBN: 978-3-642-32597-7
eBook Packages: Computer Science (R0)