Abstract
Using the MapReduce framework for iterative graph algorithms is challenging. High performance requires limiting both the amount of intermediate results and the number of iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance, including a memory-based connection of subgraphs in the map phase. Our evaluation on several large graph datasets shows that the improvements reduce the amount of generated data by up to a factor of 8.8 and the runtime by up to a factor of 3.5.
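To make the idea of connecting subgraphs in memory during the map phase more concrete, the following is a minimal single-process sketch. It is not CC-MR and not the authors' implementation: the function names (`map_partition`, `connected_components`), the use of a union-find structure, and the single merge step standing in for the reduce side are all assumptions made for illustration. Each simulated map task links the edges of its own partition locally and emits one (vertex, representative) pair per vertex rather than one record per edge, which is one way intermediate output could be kept small.

```python
from collections import defaultdict


def find(parent, v):
    # Follow parent links to the representative, halving the path on the way.
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def union(parent, a, b):
    # Link the larger root under the smaller one, so that the representative
    # of every set is always its smallest vertex id.
    ra, rb = find(parent, a), find(parent, b)
    if ra < rb:
        parent[rb] = ra
    elif rb < ra:
        parent[ra] = rb


def map_partition(edges):
    # Hypothetical map task: connect the subgraphs of one input partition in
    # memory and emit (vertex, representative) pairs.
    parent = {}
    for u, v in edges:
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        union(parent, u, v)
    return sorted((v, find(parent, v)) for v in parent)


def connected_components(partitions):
    # Simulated shuffle and merge: the pairs emitted by all map tasks are
    # treated as edges and connected once more. A real MapReduce job would
    # distribute this step and might need several iterations (assumption).
    pairs = [pair for part in partitions for pair in map_partition(part)]
    components = defaultdict(set)
    for v, root in map_partition(pairs):
        components[root].add(v)
    return sorted(sorted(c) for c in components.values())


if __name__ == "__main__":
    partitions = [[(1, 2), (3, 4)], [(2, 3), (5, 6)]]
    print(connected_components(partitions))
```

In this toy input, the edge (2, 3) in the second partition joins two subgraphs that the first partition could only connect separately, so the merge step is what produces the final components.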

Cite this article
Kolb, L., Sehili, Z. & Rahm, E. Iterative Computation of Connected Graph Components with MapReduce. Datenbank Spektrum 14, 107–117 (2014). https://doi.org/10.1007/s13222-014-0154-1
DOI: https://doi.org/10.1007/s13222-014-0154-1