Iterative Computation of Connected Graph Components with MapReduce

  • Focus article (Schwerpunktbeitrag)
Datenbank-Spektrum

Abstract

Using the MapReduce framework for iterative graph algorithms is challenging. To achieve high performance, it is critical to limit both the amount of intermediate results and the number of iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance, including a memory-based connection of subgraphs in the map phase. Our evaluation with several large graph datasets shows that the improvements can reduce the amount of generated data by up to a factor of 8.8 and the runtime by up to a factor of 3.5.
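To make the two cost factors named above concrete, here is a minimal sketch of connected components computed by iterative minimum-label propagation, with the map and reduce phases emulated in a single Python process. It is not CC-MR and not the improved algorithm of this article; it is a simple baseline chosen only for illustration. The function names (`map_phase`, `reduce_phase`, `connected_components`) and the two counters are assumptions introduced here.

```python
from collections import defaultdict


def map_phase(edges, labels):
    """Emit (vertex, candidate_label) pairs.

    Each vertex receives the current label of every neighbor and its own
    label, so the reducer can pick the smallest one.
    """
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]
    for vertex, label in labels.items():
        yield vertex, label


def reduce_phase(pairs):
    """Group candidate labels by vertex and keep the minimum per vertex."""
    groups = defaultdict(list)
    for vertex, label in pairs:
        groups[vertex].append(label)
    return {vertex: min(cands) for vertex, cands in groups.items()}


def connected_components(edges):
    """Repeat map/reduce rounds until no label changes.

    Returns (labels, iterations, emitted), where `emitted` counts all
    intermediate pairs produced by the emulated map phase.
    """
    labels = {v: v for edge in edges for v in edge}  # start: own id as label
    iterations = 0
    emitted = 0
    while True:
        pairs = list(map_phase(edges, labels))
        emitted += len(pairs)
        new_labels = reduce_phase(pairs)
        iterations += 1
        if new_labels == labels:  # fixpoint: every component shares one label
            return labels, iterations, emitted
        labels = new_labels


if __name__ == "__main__":
    # Hypothetical toy graph: a path 1-2-3-4 and a separate edge 5-6.
    labels, iterations, emitted = connected_components(
        [(1, 2), (2, 3), (3, 4), (5, 6)]
    )
    print(labels, iterations, emitted)
```

In this baseline, the number of rounds grows with the longest path in a component, and every round re-emits the whole edge list. These are the kinds of costs (iteration count and intermediate data volume) that the abstract identifies as critical.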

Notes

  1. http://snap.stanford.edu/data/

Author information

Corresponding author

Correspondence to Lars Kolb.

About this article

Cite this article

Kolb, L., Sehili, Z. & Rahm, E. Iterative Computation of Connected Graph Components with MapReduce. Datenbank Spektrum 14, 107–117 (2014). https://doi.org/10.1007/s13222-014-0154-1

  • DOI: https://doi.org/10.1007/s13222-014-0154-1
