CCFinder: using Spark to find clustering coefficient in big graphs

Alemi, Mehdi; Haghighi, Hassan; Shahrivari, Saeed

doi:10.1007/s11227-017-2040-8

Published: 12 April 2017

Volume 73, pages 4683–4710, (2017)

The Journal of Supercomputing

Mehdi Alemi¹,
Hassan Haghighi¹ &
Saeed Shahrivari²

Abstract

Networks with billions of vertices make it challenging to perform graph analysis in a reasonable time. The clustering coefficient is an important analytical measure of networks such as social and biological networks. Existing distributed algorithms for computing the clustering coefficient in big graphs suffer from low efficiency: they may fail because they demand too much memory, and even when they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute the clustering coefficient in very big graphs. CCFinder runs on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles by using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. Because data items in the FONL are fine-grained and contain only the minimum required information, CCFinder requires less storage space and achieves better parallelism than its competitors. To find the clustering coefficient, our triangle-counting solution is extended so that degree information of the vertices is available in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four of the competitors are based on graph processing systems, i.e., the GraphX, NScale, NScaleSpark, and Pregel frameworks; the other two, Cohen’s method and NodeIterator++, are based on MapReduce.
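
To make the link between triangle counting and the clustering coefficient concrete, the following is a minimal single-machine sketch in Python. It is not the CCFinder algorithm and does not reproduce the FONL structure, whose layout the abstract does not describe. It assumes the standard local clustering coefficient C(v) = 2T(v) / (d(v)(d(v) - 1)), where T(v) is the number of triangles through v and d(v) is its degree, and it uses the common degree-ordering technique so that each triangle is found exactly once. The names `local_clustering`, `rank`, and `higher` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def local_clustering(edges):
    """Return {vertex: local clustering coefficient} for an undirected simple graph.

    Illustrative sketch only: a degree-ordered triangle count on one machine,
    not the distributed CCFinder/FONL implementation.
    """
    # Build neighbor sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(ns) for v, ns in adj.items()}

    # Total order on vertices: by degree, ties broken by vertex id.
    def rank(v):
        return (degree[v], v)

    # Keep only higher-ranked neighbors, so each triangle is discovered
    # exactly once, starting from its lowest-ranked vertex.
    higher = {v: {w for w in ns if rank(w) > rank(v)} for v, ns in adj.items()}

    triangles = dict.fromkeys(adj, 0)
    for u, hu in higher.items():
        for v in hu:
            # A common higher-ranked neighbor w closes the triangle u-v-w.
            for w in hu & higher[v]:
                triangles[u] += 1
                triangles[v] += 1
                triangles[w] += 1

    # C(v) = 2 * T(v) / (d(v) * (d(v) - 1)), taken as 0 when d(v) < 2.
    return {
        v: 2.0 * triangles[v] / (degree[v] * (degree[v] - 1)) if degree[v] > 1 else 0.0
        for v in adj
    }


if __name__ == "__main__":
    # A triangle 1-2-3 with a pendant vertex 4 attached to vertex 3.
    print(local_clustering([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

For the small example graph, vertices 1 and 2 have coefficient 1.0, vertex 3 has 1/3 (one triangle out of three possible neighbor pairs), and vertex 4 has 0.0. The coefficient needs both the triangle count and the degree of each vertex, which is why a triangle-counting pipeline has to carry degree information alongside the neighbor lists.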

Author information

Authors and Affiliations

Faculty of Computer Science and Engineering, Shahid Beheshti University, G. C., Tehran, Iran
Mehdi Alemi & Hassan Haghighi
Department of Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran
Saeed Shahrivari

Corresponding author

Correspondence to Hassan Haghighi.

About this article

Cite this article

Alemi, M., Haghighi, H. & Shahrivari, S. CCFinder: using Spark to find clustering coefficient in big graphs. J Supercomput 73, 4683–4710 (2017). https://doi.org/10.1007/s11227-017-2040-8

Issue Date: November 2017
DOI: https://doi.org/10.1007/s11227-017-2040-8
