Abstract
This paper addresses the classical triangle listing problem: enumerating all triples of vertices that are pairwise connected by edges. The problem has been studied intensively in internal and external memory, but it remains a pressing challenge in distributed environments, where multiple machines across a network can be used to achieve good performance and scalability. MapReduce, one of the de facto computing methodologies in distributed environments, has been used in several existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large-scale graphs. In this paper, we propose a new MapReduce triangle listing algorithm, FTL, which uses a lightweight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and which employs multiple-round techniques to ease the burden on memory and network bandwidth when handling billion-scale graphs. We prove that the size of the intermediate data is bounded close to the number of triangles in the graph. To further reduce the shuffle size and memory cost, we also propose improved algorithms based on a compact data structure, and present several optimization techniques that accelerate the computation and reduce memory consumption. Extensive experimental results show that our algorithms are several times faster than existing competitors on both synthetic and real-world graphs.
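To make the problem statement concrete, the following is a minimal single-machine sketch of triangle listing in Python. It is not the FTL algorithm and involves no MapReduce; it is a common baseline that orients each edge from the lower-ranked to the higher-ranked endpoint (rank by degree, ties broken by vertex id) so that every triangle is reported exactly once. The function name `list_triangles` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def list_triangles(edges):
    """Return every triangle of an undirected simple graph exactly once.

    `edges` is an iterable of (u, v) pairs with comparable vertex ids
    (an assumed input format, not one taken from the paper).
    """
    # Build symmetric adjacency sets, ignoring self-loops.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Total order on vertices: by degree, ties broken by vertex id.
    rank = {v: (len(adj[v]), v) for v in adj}

    # Keep only neighbours of strictly higher rank (edge orientation).
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    triangles = []
    for u in adj:
        for v in out[u]:
            # A common higher-ranked neighbour w of u and v closes a triangle.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)


example_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
print(list_triangles(example_edges))  # [(1, 2, 3), (2, 3, 4)]
```

Because each triangle is discovered only from its lowest-ranked vertex, no deduplication step is needed. A distributed algorithm has to exchange the equivalent neighbourhood information between machines, which is where the shuffle cost discussed in the abstract arises.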
Acknowledgements
This work was partially supported by grants from the National Science Foundation of China (61502349), the Hubei Provincial Natural Science Foundation of China (2015CFB339), the Scientific and Technologic Development Programme of Suzhou (SYG201442), the Research Grants Council of Hong Kong (14209314 and 14221716), a Chinese University of Hong Kong Direct Grant (4055048), and the Australian Research Council (DE140100999 and DP160101513).
Cite this article
Zhu, Y., Zhang, H., Qin, L. et al. Efficient MapReduce algorithms for triangle listing in billion-scale graphs. Distrib Parallel Databases 35, 149–176 (2017). https://doi.org/10.1007/s10619-017-7193-1
DOI: https://doi.org/10.1007/s10619-017-7193-1