Abstract
We study parallel connected components algorithms on GPUs in comparison with CPUs. Although a straightforward implementation of PRAM algorithms performs relatively better on GPUs than on CPUs, the performance of the GPU memory subsystem is poor because of non-coalesced random accesses.
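To illustrate where the random accesses come from, here is a minimal sequential sketch of a hook-and-shortcut connected components procedure in the style of PRAM connectivity algorithms. It is not the implementation evaluated in the paper; the function name `connected_components` and the edge-list input format are assumptions made for this example. On a GPU, each iteration of the two inner loops would roughly correspond to one thread, and the reads of `parent[u]`, `parent[v]`, and `parent[parent[x]]` depend on the graph structure, so neighboring threads generally touch unrelated memory locations.

```python
def connected_components(num_vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Hypothetical helper for illustration: `edges` is a list of (u, v) pairs.
    """
    parent = list(range(num_vertices))  # every vertex starts as its own root
    changed = True
    while changed:
        changed = False
        # Hooking: attach the root with the larger id under the smaller label.
        for u, v in edges:
            pu, pv = parent[u], parent[v]  # data-dependent, scattered reads
            if pu == pv:
                continue
            hi, lo = max(pu, pv), min(pu, pv)
            if parent[hi] == hi:  # hook only vertices that are still roots
                parent[hi] = lo
                changed = True
        # Shortcutting (pointer jumping): flatten every tree into a star.
        for x in range(num_vertices):
            while parent[x] != parent[parent[x]]:
                parent[x] = parent[parent[x]]  # another scattered access
    return parent


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)  # [0, 0, 0, 3, 3, 5]
```

Links always point from a larger id to a smaller one, so no cycles can form, and the loop stops once both endpoints of every edge carry the same label.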
We argue that generic sort-based access coalescing is too costly on GPUs. We propose a new coalescing technique and a new meta algorithm to improve locality and performance. Our optimization achieves a speedup of up to 2.7 times over the straightforward implementation. Interestingly, our optimization also works well on CPUs.
Comparing the best-performing algorithms on GPUs and CPUs, we find that our new algorithm is the fastest on GPUs and the second fastest on CPUs, whereas the parallel Rem’s algorithm is the fastest on CPUs but performs poorly on GPUs because of path divergence.
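For readers unfamiliar with Rem's algorithm, the following is a minimal sequential sketch of its union step with splicing on a disjoint-set forest. It is not the parallel variant compared in the paper; the names `rem_union` and `count_components` are hypothetical and chosen for this example. The `while` loop runs a data-dependent number of iterations for each edge, which is one plausible reading of why threads processing different edges would follow different paths on a GPU.

```python
def rem_union(parent, u, v):
    """Merge the sets containing u and v; return True if they were separate.

    Invariant assumed here: parent[x] >= x, and roots satisfy parent[x] == x.
    """
    ru, rv = u, v
    while parent[ru] != parent[rv]:
        if parent[ru] < parent[rv]:
            if ru == parent[ru]:          # ru is a root: link it and stop
                parent[ru] = parent[rv]
                return True
            nxt = parent[ru]              # splice: redirect ru, then climb
            parent[ru] = parent[rv]
            ru = nxt
        else:
            if rv == parent[rv]:          # rv is a root: link it and stop
                parent[rv] = parent[ru]
                return True
            nxt = parent[rv]
            parent[rv] = parent[ru]
            rv = nxt
    return False                          # u and v were already connected


def count_components(num_vertices, edges):
    """Count connected components by unioning the endpoints of every edge."""
    parent = list(range(num_vertices))
    merges = sum(rem_union(parent, u, v) for u, v in edges)
    return num_vertices - merges


print(count_components(6, [(0, 1), (1, 2), (3, 4), (2, 0)]))  # 3
```

Each successful union reduces the number of components by one, so the count is the number of vertices minus the number of unions that returned True.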
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cong, G., Muzio, P. (2014). Fast Parallel Connected Components Algorithms on GPUs. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_14
DOI: https://doi.org/10.1007/978-3-319-14325-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer Science, Computer Science (R0)