An adaptive breadth-first search algorithm on integrated architectures

Zhang, Feng; Lin, Heng; Zhai, Jidong; Cheng, Jie; Xiang, Dingyi; Li, Jizhong; Chai, Yunpeng; Du, Xiaoyong

doi:10.1007/s11227-018-2525-0

An adaptive breadth-first search algorithm on integrated architectures

Published: 11 August 2018

Volume 74, pages 6135–6155, (2018)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Feng Zhang ORCID: orcid.org/0000-0003-1983-7321^1,2,
Heng Lin³,
Jidong Zhai³,
Jie Cheng⁴,
Dingyi Xiang⁴,
Jizhong Li⁴,
Yunpeng Chai^1,2 &
…
Xiaoyong Du^1,2

763 Accesses
Explore all metrics

Abstract

In the big data era, graph applications are becoming increasingly important for data analysis. Breadth-first search (BFS) is one of the most representative algorithms; therefore, accelerating BFS using graphics processing units (GPUs) is a hot research topic. However, due to their random data access pattern, it is difficult to take full advantage of the power of GPUs. Recently, hardware designers have integrated CPUs and GPUs on the same chip, allowing both devices to share physical memory, which provides the convenience of switching between CPUs and GPUs with little cost. BFS processing can be divided into several levels, and various traversal orders can be used at each level. Using different traversal orders on different devices (CPUs or GPUs) results in diverse performances. Thus, the challenge in using BFS on integrated architectures is how to select the traversal order and the device for each level. Previous works have failed to address this problem effectively. In this study, we propose an adaptive performance model that automatically finds a suitable traversal order and device for each level. We evaluated our method on Graph500, where it not only shows the best energy efficiency but also achieves a giga-traversed edges per second (GTEPS) performance of approximately 2.1 GTEPS, which is a $2.3\,\times $ speed improvement over the state-of-the-art BFS on integrated architectures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Efficient Graph Accelerator with Distributed On-Chip Memory Hierarchy

Accelerating Direction-Optimized Breadth First Search on Hybrid Architectures

Fast, Scalable, and Energy-Efficient Parallel Breadth-First Search

References

Agarwal V, Petrini F, Pasetto D, Bader DA (2010) Scalable graph exploration on multicore processors. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, pp 1–11
Ashari A, Sedaghati N, Eisenlohr J, Parthasarath S, Sadayappan P (2014) Fast sparse matrix–vector multiplication on GPUs for graph applications. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC14. IEEE, pp 781–792
AMD (2018) AMD Ryzen 5 2400G with Radeon RX Vega 11 Graphics. https://www.amd.com/en/products/apu/amd-ryzen-5-2400g
Beamer S, Asanović K, Patterson D (2013) Direction-optimizing breadth-first search. Sci Program 21(3–4):137–148
Google Scholar
Bouvier D, Sander B (2014) Applying AMDs Kaveri APU for heterogeneous computing. In: Hot Chips: A Symposium on High Performance Chips (HC26)
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177
Article Google Scholar
Branover A, Foley D, Steinman M (2012) AMD fusion APU: Llano. IEEE Micro 32(2):28–37
Article Google Scholar
Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. Comput Netw 33(1):309–320
Article Google Scholar
Chakrabarti D, Zhan Y, Faloutsos C (2004) R-MAT: a recursive model for graph mining. In: SDM, vol 4. SIAM, pp 442–446
Chhugani J, Satish N, Kim C, Sewall J, Dubey P (2012) Fast and efficient graph traversal algorithm for CPUs: maximizing single-node efficiency. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS). IEEE, pp 378–389
Cormen TH (2009) Introduction to algorithms. MIT Press, Cambridge
MATH Google Scholar
Daga M, Nutter M, Meswani M (2014) Efficient breadth-first search on a heterogeneous processor. In: 2014 IEEE International Conference on Big Data (Big Data). IEEE, pp 373–382
Dongarra JJ, Meuer HW, Strohmaier E et al (1997) Top500 supercomputer sites. Supercomputer 13:89–111
Google Scholar
Erdös Rényi (1959) On random graphs I. Publ Math Debr 6:290–297
MATH Google Scholar
Hong S, Kim SK, Oguntebi T, Olukotun K (2011) Accelerating CUDA graph algorithms at maximum warp. In: ACM SIGPLAN Notices, vol 46. ACM, pp 267–276
Hong S, Oguntebi T, Olukotun K (2011) Efficient parallel graph exploration on multi-core CPU and GPU. In: 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, pp 78–88
Intel Corporation (2014) The compute architecture of Intel processor graphics Gen7.5. https://software.intel.com/
Jensen TR, Toft B (2011) Graph coloring problems, vol 39. Wiley, London
MATH Google Scholar
Kepner J, Gilbert J (2011) Graph algorithms in the language of linear algebra. SIAM, Philadelphia
Book Google Scholar
Korf RE (1985) Depth-first iterative-deepening: an optimal admissible tree search. Artif Intell 27(1):97–109
Article MathSciNet Google Scholar
Korf RE, Schultze P (2005) Large-scale parallel breadth-first search. In: Association for the Advancement of Artificial Intelligence (AAAI), vol 5, pp 1380–1385
Kumar P, Huang HH (2016) G-store: high-performance graph store for trillion-edge processing. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC16. IEEE, pp 830–841
Li J, Tan G, Chen M, Sun N (2013) SMAT: an input adaptive auto-tuner for sparse matrix–vector multiplication. In: ACM SIGPLAN Notices, vol 48. ACM, pp 117–126
Liu H, Huang HH (2015) Enterprise: breadth-first graph traversal on GPUs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, p 68
Liu H, Huang HH (2017) Graphene: fine-grained IO management for graph computing. In: USENIX Conference on File and Storage Technologies (FAST), pp 285–300
Liu H, Huang HH, Hu Y (2016) iBFS: concurrent breadth-first search on GPUs. In: Proceedings of the 2016 International Conference on Management of Data. ACM, pp 403–416
Liu T, Chen CC, Kim W, Milor L (2015) Comprehensive reliability and aging analysis on SRAMs within microprocessor systems. Microelectron Reliab 55(9):1290–1296
Article Google Scholar
Liu T, Chen CC, Wu J, Milor L (2016) Sram stability analysis for different cache configurations due to bias temperature instability and hot carrier injection. In: 2016 IEEE 34th International Conference on Computer Design (ICCD). IEEE, pp 225–232
Liu W, Vinter B (2015) A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors. J Parallel Distrib Comput 85:47–61
Article Google Scholar
Liu W, Vinter B (2015) CSR5: an efficient storage format for cross-platform sparse matrix–vector multiplication. In: Proceedings of the 29th ACM on International Conference on Supercomputing. ACM, pp 339–350
Liu W, Vinter B (2015) Speculative segmented sum for sparse matrix–vector multiplication on heterogeneous processors. Parallel Comput 49:179–193
Article MathSciNet Google Scholar
Luo L, Wong M, Hwu W (2010) An effective GPU implementation of breadth-first search. In: Proceedings of the 47th Design Automation Conference. ACM, pp 52–55
Merrill D, Garland M, Grimshaw A (2012) Scalable GPU graph traversal. In: ACM SIGPLAN Notices, vol 47. ACM, pp 117–128
Murphy RC, Wheeler KB, Barrett BW, Ang JA (2010) Introducing the Graph 500. In: Cray Users Group (CUG) Proceedings
YOKOGAWA (2017) WT210/WT230 digital power meters. http://tmi.yokogawa.com/products/digital-power-analyzers/
Nikolskiy VP, Stegailov VV, Vecher VS (2016) Efficiency of the Tegra K1 and X1 systems-on-chip for classical molecular dynamics. In: 2016 International Conference on High Performance Computing and Simulation (HPCS). IEEE, pp 682–689
Pearce R, Gokhale M, Amato NM (2013) Scaling techniques for massive scale-free graphs in distributed (external) memory. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, pp 825–836
Saad Y (1990) SPARSKIT: a basic tool kit for sparse matrix computations. NASA technical report, NASA, pp 1–30
Satish N, Sundaram N, Patwary MMA, Seo J, Park J, Hassaan MA, Sengupta S, Yin Z, Dubey P (2014) Navigating the maze of graph analytics frameworks using massive graph datasets. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, pp 979–990
Scarpazza DP, Villa O, Petrini F (2008) Efficient breadth-first search on the Cell/BE processor. IEEE Trans Parallel Distrib Syst 19(10):1381–1395
Article Google Scholar
Sedaghati N, Mu T, Pouchet LN, Parthasarathy S, Sadayappan P (2015) Automatic selection of sparse matrix representation on GPUs. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, pp 99–108
Shi X, Zheng Z, Zhou Y, Jin H, He L, Liu B, Hua QS (2018) Graph processing on GPUs: a survey. ACM Comput Surv 50(6):81
Article Google Scholar
Stone JE, Gohara D, Shi G (2010) OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 12(3):66–73
Article Google Scholar
Su BY, Keutzer K (2012) clSpMV: a cross-platform OpenCL SpMV framework on GPUs. In: Proceedings of the 26th ACM International Conference on Supercomputing. ACM, pp 353–364
Wang X, Liu W, Xue W, Wu L (2018) swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, pp 338–353
Wang Y, Davidson A, Pan Y, Wu Y, Riffel A, Owens JD (2016) Gunrock: a high-performance graph processing library on the GPU. In: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, p 11
Yan S, Li C, Zhang Y, Zhou H (2014) yaSpMV: yet another SpMV framework on GPUs. In: ACM SIGPLAN Notices, vol 49. ACM, pp 107–118
Yang C, Buluc A, Owens JD (2018) Implementing push–pull efficiently in GraphBLAS. In: International Conference on Parallel Processing (ICPP)
Yasui Y, Fujisawa K (2015) Fast and scalable NUMA-based thread parallel breadth-first search. In: 2015 International Conference on High Performance Computing and Simulation (HPCS). IEEE, pp 377–385
Zhang F, Zhai J, Chen W, He B, Zhang S (2015) To co-run, or not to co-run: a performance study on integrated architectures. In: 2015 IEEE 23rd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, pp 89–92
Zhang F, Wu B, Zhai J, He B, Chen W (2017) FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures. In: International Symposium on Code Generation and Optimization (CGO). IEEE Press, pp 27–38
Zhang F, Zhai J, He B, Zhang S, Chen W (2017) Understanding co-running behaviors on integrated CPU/GPU architectures. IEEE Trans Parallel Distrib Syst 28(3):905–918
Article Google Scholar
Zhang R, Liu T, Yang K, Milor L (2017) Analysis of time-dependent dielectric breakdown induced aging of SRAM cache with different configurations. Microelectron Reliab 76:87–91
Article Google Scholar
Zhong J, He B (2014) Medusa: simplified graph processing on GPUs. IEEE Trans Parallel Distrib Syst 25(6):1543–1552
Article MathSciNet Google Scholar

Download references

Acknowledgements

The authors sincerely thank the anonymous reviewers for their valuable comments and suggestions. This work is partially supported by the National Key R&D Program of China (Grant No. 2016YFB0200100), National Natural Science Foundation of China (Grant Nos. 61732014, 61722208, 61472201, and 61472427). This work is also supported by Huawei Technologies Co. Ltd, Beijing Natural Science Foundation (No. 4172031), China Postdoctoral Science Foundation (2017M620992), and the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (Nos. 16XNLQ02, 18XNLG07). Jidong Zhai is the corresponding author of this paper.

Author information

Authors and Affiliations

Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China, Beijing, China
Feng Zhang, Yunpeng Chai & Xiaoyong Du
School of Information, Renmin University of China, Beijing, China
Feng Zhang, Yunpeng Chai & Xiaoyong Du
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Heng Lin & Jidong Zhai
Huawei Technologies Co., Ltd., Shenzhen, China
Jie Cheng, Dingyi Xiang & Jizhong Li

Authors

Feng Zhang
View author publications
You can also search for this author inPubMed Google Scholar
Heng Lin
View author publications
You can also search for this author inPubMed Google Scholar
Jidong Zhai
View author publications
You can also search for this author inPubMed Google Scholar
Jie Cheng
View author publications
You can also search for this author inPubMed Google Scholar
Dingyi Xiang
View author publications
You can also search for this author inPubMed Google Scholar
Jizhong Li
View author publications
You can also search for this author inPubMed Google Scholar
Yunpeng Chai
View author publications
You can also search for this author inPubMed Google Scholar
Xiaoyong Du
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Jidong Zhai.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, F., Lin, H., Zhai, J. et al. An adaptive breadth-first search algorithm on integrated architectures. J Supercomput 74, 6135–6155 (2018). https://doi.org/10.1007/s11227-018-2525-0

Download citation

Published: 11 August 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11227-018-2525-0

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An adaptive breadth-first search algorithm on integrated architectures

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

An Efficient Graph Accelerator with Distributed On-Chip Memory Hierarchy

Accelerating Direction-Optimized Breadth First Search on Hybrid Architectures

Fast, Scalable, and Energy-Efficient Parallel Breadth-First Search

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now