Processing Grid-format Real-world Graphs on DRAM-based FPGA Accelerators with Application-specific Caching Mechanisms

Published: 03 June 2020

Abstract

Graph processing is an important research topic in the big-data era. To build a general graph-processing framework on a DRAM-based FPGA board with a deep memory hierarchy, one reasonable method is to partition a large graph into many small subgraphs, represent the graph as a two-dimensional grid, and process the subgraphs one after another in a divide-and-conquer fashion. This method (grid-graph processing) stores the graph data in off-chip memory (e.g., on-board or host DRAM), which has large capacity but relatively low bandwidth, and processes each small subgraph in on-chip memory (e.g., FFs, BRAM, and URAM), which has small capacity but superior random-access performance. However, exchanging graph (vertex and edge) data directly between the processing units in the FPGA chip and the slow off-chip DRAM during grid-graph processing limits performance and causes excessive data transmission between the FPGA chip and the off-chip memory.
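The grid representation described above can be illustrated with a minimal software sketch. It assumes vertices are numbered 0..N-1 and split into p equal intervals, so that the edge (src, dst) falls into the block indexed by (interval of src, interval of dst). The function names `grid_partition` and `grid_bfs` are hypothetical and are not taken from the article; the BFS loop is a plain label-correcting sweep over the blocks, meant only to show the "one subgraph after another" processing order, not the accelerator's actual pipeline.

```python
from collections import defaultdict


def grid_partition(edges, num_vertices, p):
    """Split edges into a p x p grid keyed by (source interval, destination interval)."""
    interval = -(-num_vertices // p)  # ceiling division: vertices per interval
    grid = defaultdict(list)
    for src, dst in edges:
        grid[(src // interval, dst // interval)].append((src, dst))
    return grid, interval


def grid_bfs(edges, num_vertices, root, p):
    """Compute BFS levels by sweeping the grid block by block until nothing changes."""
    grid, _ = grid_partition(edges, num_vertices, p)
    inf = float("inf")
    level = [inf] * num_vertices
    level[root] = 0
    changed = True
    while changed:
        changed = False
        # Process one small subgraph (block) at a time, row by row.
        for i in range(p):
            for j in range(p):
                for src, dst in grid.get((i, j), ()):
                    if level[src] + 1 < level[dst]:
                        level[dst] = level[src] + 1
                        changed = True
    return level


if __name__ == "__main__":
    demo_edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 3), (5, 0)]
    print(grid_bfs(demo_edges, num_vertices=6, root=0, p=2))
```

In this sketch, each block only touches one source interval and one destination interval of vertex data, which is the property that lets a small on-chip memory hold the working set of a single block.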
In this article, we show that the performance of grid-graph processing on DRAM-based FPGA hardware accelerators can be improved effectively by leveraging the flexibility and programmability of FPGAs to build application-specific caching mechanisms. These mechanisms bridge the performance gap between on-chip and off-chip memory and reduce the amount of data transmitted by exploiting locality in data accesses. We design two application-specific caching mechanisms (vertex caching and edge caching) that exploit two types of locality in grid-graph processing (vertex locality and subgraph locality, respectively). Experimental results show that, with the vertex caching mechanism, our system (named FabGraph) achieves speedups of up to 3.1× for BFS and 2.5× for PageRank over ForeGraph when processing medium graphs stored in on-board DRAM. With the edge caching mechanism, the extension of FabGraph (named FabGraph+) achieves speedups of up to 9.96× for BFS over FPGP when processing large graphs stored in host DRAM.
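To give a rough feel for why caching vertex data can cut off-chip traffic, the following toy model counts how many vertex-interval loads a row-major sweep over a p × p grid would issue with and without a small on-chip cache. Everything here is an assumption made for illustration: the function name `count_interval_loads`, the LRU replacement policy, and the row-major block order are not taken from the article and do not describe FabGraph's actual design.

```python
from collections import OrderedDict


def count_interval_loads(p, cache_slots):
    """Count off-chip loads of vertex intervals for a row-major sweep of a p x p grid.

    Each block (i, j) needs source interval i and destination interval j.
    The cache holds up to `cache_slots` intervals with LRU replacement
    (an assumed policy, chosen only to keep the model simple).
    """
    cache = OrderedDict()
    loads = 0
    for i in range(p):
        for j in range(p):
            for interval in (i, j):
                if interval in cache:
                    cache.move_to_end(interval)  # hit: mark as most recently used
                else:
                    loads += 1  # miss: fetch the interval from off-chip memory
                    cache[interval] = True
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return loads


if __name__ == "__main__":
    for slots in (0, 2, 4):
        print(slots, count_interval_loads(p=4, cache_slots=slots))
```

With no cache slots, every block reloads both of its intervals; once the cache can hold all p intervals, each interval is fetched only once. Real hardware behaviour depends on factors this model ignores, such as interval sizes, write-back of updated vertices, and bandwidth.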


      Published In

      ACM Transactions on Reconfigurable Technology and Systems  Volume 13, Issue 3
      September 2020
      182 pages
      ISSN:1936-7406
      EISSN:1936-7414
      DOI:10.1145/3404107
      Editor: Deming Chen

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 June 2020
      Online AM: 07 May 2020
      Accepted: 01 March 2020
      Revised: 01 February 2020
      Received: 01 November 2019
      Published in TRETS Volume 13, Issue 3

      Author Tags

      1. Hardware accelerators
      2. graph analytics
      3. large graph processing

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Key Research and Development Program of China
      • National Natural Science Foundation of China
