Journal of Systems Architecture

Volume 70, October 2016, Pages 59-69
A comprehensive reconfigurable computing approach to memory wall problem of large graph computation

https://doi.org/10.1016/j.sysarc.2016.04.010

Highlights

  • The extension of the edge-streaming model with massive partitions to accelerate large graph computing in reconfigurable hardware.

  • A two-level shuffle network architecture to reduce the on-chip memory requirement while providing high processing throughput.

  • A compact storage design using graph compression and corresponding codec hardware to reduce the amount of transferred data.

  • Up to 3.85× improvement in performance-to-bandwidth ratio over state-of-the-art hardware implementations.

Abstract

Graph computation problems that exhibit irregular memory access patterns are known to perform poorly on multiprocessor architectures. Although recent studies use FPGA technology to tackle the memory wall problem of graph computation by adopting a massively multi-threaded architecture, performance still falls far short of optimal memory performance due to long memory access latencies. In this paper, we propose a comprehensive reconfigurable computing approach to address the memory wall problem. First, we present an extended edge-streaming model with massive partitions that provides better load balance while exploiting the streaming bandwidth of external memory when processing large graphs. Second, we propose a two-level shuffle network architecture that significantly reduces the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the external memory. Third, we introduce a compact storage design based on graph compression schemes, together with the corresponding encoding and decoding hardware, to reduce the data volume transferred between the processing engines and external memory. We validate the effectiveness of the proposed architecture by implementing three frequently used graph algorithms on an ML605 board, showing up to a 3.85× improvement in performance-to-bandwidth ratio over previously published FPGA-based implementations.

Introduction

Graphs are a widely used abstraction for expressing relationships between real-world data elements, as in web graphs, telecommunication networks [1], and task graphs [2]; mining such data can be cast as graph computation. Existing systems for executing graph computations mainly target general-purpose computers [3] or clusters of general-purpose computers [4], [5], [6]. However, they are usually inefficient at graph computation: the irregularity of memory accesses causes large numbers of cache misses, and the high ratio of data access to computation stalls the floating-point units.

The memory wall problem [7] is the key issue in graph computation. Although new memory devices such as Reduced Latency Dynamic RAMs (RLDRAM) have been developed to address it, their trade-off of memory capacity for lower access latency limits their use in large-scale graph computation.

Reconfigurable computing based on Field Programmable Gate Array (FPGA) technologies has become an attractive option for attacking the memory wall problem of graph computation, thanks to its massive parallel on-chip resources with flexible interconnect for fine-grained communication and its abundant I/O pins providing high off-chip memory bandwidth. Recent studies have leveraged these advantages to solve graph problems such as breadth-first search [8], [9], all-pairs shortest paths [10], [11], and sparse matrix-vector multiplication kernels [12] on FPGA-based platforms. Many of them adopt a massively multi-threaded architecture that issues multiple outstanding memory requests to the parallel memory banks of shared off-chip memory, with dedicated hardware support such as the Convey HC-1 [13]. Although such an architecture can keep the memory controller busy reading or writing in every clock cycle, a large gap from peak memory performance remains: the irregular memory access patterns of graph algorithms cause frequent page misses in DRAM, so the achieved throughput is degraded by the relatively long memory latencies.

In this paper, we address the memory wall problem by taking advantage of the sequential streaming bandwidth of external DRAM. Since natural graphs have a much larger edge set than vertex set, access to edges and updates dominates the processing cost. We therefore adopt an edge-streaming model motivated by X-Stream [14], which iterates over edges and updates rather than over vertices. In our design, we stream edges from external DRAM while making random accesses to the vertex set in on-chip SRAM, fully utilizing the external memory bandwidth in burst mode.
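As a software analogy (the names and simplifications here are ours, not the paper's hardware design), one superstep of such an edge-streaming model can be sketched as:

```python
# Edge-centric scatter-gather, a software sketch of the streaming model.
# Edges and updates are traversed strictly sequentially (the streamed data);
# random accesses are confined to the small vertex-state array (the on-chip
# SRAM analogue).

def edge_streaming_step(edges, state, scatter, gather):
    """One superstep: scatter over all edges, then gather all updates."""
    updates = []
    for src, dst in edges:              # sequential pass over the edge stream
        updates.append((dst, scatter(state[src])))
    for dst, upd in updates:            # sequential pass over the update stream
        state[dst] = gather(state[dst], upd)
    return state

# Example: propagate minimum labels (one step of connected components).
edges = [(0, 1), (1, 2), (2, 3)]
state = edge_streaming_step(edges, [0, 1, 2, 3],
                            scatter=lambda s: s,
                            gather=lambda old, upd: min(old, upd))
print(state)  # [0, 0, 1, 2]
```

The key property is that the large data structures (edges and updates) are only ever read or written as streams, which is what lets external DRAM run in burst mode.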

To support larger graphs whose vertex data cannot fit into on-chip memory, we extend the edge-streaming model with massive partitions. Instead of shuffling the intermediate update results directly into their destination processing engine (PE), as in our previous work [15] on small graphs, we write the intermediate updates back to slower storage (external DRAM) and read them again when they are needed by the execution unit. We further analyze several possible partition and workload assignment schemes to support this extension and propose an optimized architecture with fine-grained partitions that achieves a more balanced load across both PEs and on-chip memory banks.
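The partitioned extension can be sketched in the same software style (again an analogy; the interleaved partition function and buffer layout are our illustrative choices):

```python
def partitioned_step(num_parts, edges, state, scatter, gather):
    """One superstep over a partitioned vertex set."""
    part = lambda v: v % num_parts      # interleaved partition assignment
    # Scatter: stream all edges once, shuffling each update into a buffer
    # for its destination partition. In hardware these buffers live in
    # external DRAM, written and later re-read as sequential streams.
    buffers = [[] for _ in range(num_parts)]
    for src, dst in edges:
        buffers[part(dst)].append((dst, scatter(state[src])))
    # Gather: process one partition at a time, so only that partition's
    # vertex data needs to be resident on chip.
    for p in range(num_parts):
        for dst, upd in buffers[p]:
            state[dst] = gather(state[dst], upd)
    return state

# Same minimum-label propagation as before, now with two partitions.
state = partitioned_step(2, [(0, 1), (1, 2), (2, 3)], [0, 1, 2, 3],
                         lambda s: s, min)
print(state)  # [0, 0, 1, 2]
```

The result is identical to the unpartitioned superstep; the partitioning only bounds how much vertex state must be resident at once.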

Shuffling updates from all PEs into a much larger number of partitions poses a significant challenge for the shuffle network architecture: to preserve our design goal of exploiting sequential-access memory bandwidth, it requires more on-chip memory to buffer the shuffled updates. To address this problem, we propose a two-level shuffle network, in which each PE first shuffles its updates into a small number of macro-partitions, and dedicated shuffle engines then split these macro-partitions into the final partitions in parallel. The two-level shuffle network significantly reduces the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the external memory.
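A software sketch of the two-level idea (the buffer organization and partition function below are our illustrative assumptions): with P = M × S final partitions, the first level needs buffers for only M macro-partitions, and the second level splits each macro-partition into S final partitions.

```python
def two_level_shuffle(updates, num_macro, num_final_per_macro):
    """Shuffle (dst, value) updates into num_macro * num_final_per_macro
    final partitions in two passes, mimicking the two hardware levels."""
    num_final = num_macro * num_final_per_macro
    # Level 1: only num_macro buffers are needed here.
    macro = [[] for _ in range(num_macro)]
    for dst, val in updates:
        macro[(dst % num_final) // num_final_per_macro].append((dst, val))
    # Level 2: dedicated shuffle engines split each macro-partition
    # into its final partitions (done in parallel in hardware).
    final = [[] for _ in range(num_final)]
    for bucket in macro:
        for dst, val in bucket:
            final[dst % num_final].append((dst, val))
    return final

updates = [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (5, 'e')]
parts = two_level_shuffle(updates, num_macro=2, num_final_per_macro=2)
print(parts[1])  # [(1, 'b'), (5, 'e')] -- dsts 1 and 5 map to partition 1
```

A single-level shuffle into P final partitions would need P buffers per PE; the two-level scheme needs only M per PE plus S per shuffle engine, which is where the on-chip memory saving comes from.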

In addition to increasing the utilization of external memory bandwidth, we further ease the memory-bound bottleneck by reducing the demand for memory access. We introduce a compact storage design based on graph compression schemes to reduce the data volume transferred between the processing engines and external memory, and we propose encoding and decoding hardware that offloads the computation overhead of graph compression without breaking the streaming access pattern of external memory. With this compression scheme, data streams from external memory carry more information within the same memory bandwidth, which in turn increases system throughput in terms of edges processed per second.
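As a generic illustration of why compression raises effective edge bandwidth, a common graph compression scheme delta-encodes sorted neighbor IDs and packs the gaps into variable-length bytes (the format below is our example, not necessarily the codec used in this design):

```python
def encode_varint(n):
    """Variable-length byte encoding: 7 data bits per byte, MSB = continue."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_neighbors(sorted_ids):
    """Delta-encode a sorted neighbor list, then varint-pack the gaps."""
    out, prev = bytearray(), 0
    for v in sorted_ids:
        out += encode_varint(v - prev)   # small gaps fit in a single byte
        prev = v
    return bytes(out)

def decode_neighbors(data):
    """Inverse of encode_neighbors: unpack varints, undo the deltas."""
    ids, prev, n, shift = [], 0, 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += n
            ids.append(prev)
            n, shift = 0, 0
    return ids

neighbors = [1000, 1003, 1004, 1100]
packed = encode_neighbors(neighbors)
assert decode_neighbors(packed) == neighbors
print(len(packed))  # 5 bytes instead of 16 bytes of raw 32-bit ids
```

For the four neighbors above, 5 bytes are streamed instead of 16, so the same memory bandwidth delivers roughly three times as many edges.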

To verify our architecture, we evaluate three graph algorithms on six different graphs on an ML605 board. The major contributions of this work include:

  • The extension of the edge-streaming model with massive partitions for reconfigurable hardware acceleration of graph problems, which supports larger graphs while providing better load balance.

  • A two-level shuffle network architecture that significantly reduces the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the external memory.

  • A compact storage design based on graph compression schemes and the corresponding encoding and decoding hardware to reduce the data volume transferred between the processing engines and external memory.

  • Verification of our extended architecture on an ML605 system using three de facto benchmark algorithms, with a detailed comparison against state-of-the-art hardware implementations, showing up to 3.85× improvement in performance-to-bandwidth ratio.

The remainder of this paper is organized as follows. Section 2 introduces the background and motivation of our work. Section 3 presents our approach to addressing the memory wall problem, including the edge-streaming model with massive partitions, the two-level shuffle network architecture and the graph compression scheme. In Section 4, we verify our architecture with performance results. We review the related work in Section 5 and conclude in Section 6.

Section snippets

Preliminaries

Generally, a graph is abstracted as an ordered pair G=(V,E), comprising the set V of n vertices and the set E of m directed edges. An edge connecting vertices u and v is denoted e=(u,v), where u is the source and v is the destination; e is thus an outgoing edge of u and an inbound edge of v. For a given vertex u, Γ+(u) = {v ∈ V | (u,v) ∈ E} is the set of all outgoing neighbors of u, and its out-degree, denoted d+(u) = |Γ+(u)|, is the number
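These definitions translate directly into code; a minimal illustration (ours, for concreteness):

```python
# Out-neighbors and out-degree for a small directed graph G = (V, E),
# with edges stored as (source, destination) pairs.
E = [(0, 1), (0, 2), (1, 2), (2, 0)]

def out_neighbors(u, edges):
    """Gamma+(u): the set of destinations of u's outgoing edges."""
    return {v for (s, v) in edges if s == u}

def out_degree(u, edges):
    """d+(u) = |Gamma+(u)|, the number of outgoing edges of u."""
    return len(out_neighbors(u, edges))

print(out_neighbors(0, E))  # {1, 2}
print(out_degree(0, E))     # 2
```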

Our approach

To address the memory wall problem, merely keeping the external memory busy reading or writing is far from enough; more importantly, the external memory must operate at its optimum throughput. Since SDRAMs achieve much higher bandwidth under sequential memory accesses than under random accesses, we present an edge-streaming model and a corresponding distributed on-chip, shared off-chip memory architecture to take full advantage of this property.

Experiments

The overall edge-streaming architecture of our system applies to diverse popular graph algorithms; the only implementation difference is the functional design of the PE, since different algorithms apply different operations to the graph during execution. We evaluate our design using the following algorithms:

  • PageRank: PageRank [22] is a widely used algorithm to measure the relative importance of web pages by computing the ranking for every web page based on the
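For reference, one PageRank iteration in edge-centric form can be sketched as follows (a plain software analogy of the edge-streaming model, not the paper's PE implementation; d is the usual damping factor):

```python
def pagerank_step(num_v, edges, rank, out_deg, d=0.85):
    """One edge-centric PageRank iteration: scatter rank/out_degree along
    every edge in a single sequential pass, then apply damping."""
    acc = [0.0] * num_v
    for src, dst in edges:                    # sequential edge stream
        acc[dst] += rank[src] / out_deg[src]  # random access to vertex state
    return [(1 - d) / num_v + d * a for a in acc]

edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
out_deg = [1, 1, 2]                           # out-degree of each vertex
rank = [1 / 3] * 3
rank = pagerank_step(3, edges, rank, out_deg)
```

Note that this simple formulation conserves the total rank only when every vertex has at least one outgoing edge, as in the example above.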

Related work

As we enter the era of big data, where datasets grow much faster than the computation and memory capacity of commodity computers, efficient processing of large graph problems is receiving increasing attention in the HPC community. This has motivated a substantial body of previous work on the design and implementation of graph processing systems on different architectures, which we categorize as single-PC systems, distributed-PC systems and dedicated

Conclusion

In this paper, we have proposed a reconfigurable computing method for efficient graph computation. We address the memory wall problem by extending the edge-streaming model with massive partitions, which makes full use of the streaming bandwidth of external DRAM while achieving better load balance. We further proposed a two-level shuffle network architecture that significantly reduces the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the

Acknowledgment

This paper is sponsored in part by the National High Technology and Research Development Program of China (863 Program, 2015AA050204), the National Natural Science Foundation of China (61373032), and the National Research Foundation (NRF), Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme (R-706-000-101-281).

Xu Wang received the B.Eng. degree in electronics science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2005. He is currently pursuing the Ph.D. degree in computer science and technology from Shanghai Jiao Tong University, Shanghai, China. His current research interests include reconfigurable computing, computer architecture, machine learning and big data.

References (32)

  • H.F. Wedde et al.

    A comprehensive review of nature inspired routing algorithms for fixed telecommunication networks

    J. Syst. Archit.

    (2006)
  • X. Zhao et al.

    On the embeddability of random walk distances

    Proc. VLDB Endow.

    (2013)
  • J. Shun et al.

    Ligra: a lightweight graph processing framework for shared memory

    Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    (2013)
  • K. Jeon et al.

    Large graph processing based on remote memory system

    Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC)

    (2010)
  • J.E. Gonzalez et al.

    Powergraph: distributed graph-parallel computation on natural graphs

    Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation

    (2012)
  • J.E. Gonzalez et al.

    Graphx: graph processing in a distributed dataflow framework

    Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI)

    (2014)
  • W.A. Wulf et al.

    Hitting the memory wall: implications of the obvious

    ACM SIGARCH Comput. Archit. News

    (1995)
  • B. Betkaoui et al.

    A reconfigurable computing approach for efficient and scalable parallel graph exploration

    Proceedings of the 23rd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP)

    (2012)
  • O.G. Attia et al.

    Cygraph: A reconfigurable architecture for parallel breadth-first search

    Proceedings of the IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)

    (2014)
  • U. Bondhugula et al.

    Parallel FPGA-based all-pairs shortest-paths in a directed graph

    Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS)

    (2006)
  • B. Betkaoui et al.

    Parallel FPGA-based all pairs shortest paths for sparse networks: a human brain connectome case study

    Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL)

    (2012)
  • J. Fowers et al.

    A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication

    Proceedings of the 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

    (2014)
  • J.D. Bakos

    High-performance heterogeneous computing with the convey HC-1

    Comput. Sci. Eng.

    (2010)
  • A. Roy et al.

    X-stream: Edge-centric graph processing using streaming partitions

    Proceedings of the 24th ACM Symposium on Operating Systems Principles

    (2013)
  • X. Wang et al.

    Addressing memory wall problem of graph computation in reconfigurable system

    Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications (HPCC)

    (2015)
  • A. Apostolico et al.

    Graph compression by BFS

    Algorithms

    (2009)

    Yongxin Zhu is an Associate Professor with the School of Microelectronics, Shanghai Jiao Tong University, China. He has been also a visiting Associate Professor with National University of Singapore since 2013. He is a senior member of China Computer Federation and a senior member of IEEE. He received his B.Eng. in EE from Hefei University of Technology, and M. Eng. in CS from Shanghai Jiao Tong University in 1991 and 1994 respectively. He received his Ph.D. in CS from National University of Singapore in 2001. His research interest is in computer architectures, embedded systems, medical electronics and multimedia. He has authored and co-authored over 90 English journal and conference papers and 30 Chinese journal papers. He has 18 Chinese patents approved.

    Linan Huang received the B.Eng. degree in Microelectronics from Shanghai Jiao Tong University in 2013. She is currently a Master degree Candidate in Integrated Circuits Engineering at Shanghai Jiao Tong University. Her current research interests include High-performance Accelerator Architecture and Reconfigurable Computing.
