skip to main content
10.1145/2688500.2688507acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

NUMA-aware graph-structured analytics

Published: 24 January 2015 Publication History

Abstract

Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph-analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on them, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.

References

[1]
The 9th dimacs implementation challenge - shortest paths. http://www.dis.uniroma1.it/challenge9/.
[2]
Graph 500. http://www.graph500.org.
[3]
numactl. http://oss.sgi.com/projects/libnuma/.
[4]
L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.
[5]
A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new os architecture for scalable multicore systems. In SOSP, 2009.
[6]
S. Beamer, K. Asanovi´c, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3):137–148, 2013.
[7]
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998.
[8]
I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. J. Marathe, and N. Shavit. Numa-aware reader-writer locks. In PPoPP, 2013.
[9]
D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, volume 4, pages 442–446. SIAM, 2004.
[10]
R. Chen and H. Chen. Tiled-mapreduce: Efficient and flexible mapreduce processing on multicore with tiling. ACM TACO, 10, 2013.
[11]
R. Chen, H. Chen, and B. Zang. Tiled-mapreduce: optimizing resource usages of data-parallel applications on multicore with tiling. In PACT, 2010.
[12]
R. Chen, J. Shi, Y. Chen, H. Guan, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report 2013-11-001, IPADS, SJTU, 2013.
[13]
R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, 2014.
[14]
R. Chen, J. Shi, B. Zang, and H. Guan. Bipartite-oriented distributed graph partitioning for big learning. In APSys, 2014.
[15]
R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, pages 85–98, 2012.
[16]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT press Cambridge, 2001.
[17]
M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: a holistic approach to memory placement on numa systems. In ASPLOS, 2013.
[18]
T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, 2013.
[19]
D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In PPoPP, pages 247–256, 2012.
[20]
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999.
[21]
F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, V. Quema, and I. Grenoble. Large pages may be harmful on numa systems. In USENIX ATC, 2014.
[22]
J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[23]
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014.
[24]
W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys, 2014.
[25]
U. Kang, D. Horng, et al. Inference of beliefs on billion-scale graphs. In SIGKDD-LDMTA, 2010.
[26]
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013.
[27]
H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.
[28]
A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012.
[29]
R. Lachaize, B. Lepers, V. Quéma, et al. Memprof: A memory profiler for numa multicore systems. In USENIX ATC, 2012.
[30]
X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on numa architectures. In PPoPP, 2014.
[31]
Y. Liu, B. Wu, H. Wang, and P. Ma. Bpgm: A big graph mining tool. Tsinghua Science and Technology, 19(1):33–38, 2014.
[32]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment, 5(8):716–727, 2012.
[33]
A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.
[34]
Z. Majo and T. R. Gross. Memory management in numa multicore systems: trapped between cache contention and interconnect overhead. In ISMM, 2011.
[35]
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
[36]
J. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, 1991.
[37]
U. Meyer and P. Sanders. δ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003.
[38]
D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439–455. ACM, 2013.
[39]
D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013.
[40]
B. Panda, J. Herbach, S. Basu, and R. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. In VLDB, 2009.
[41]
V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In Usenix ATC, 2012.
[42]
A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In SOSP, 2013.
[43]
J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In PPoPP, 2013.
[44]
A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010.
[45]
P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan. Replication-based fault-tolerance for large-scale graph processing. In DSN, 2014.
[46]
C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP, 2015.
[47]
X. Zhao, A. Chang, A. D. Sarma, H. Zheng, and B. Y. Zhao. On the embeddability of random walk distances. In VLDB, 2013.
[48]
J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. TPDS, 2013.
[49]
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.

Cited By

View all
  • (2025)Nezha: An Efficient Distributed Graph Processing System on Heterogeneous HardwareProceedings of the ACM on Management of Data10.1145/37097073:1(1-27)Online publication date: 11-Feb-2025
  • (2024)CAVE: Concurrency-Aware Graph Processing on SSDsProceedings of the ACM on Management of Data10.1145/36549282:3(1-26)Online publication date: 30-May-2024
  • (2024)ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA ChipACM Transactions on Reconfigurable Technology and Systems10.1145/365003717:2(1-39)Online publication date: 29-Feb-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
ISBN:9781450332057
DOI:10.1145/2688500
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 50, Issue 8
    PPoPP '15
    August 2015
    290 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2858788
    • Editor:
    • Andy Gill
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2015

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Graph-structured Analytics
  2. Non-uniform Memory Access (NUMA)

Qualifiers

  • Research-article

Conference

PPoPP '15
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)68
  • Downloads (Last 6 weeks)9
Reflects downloads up to 15 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Nezha: An Efficient Distributed Graph Processing System on Heterogeneous HardwareProceedings of the ACM on Management of Data10.1145/37097073:1(1-27)Online publication date: 11-Feb-2025
  • (2024)CAVE: Concurrency-Aware Graph Processing on SSDsProceedings of the ACM on Management of Data10.1145/36549282:3(1-26)Online publication date: 30-May-2024
  • (2024)ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA ChipACM Transactions on Reconfigurable Technology and Systems10.1145/365003717:2(1-39)Online publication date: 29-Feb-2024
  • (2024)WASP: Workload-Aware Self-Replicating Page-Tables for NUMA ServersProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640369(1233-1249)Online publication date: 27-Apr-2024
  • (2024)TgStore: An Efficient Storage System for Large Time-Evolving GraphsIEEE Transactions on Big Data10.1109/TBDATA.2024.336608710:2(158-173)Online publication date: Apr-2024
  • (2023)A Scalable Data Structure for Efficient Graph Analytics and In-Place MutationsData10.3390/data81101668:11(166)Online publication date: 3-Nov-2023
  • (2023)DArray: A High Performance RDMA-Based Distributed ArrayProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605608(715-724)Online publication date: 7-Aug-2023
  • (2023)Connectivity-Aware Link Analysis for Skewed GraphsProceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605579(482-491)Online publication date: 7-Aug-2023
  • (2023)iQANProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577527(313-328)Online publication date: 25-Feb-2023
  • (2023)An Unequal Caching Strategy for Shared-Memory Graph AnalyticsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.321888534:3(955-967)Online publication date: 1-Mar-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media