Abstract
Traversal is a fundamental procedure in most parallel graph algorithms. Exploiting the massive fine-grained parallelism in graph traversal requires fine-grained data synchronization. On commodity multi-core processors, the widely adopted solution is fine-grained locking (i.e., one lock per vertex). However, in emerging analytics of massive irregular graphs (e.g., social networks and web graphs), this approach suffers from high memory cost and poor locality because of the large vertex set and inherently random vertex access. In this paper, we propose a novel fine-grained lock mechanism, lock virtualization (vLock). The key idea is to map the huge logical lock space to a small, fixed physical lock space that can reside in cache at runtime. The virtualization mechanism effectively reduces the extra memory cost and cache misses incurred by locks with only a slight increase in the lock conflict rate, while preserving high portability for legacy code by providing a Pthreads-like application programming interface. Our further analysis reveals that, given the random access pattern, the lock conflict rate no longer depends on the size of the vertex set but only on the number of physical locks and the number of parallel threads; thus vLock is independent of graph topology. This paper presents a complete description of the vLock method as well as its theoretical foundation. We implemented a vLock library and evaluated its performance in four classic graph traversal algorithms (BFS, SSSP, CC, PageRank). Experiments on an Intel Xeon E5 eight-core processor show that, compared to Pthreads fine-grained locks, vLock significantly reduces lock-related cache misses and achieves a 4–20% performance improvement.
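The mapping from a large logical lock space to a small physical one can be illustrated with a minimal sketch in C on top of Pthreads mutexes. All names here (`vlock_table_t`, `vlock_slot`, `vlock_lock`, and so on) and the table size of 1024 are assumptions made for illustration; they are not the interface or parameters of the authors' library, and the real implementation may use a different lock primitive and hash.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes; NOT the paper's actual library API. */
#define VLOCK_NUM_PHYS 1024u /* assumed physical lock count (power of 2) */

typedef struct {
    /* Fixed-size pool shared by all vertices, however many there are. */
    pthread_mutex_t phys[VLOCK_NUM_PHYS];
} vlock_table_t;

/* Map a logical lock (identified by vertex id) to a physical slot. */
static inline size_t vlock_slot(uint64_t vertex_id) {
    return (size_t)(vertex_id & (VLOCK_NUM_PHYS - 1u));
}

/* Initialize every physical lock; returns 0 or a Pthreads error code. */
static int vlock_table_init(vlock_table_t *t) {
    assert(t != NULL);
    for (size_t i = 0; i < VLOCK_NUM_PHYS; i++) {
        int rc = pthread_mutex_init(&t->phys[i], NULL);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Pthreads-like operations keyed by vertex id instead of a mutex pointer. */
static int vlock_lock(vlock_table_t *t, uint64_t v) {
    return pthread_mutex_lock(&t->phys[vlock_slot(v)]);
}

static int vlock_trylock(vlock_table_t *t, uint64_t v) {
    return pthread_mutex_trylock(&t->phys[vlock_slot(v)]);
}

static int vlock_unlock(vlock_table_t *t, uint64_t v) {
    return pthread_mutex_unlock(&t->phys[vlock_slot(v)]);
}

/* Release the physical locks once no thread uses the table any more. */
static void vlock_table_destroy(vlock_table_t *t) {
    for (size_t i = 0; i < VLOCK_NUM_PHYS; i++) {
        pthread_mutex_destroy(&t->phys[i]);
    }
}
```

In this sketch the memory used for locks is fixed by `VLOCK_NUM_PHYS` and does not grow with the number of vertices. The price is that two different vertices whose ids map to the same slot contend for one physical lock, which is the "slight increase in the lock conflict rate" the abstract refers to.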

Notes
Similarly, \(\mathbb {O}\) can be associated with edges and uniquely indexed by edge \(id\). In this paper, we do not consider this case.
Note that the side effect of the \(trylock\) primitive differs from that in traditional fine-grained locks. With traditional fine-grained locks, a failure return implies that some other thread is operating on \(v\); with vLock it does not, because distinct vertices may map to the same physical lock.
This sometimes depends on compiler techniques. The Intel C Compiler can automatically reorder instructions to hide the latency of \(DIV\) well, whereas the GNU C Compiler cannot.
For BFS and SSSP, the normalized performance of each run is calculated first and then used to compute the harmonic mean and deviation. For CC and PageRank, however, the mean and standard deviation of the runtime over all 16 runs are calculated first and then normalized.
The size of the physical lock space should be a prime number for hash by address and a power of 2 for hash by vertex.
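The sizing rule in the last note can be illustrated with a hedged sketch of the two hash styles. The function names, the sizes 1021 and 1024, and the use of a modulo for addresses and a bit mask for vertex ids are all assumptions for illustration, not the paper's code. The likely rationale is that object addresses are aligned (multiples of 8, 64, and so on), so a power-of-2 table would leave many slots unused, whereas a prime modulus spreads them; vertex ids are dense integers, so a cheap mask suffices and avoids a division.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes chosen for illustration only. */
#define PHYS_LOCKS_PRIME 1021u /* prime: assumed size for hash by address */
#define PHYS_LOCKS_POW2  1024u /* power of 2: assumed size for hash by vertex */

/* Hash by address: modulo a prime, so aligned addresses still spread
 * over all slots. The modulo compiles to a division instruction. */
static inline size_t hash_by_address(const void *obj) {
    return (size_t)((uintptr_t)obj % PHYS_LOCKS_PRIME);
}

/* Hash by vertex: mask the low bits of the id; no division is needed. */
static inline size_t hash_by_vertex(uint64_t vertex_id) {
    /* The mask is only a valid modulo when the size is a power of 2. */
    assert((PHYS_LOCKS_POW2 & (PHYS_LOCKS_POW2 - 1u)) == 0);
    return (size_t)(vertex_id & (PHYS_LOCKS_POW2 - 1u));
}

/* Count how many distinct slots n addresses spaced `stride` bytes apart
 * occupy under a given table size; used to compare the two sizes. */
static size_t distinct_slots(size_t n, size_t stride, size_t table_size) {
    unsigned char used[1024] = {0}; /* large enough for both sizes above */
    size_t count = 0;
    assert(table_size <= sizeof used);
    for (size_t i = 0; i < n; i++) {
        size_t slot = (i * stride) % table_size;
        if (!used[slot]) {
            used[slot] = 1;
            count++;
        }
    }
    return count;
}
```

For example, 1021 addresses at a 64-byte stride occupy only 16 of 1024 slots under the power-of-2 size but all 1021 slots under the prime size, which is why the choice of table size depends on what is being hashed.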
Cite this article
Yan, J., Tan, G. & Sun, N. Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture. J Supercomput 69, 1462–1490 (2014). https://doi.org/10.1007/s11227-014-1239-1
DOI: https://doi.org/10.1007/s11227-014-1239-1