Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Yan, Jie; Tan, Guangming; Sun, Ninghui

doi:10.1007/s11227-014-1239-1

Published: 26 June 2014

Volume 69, pages 1462–1490, (2014)

The Journal of Supercomputing

Jie Yan¹,², Guangming Tan¹ & Ninghui Sun¹

Abstract

Traversal is a fundamental procedure in most parallel graph algorithms. Exploiting the massive fine-grained parallelism in graph traversal requires fine-grained data synchronization. On commodity multi-core processors, the widely adopted solution is fine-grained locking (i.e., one lock per vertex). However, in emerging analytics of massive irregular graphs (e.g., social networks and web graphs), this approach suffers from high memory cost and poor locality because of the large vertex set and inherently random vertex access. In this paper, we propose a novel fine-grained lock mechanism, lock virtualization (vLock). The key idea is to map the huge logical lock space to a small, fixed physical lock space that can reside in cache at runtime. The virtualization mechanism effectively reduces the extra memory cost and cache misses incurred by locks, with only a slight increase in the lock conflict rate, and it preserves high portability for legacy code by providing a Pthreads-like application programming interface. Our further analysis reveals that, given the random access pattern, the lock conflict rate no longer depends on the size of the vertex set but only on the number of physical locks and the number of parallel threads; thus vLock is independent of graph topology. This paper presents a complete description of the vLock method as well as its theoretical foundation. We implemented a vLock library and evaluated its performance on four classic graph traversal algorithms (BFS, SSSP, CC, and PageRank). Experiments on an Intel Xeon E5 eight-core processor show that, compared to Pthreads fine-grained locks, vLock significantly reduces lock-related cache misses and achieves a 4–20% performance improvement.
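The mapping from a large logical lock space to a small physical one can be illustrated with a minimal C sketch. This is not the authors' library: the names `vlock_init`, `vlock_slot`, `vlock_lock`, `vlock_trylock`, `vlock_unlock`, and `relax`, the table size of 4096, and the mask-based hash are all assumptions chosen for illustration. Only the underlying Pthreads mutex calls are real.

```c
#include <assert.h>   /* static_assert (C11) */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of physical locks; small enough to stay cache-resident. */
#define VLOCK_NUM_PHYS 4096u

/* The mask-based hash below only works for a power-of-two table size. */
static_assert((VLOCK_NUM_PHYS & (VLOCK_NUM_PHYS - 1u)) == 0,
              "VLOCK_NUM_PHYS must be a power of two");

/* The fixed physical lock space shared by all logical (per-vertex) locks. */
static pthread_mutex_t vlock_table[VLOCK_NUM_PHYS];

/* Initialise every physical lock; returns 0 on success, an errno value otherwise. */
static int vlock_init(void) {
    for (size_t i = 0; i < VLOCK_NUM_PHYS; i++) {
        int rc = pthread_mutex_init(&vlock_table[i], NULL);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Map a logical lock (one per vertex id) onto a physical lock slot. */
static size_t vlock_slot(uint64_t vertex_id) {
    return (size_t)(vertex_id & (VLOCK_NUM_PHYS - 1u));
}

/* Pthreads-like wrappers: same return conventions as the mutex calls they forward to. */
static int vlock_lock(uint64_t v)    { return pthread_mutex_lock(&vlock_table[vlock_slot(v)]); }
static int vlock_trylock(uint64_t v) { return pthread_mutex_trylock(&vlock_table[vlock_slot(v)]); }
static int vlock_unlock(uint64_t v)  { return pthread_mutex_unlock(&vlock_table[vlock_slot(v)]); }

/* Example critical section: an SSSP-style relaxation of a single vertex. */
static void relax(uint32_t *dist, uint64_t v, uint32_t candidate) {
    vlock_lock(v);
    if (candidate < dist[v]) dist[v] = candidate;
    vlock_unlock(v);
}
```

In this sketch the lock storage is 4096 mutexes regardless of how many vertices the graph has, which is the memory and locality argument made above. The cost is that two different vertices can share a physical lock, so a thread may occasionally wait on a lock that protects an unrelated vertex.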

Notes

Similarly, $\mathbb{O}$ can be associated with edges and uniquely indexed by edge $id$. We do not consider this case in this paper.
Note that the side effect of the primitive $trylock$ differs from that in traditional fine-grained locks. With traditional fine-grained locks, a failure return implies that some other thread is operating on $v$; in vLock it does not.
This sometimes depends on the compiler. The Intel C Compiler can automatically reorder instructions to hide the latency of $DIV$ well, while the GNU C Compiler cannot.
In the analysis of Sects. 4.2–4.4, we assume that random lock access is also uniform. In practice, this assumption may not hold for virtual locks, although it typically holds for physical locks; this should be taken into account when a more accurate analysis is required.
For BFS and SSSP, the normalized performance of each run is calculated first and then used to compute the harmonic mean and deviation. For CC and PageRank, however, the mean and standard deviation of the runtime over all 16 runs are calculated first and then normalized.
The size of the physical lock space should be a prime number for hashing by address and a power of 2 for hashing by vertex.
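The notes on $trylock$ and on the size of the physical lock space can be made concrete with a small C sketch of the two hashing schemes. The function names, the table sizes 1024 and 1021, and the helper `vertices_alias` are assumptions for illustration, not part of the paper. The rationale in the comments (aligned addresses favouring a prime modulus) is one plausible reading and is not stated in the notes themselves.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table sizes, chosen only for illustration. */
#define PHYS_LOCKS_POW2  1024u  /* power of 2, for hashing by vertex id */
#define PHYS_LOCKS_PRIME 1021u  /* prime, for hashing by address */

static_assert((PHYS_LOCKS_POW2 & (PHYS_LOCKS_POW2 - 1u)) == 0,
              "PHYS_LOCKS_POW2 must be a power of two");

/* Hash by vertex: ids are integers, so the slot is a single AND with a mask. */
static size_t slot_by_vertex(uint64_t vertex_id) {
    return (size_t)(vertex_id & (PHYS_LOCKS_POW2 - 1u));
}

/* Hash by address: object addresses are usually aligned, so their low bits
 * repeat. Assumed rationale: a prime modulus spreads such addresses over the
 * table, at the price of an integer division per lock operation. */
static size_t slot_by_address(const void *obj) {
    return (size_t)((uintptr_t)obj % PHYS_LOCKS_PRIME);
}

/* Returns 1 when two distinct vertices share one physical lock. In that case
 * a trylock on u can fail while another thread holds the lock for v, even
 * though no thread is operating on u. */
static int vertices_alias(uint64_t u, uint64_t v) {
    return u != v && slot_by_vertex(u) == slot_by_vertex(v);
}
```

For example, with these sizes vertices 7 and 1031 alias to slot 7, so a failed $trylock$ on vertex 7 says only that the shared physical lock is busy. Callers that treat failure as "someone else is handling this vertex" would need to account for that.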

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jie Yan, Guangming Tan & Ninghui Sun
University of Chinese Academy of Sciences, Beijing, China
Jie Yan

Corresponding author

Correspondence to Jie Yan.

Electronic supplementary material

Supplementary material 1 (pdf 64 KB)

About this article

Cite this article

Yan, J., Tan, G. & Sun, N. Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture. J Supercomput 69, 1462–1490 (2014). https://doi.org/10.1007/s11227-014-1239-1

Issue Date: September 2014
DOI: https://doi.org/10.1007/s11227-014-1239-1
