
Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Published in The Journal of Supercomputing.


Abstract

Traversal is a fundamental procedure in most parallel graph algorithms. To exploit the massive fine-grained parallelism in graph traversal, fine-grained data synchronization is critical. On commodity multi-core processors, the widely adopted solution is fine-grained locking (i.e., one lock per vertex). However, in emerging analytics of massive irregular graphs (e.g., social networks and web graphs), this approach suffers from huge memory costs and poor locality due to the large-scale vertex set and the inherently random vertex accesses. In this paper, we propose a novel fine-grained lock mechanism, lock virtualization (vLock). The key idea is to map the huge logical lock space to a small, fixed physical lock space that can reside in cache at runtime. The virtualization mechanism effectively reduces lock-incurred memory cost and cache misses at the price of a slight increase in the lock conflict rate, while preserving high portability for legacy code by providing a Pthreads-like application programming interface. Our analysis further reveals that, given the random access pattern, the lock conflict rate no longer depends on the size of the vertex set but only on the numbers of physical locks and parallel threads; vLock is thus independent of graph topology. This paper presents a complete description of the vLock method as well as its theoretical foundation. We implemented a vLock library and evaluated its performance on four classic graph traversal algorithms (BFS, SSSP, CC, PageRank). Experiments on an eight-core Intel Xeon E5 processor show that, compared to Pthreads fine-grained locks, vLock significantly reduces the locks' cache misses and achieves a 4–20 % performance improvement.




Notes

  1. Similarly, \(\mathbb {O}\) can be associated with edges and uniquely indexed by edge \(id\). In this paper, we do not consider this case.

  2. Note that the side effect of the primitive \(trylock\) differs from that of traditional fine-grained locks: with traditional fine-grained locks, returning failure implies that some other thread is operating on \(v\), whereas in vLock it does not (the failure may stem from a collision on the shared physical lock).

  3. This sometimes depends on compiler techniques. The Intel C Compiler can automatically reorder instructions to hide the latency of \(DIV\) well, while the GNU C Compiler cannot.

  4. In the analysis of Sects. 4.2–4.4, we further assume that the random lock accesses are uniform. In practice, this assumption may not hold for virtual locks, although it typically holds for physical locks; cases requiring more accurate analysis should take this into account.

  5. For BFS and SSSP, the normalized performance of each run is first calculated and then used to compute the harmonic mean and deviation. For CC and PageRank, however, the mean and standard deviation of the runtime over all 16 runs are first calculated and then normalized.

  6. The size of the physical lock space should be a prime number for hashing by address and a power of 2 for hashing by vertex.


Author information

Correspondence to Jie Yan.

Electronic supplementary material


Supplementary material 1 (pdf 64 KB)

About this article


Cite this article

Yan, J., Tan, G. & Sun, N. Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture. J Supercomput 69, 1462–1490 (2014). https://doi.org/10.1007/s11227-014-1239-1
