Analysis and performance results of computing betweenness centrality on IBM Cyclops64

Tan, Guangming; Sreedhar, Vugranam C.; Gao, Guang R.

doi:10.1007/s11227-009-0339-9

Published: 13 November 2009

The Journal of Supercomputing, Volume 56, pages 1–24 (2011)

Guangming Tan¹,², Vugranam C. Sreedhar³ & Guang R. Gao²

Abstract

This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application, computing betweenness centrality, on the IBM Cyclops64 many-core architecture. Betweenness centrality exhibits unstructured parallelism, dynamically non-contiguous memory access, and low arithmetic intensity, all of which hinder an efficient mapping of parallel algorithms onto such many-core architectures. By identifying several key architectural features, we propose and evaluate efficient strategies for achieving scalability on a massively multi-threaded many-core architecture. We demonstrate several optimization strategies, including multi-grain parallelism, just-in-time locality with explicit memory hierarchy and non-preemptive thread execution, and fine-grain data synchronization. Compared with a conventional parallel algorithm, we obtain a 4X-50X improvement in performance and a 16X improvement in scalability on a 128-core IBM Cyclops64 simulator.
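To make the workload concrete, the following is a minimal sequential sketch of betweenness centrality on an unweighted graph, in the style of the well-known algorithm by Brandes (2001). It is not the parallel Cyclops64 implementation evaluated in the paper; the function name `betweenness_centrality` and the adjacency-dictionary input format are assumptions chosen for illustration. The inner loop over `adj[v]` illustrates the data-dependent, non-contiguous memory accesses and the low arithmetic intensity mentioned in the abstract.

```python
from collections import deque


def betweenness_centrality(adj):
    """Betweenness centrality for an unweighted graph.

    adj: dict mapping each vertex to a list of its neighbours
         (hypothetical input format, chosen for this sketch).
    Returns a dict of scores counted over ordered (source, target) pairs.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []  # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:  # irregular, data-dependent accesses
                if dist[w] < 0:  # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

For the path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}`, this sketch returns 4.0 for vertices 1 and 2 and 0.0 for the endpoints, because each undirected pair is counted once per direction. Halving the scores gives the usual undirected values.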

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Guangming Tan
Computer Architecture and Parallel Systems Laboratory, University of Delaware, Newark, USA
Guangming Tan & Guang R. Gao
IBM T. J. Watson Research Center, Cambridge, USA
Vugranam C. Sreedhar

Corresponding author

Correspondence to Guangming Tan.

About this article

Cite this article

Tan, G., Sreedhar, V.C. & Gao, G.R. Analysis and performance results of computing betweenness centrality on IBM Cyclops64. J Supercomput 56, 1–24 (2011). https://doi.org/10.1007/s11227-009-0339-9

Issue Date: April 2011
