Abstract
This paper presents a joint study of application and architecture aimed at improving the performance and scalability of an irregular application, the computation of betweenness centrality, on the IBM Cyclops64 many-core architecture. Betweenness centrality is characterized by unstructured parallelism, dynamically non-contiguous memory access, and low arithmetic intensity, all of which hinder an efficient mapping of parallel algorithms onto such many-core architectures. By identifying several key architectural features, we propose and evaluate efficient strategies for achieving scalability on a massively multithreaded many-core architecture. We demonstrate several optimization strategies, including multi-grain parallelism, just-in-time locality with an explicit memory hierarchy and non-preemptive thread execution, and fine-grain data synchronization. Compared with a conventional parallel algorithm, our approach achieves a 4X to 50X improvement in performance and a 16X improvement in scalability on a 128-core IBM Cyclops64 simulator.
Cite this article
Tan, G., Sreedhar, V.C. & Gao, G.R. Analysis and performance results of computing betweenness centrality on IBM Cyclops64. J Supercomput 56, 1–24 (2011). https://doi.org/10.1007/s11227-009-0339-9
DOI: https://doi.org/10.1007/s11227-009-0339-9