Article

Landing OpenMP on Cyclops-64: an efficient mapping of OpenMP to a many-core system-on-a-chip

Authors:

Juan del Cuvillo,

Guang Gao

CF '06: Proceedings of the 3rd conference on Computing frontiers

Pages 41 - 50

https://doi.org/10.1145/1128022.1128030

Published: 03 May 2006

Abstract

This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. Such a unique architecture presents new opportunities for optimization. Specifically, we consider the following three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations result in an 80% reduction in the overhead of OpenMP language constructs. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
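The abstract does not spell out the spin lock algorithm itself. As a rough illustration of the general idea of a spin lock built on an atomic fetch-and-add, the following is a minimal sketch of a generic ticket lock written with portable C11 atomics. It is not the C64 algorithm from the paper, it does not use C64 in-memory atomic instructions or scratchpad placement, and every name in it (`ticket_lock_t`, `ticket_lock_acquire`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical generic ticket lock; NOT the spin lock from the paper. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket number currently allowed to enter */
} ticket_lock_t;

static inline void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Take a ticket with one atomic fetch-and-add, then spin until it is served. */
static inline void ticket_lock_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        /* busy-wait; a real runtime might pause or sleep the thread here */
    }
}

/* Take a ticket only if it would be served immediately (lock is free). */
static inline bool ticket_lock_try_acquire(ticket_lock_t *l) {
    unsigned serving = atomic_load_explicit(&l->now_serving,
                                            memory_order_acquire);
    unsigned expected = serving;
    return atomic_compare_exchange_strong_explicit(
        &l->next_ticket, &expected, serving + 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Pass the lock to the holder of the next ticket. */
static inline void ticket_lock_release(ticket_lock_t *l) {
    unsigned cur = atomic_load_explicit(&l->now_serving,
                                        memory_order_relaxed);
    /* Debug check: releasing a lock that nobody holds is a caller bug. */
    assert(atomic_load_explicit(&l->next_ticket,
                                memory_order_relaxed) != cur);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

A ticket lock is used here only because it needs a single fetch-and-add per acquisition and grants the lock in FIFO order; whether the paper's algorithm shares these properties is not stated in the abstract.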

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Published In

CF '06: Proceedings of the 3rd conference on Computing frontiers

May 2006

430 pages

ISBN:1595933026

DOI:10.1145/1128022

General Chairs: Monica Alderighi (IASF - INAF), Valentina Salapura (IBM)

Program Chair: Sally A. McKee (Cornell University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CF06

Sponsor:

CF06: Computing Frontiers Conference

May 3 - 5, 2006

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%
