
Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a Many-Core System-on-a-Chip

Published: 03 May 2006

Abstract

This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. Such a unique architecture presents new opportunities for optimization. Specifically, we consider the following three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations result in an 80% overhead reduction for language constructs in OpenMP. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
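
As a rough illustration of the kind of primitive that item (2) refers to, the sketch below shows a generic ticket spin lock built on atomic fetch-and-add, guarding a counter inside an OpenMP loop. It is not the paper's algorithm: the C64 in-memory atomic instructions and its thread-level hardware support are not modeled here, and the names `ticket_lock_t`, `ticket_lock_acquire`, `ticket_lock_release`, and `count_with_lock` are hypothetical. The sketch assumes a C11 compiler that provides `<stdatomic.h>`; if it is built without OpenMP support, the pragma is ignored and the loop simply runs serially.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock; a generic technique, not the C64 spin lock. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket number currently allowed to enter */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* One atomic fetch-and-add assigns this thread a unique ticket. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1u,
                                            memory_order_relaxed);
    /* Spin with plain loads until our ticket is being served. */
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        /* busy-wait */
    }
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Pass the lock to the next ticket holder. */
    atomic_fetch_add_explicit(&l->now_serving, 1u, memory_order_release);
}

/* Increment a shared counter `iterations` times under the lock. */
long count_with_lock(int iterations) {
    ticket_lock_t lock;
    long counter = 0;
    ticket_lock_init(&lock);

    /* lock and counter are shared: both are declared outside the region. */
    #pragma omp parallel for
    for (int i = 0; i < iterations; i++) {
        ticket_lock_acquire(&lock);
        counter++;               /* critical section */
        ticket_lock_release(&lock);
    }
    return counter;
}
```

Calling `count_with_lock(n)` should return `n` whatever the thread count, since every increment happens inside the critical section. A ticket lock is used here only because it needs a single fetch-and-add per acquisition and serves waiters in arrival order; how the paper's lock actually exploits in-memory atomics is not described in the abstract.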




    Published In

    CF '06: Proceedings of the 3rd conference on Computing frontiers
    May 2006
    430 pages
    ISBN:1595933026
    DOI:10.1145/1128022

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. chip multiprocessor
    2. OpenMP
    3. performance evaluation
    4. run-time system
    5. system-on-a-chip

    Qualifiers

    • Article

    Conference

    CF06
    CF06: Computing Frontiers Conference
    May 3 - 5, 2006
    Ischia, Italy

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%



    Cited By

    • (2016) The Design and Implementation of TIDeFlow. International Journal of Parallel Programming 44:2 (278-307). DOI: 10.1007/s10766-015-0373-6. Online publication date: 1-Apr-2016
    • (2013) Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (1597-1606). DOI: 10.1109/IPDPSW.2013.18. Online publication date: 20-May-2013
    • (2012) CHOMP. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (232-239). DOI: 10.1109/SC.Companion.2012.39. Online publication date: 10-Nov-2012
    • (2011) Analysis and performance results of computing betweenness centrality on IBM Cyclops64. The Journal of Supercomputing 56:1 (1-24). DOI: 10.1007/s11227-009-0339-9. Online publication date: 1-Apr-2011
    • (2010) Research on Efficiency of Signal Processing on Embedded Multicore System. Proceedings of the 2010 First International Conference on Pervasive Computing, Signal Processing and Applications (907-911). DOI: 10.1109/PCSPA.2010.224. Online publication date: 17-Sep-2010
    • (2010) CABSim: A cycle-accurate array processor simulation environment for digital radio astronomy. 2010 IEEE International Symposium on Phased Array Systems and Technology (680-685). DOI: 10.1109/ARRAY.2010.5613291. Online publication date: Oct-2010
    • (2009) Tile Percolation. Proceedings of the 15th International Euro-Par Conference on Parallel Processing (839-850). DOI: 10.1007/978-3-642-03869-3_78. Online publication date: 23-Aug-2009
    • (2008) Performance Characteristics of OpenMP Language Constructs on a Many-core-on-a-chip Architecture. OpenMP Shared Memory Parallel Programming (230-241). DOI: 10.1007/978-3-540-68555-5_19. Online publication date: 2008
    • (2006) A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture. Proceedings of the 20th international conference on Parallel and distributed processing (64-64). DOI: 10.5555/1898953.1898997. Online publication date: 25-Apr-2006
    • (2006) A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium (10 pp.). DOI: 10.1109/IPDPS.2006.1639301. Online publication date: 2006
