DOI: 10.1145/1454115.1454157
research-article

COMIC: a coherent shared memory interface for Cell BE

Published: 25 October 2008

Abstract

The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).
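
The access check described in the abstract can be pictured with a small simulation. What follows is a minimal sketch and not the COMIC runtime itself: the class name `LocalStoreCache`, the 16-byte line size, the first-in-first-out eviction, and the `flush` method are all assumptions made for illustration, and plain byte-array copies stand in for the DMA transfers an SPE would issue. It shows only the general idea that every read or write first checks whether the data is in the local store and fetches it on a miss.

```python
LINE_SIZE = 16  # bytes per cached line; an illustrative value, not taken from the paper


class LocalStoreCache:
    """Toy software-managed cache standing in for an SPE local store (hypothetical)."""

    def __init__(self, main_memory, num_lines=4):
        self.main = main_memory      # bytearray modelling main memory
        self.num_lines = num_lines   # capacity of the local store, in lines
        self.lines = {}              # line number -> local bytearray copy
        self.dirty = set()           # line numbers modified locally
        self.misses = 0              # number of simulated fetches from main memory

    def _write_back(self, line_no):
        # Copy a modified line back to main memory (a DMA "put" on real hardware).
        start = line_no * LINE_SIZE
        self.main[start:start + LINE_SIZE] = self.lines[line_no]
        self.dirty.discard(line_no)

    def _ensure_present(self, addr):
        # The check performed before every access: is the line in the local store?
        line_no = addr // LINE_SIZE
        if line_no not in self.lines:
            self.misses += 1
            if len(self.lines) >= self.num_lines:
                # Evict the line that was fetched earliest (dicts keep insertion order).
                victim = next(iter(self.lines))
                if victim in self.dirty:
                    self._write_back(victim)
                del self.lines[victim]
            start = line_no * LINE_SIZE
            # Fetch the missing line (a DMA "get" on real hardware).
            self.lines[line_no] = bytearray(self.main[start:start + LINE_SIZE])
        return line_no, addr % LINE_SIZE

    def read(self, addr):
        line_no, offset = self._ensure_present(addr)
        return self.lines[line_no][offset]

    def write(self, addr, value):
        line_no, offset = self._ensure_present(addr)
        self.lines[line_no][offset] = value
        self.dirty.add(line_no)

    def flush(self):
        # Write back every dirty line, for example at a synchronization point.
        for line_no in sorted(self.dirty):
            self._write_back(line_no)
```

In this sketch, a caller would create `main = bytearray(range(64))`, wrap it in `LocalStoreCache(main, num_lines=2)`, and go through `read` and `write` instead of indexing `main` directly. A write stays in the local copy until the line is evicted or `flush` is called, which is why some coherence action is needed before other processors can see the update. How COMIC actually orders and centralizes those actions in the PPE is defined by the paper's consistency model, not by this example.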


Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
October 2008
328 pages
ISBN:9781605582825
DOI:10.1145/1454115
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cell BE
  2. OpenMP
  3. heterogeneous multicores
  4. software distributed shared memory
  5. software shared virtual memory

Qualifiers

  • Research-article

Conference

PACT '08

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

  • (2018) Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators. Journal of Real-Time Image Processing 15:1 (73-92). DOI: 10.1007/s11554-015-0544-0. Online publication date: 1-Jun-2018
  • (2016) Partitioning and Data Mapping in Reconfigurable Cache and Scratchpad Memory-Based Architectures. ACM Transactions on Design Automation of Electronic Systems 22:1 (1-25). DOI: 10.1145/2934680. Online publication date: 2-Sep-2016
  • (2016) Software Coherence Management on Non-coherent Cache Multi-cores. Proceedings of the 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID) (397-402). DOI: 10.1109/VLSID.2016.70. Online publication date: 4-Jan-2016
  • (2015) Architecture Support for Tightly-Coupled Multi-Core Clusters with Shared-Memory HW Accelerators. IEEE Transactions on Computers 64:8 (2132-2144). DOI: 10.1109/TC.2014.2360522. Online publication date: 1-Aug-2015
  • (2014) Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (160-167). DOI: 10.1109/SBAC-PAD.2014.9. Online publication date: 22-Oct-2014
  • (2014) Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators. Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (1-8). DOI: 10.1109/DASIP.2014.7115617. Online publication date: Oct-2014
  • (2014) Hybrid address spaces. Journal of Systems and Software 97:C (47-64). DOI: 10.1016/j.jss.2014.06.058. Online publication date: 1-Oct-2014
  • (2014) A Novel Object-Oriented Software Cache for Scratchpad-Based Multi-Core Clusters. Journal of Signal Processing Systems 77:1-2 (77-93). DOI: 10.1007/s11265-014-0881-4. Online publication date: 1-Oct-2014
  • (2013) A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters. Proceedings of the 2013 IEEE 24th International Conference on Application-specific Systems, Architectures and Processors (ASAP) (281-288). DOI: 10.1109/ASAP.2013.6567591. Online publication date: 5-Jun-2013
  • (2012) A Multidimensional Software Cache for Scratchpad-Based Systems. Innovations in Embedded and Real-Time Systems Engineering for Communication (59-78). DOI: 10.4018/978-1-4666-0912-9.ch004. Online publication date: 2012
