Abstract
The Cell Broadband Engine shows much promise for high-performance computing applications. The Cell is a heterogeneous multi-core processor in which the bulk of the computational workload is meant to be borne by eight co-processors called SPEs. Each SPE operates on its own 256 KB local store, and all the SPEs can also access a shared main memory of 512 MB to 2 GB through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates programming challenges. To help address these, we have implemented the core features of MPI for the Cell. Our implementation treats each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, which uses the rendezvous protocol to make MPI communication efficient for long messages. We further present experimental results on Cell hardware, where the implementation demonstrates good performance, such as throughput of up to 6.01 GB/s and latency as low as 0.65 μs on the pingpong test. These results show that MPI calls can be implemented efficiently even on the simple SPE cores.
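As a rough illustration of the rendezvous idea behind synchronous mode sends, the following Python sketch simulates it with two threads. It is not the implementation described in the paper: the names `Envelope`, `RendezvousChannel`, `ssend`, `recv`, and `pingpong` are hypothetical, threads stand in for SPEs, and a byte copy stands in for a DMA transfer. The sender posts only a small envelope describing its buffer, the receiver matches it and copies the data once into its destination, and the send returns only after that copy has completed.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical send request: metadata plus a reference to the source buffer."""
    tag: int
    payload: bytes  # stands in for the address of the sender's buffer
    done: threading.Event = field(default_factory=threading.Event)


class RendezvousChannel:
    """One-directional channel between one sender and one receiver (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []  # envelopes posted but not yet matched by a receive

    def _match(self, tag):
        # Caller must hold self._cond. Returns and removes the first matching envelope.
        for env in self._pending:
            if env.tag == tag:
                self._pending.remove(env)
                return env
        return None

    def ssend(self, payload: bytes, tag: int) -> None:
        """Synchronous-mode send: returns only after the receiver has copied the data."""
        env = Envelope(tag, payload)
        with self._cond:
            self._pending.append(env)  # post the envelope; no payload copy is made
            self._cond.notify_all()
        env.done.wait()  # block until the matching receive completes

    def recv(self, tag: int) -> bytearray:
        """Wait for a matching envelope, copy once from source to destination."""
        with self._cond:
            env = self._match(tag)
            while env is None:
                self._cond.wait()
                env = self._match(tag)
        dest = bytearray(env.payload)  # the single copy (a DMA transfer in this analogy)
        env.done.set()  # release the blocked sender
        return dest


def pingpong(message: bytes, rounds: int = 3) -> bytes:
    """Bounce a message between two threads and return the last reply."""
    a_to_b, b_to_a = RendezvousChannel(), RendezvousChannel()

    def echo():
        for _ in range(rounds):
            data = a_to_b.recv(tag=0)
            b_to_a.ssend(bytes(data), tag=0)

    worker = threading.Thread(target=echo)
    worker.start()
    reply = b""
    for _ in range(rounds):
        a_to_b.ssend(message, tag=0)
        reply = bytes(b_to_a.recv(tag=0))
    worker.join()
    return reply
```

Because no intermediate buffer holds the payload, a long message is copied only once, which is the usual argument for using a rendezvous protocol with large transfers. The cost is that the sender blocks until a matching receive has been posted.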
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Krishna, M. et al. (2007). A Synchronous Mode MPI Implementation on the Cell BE™ Architecture. In: Stojmenovic, I., Thulasiram, R.K., Yang, L.T., Jia, W., Guo, M., de Mello, R.F. (eds) Parallel and Distributed Processing and Applications. ISPA 2007. Lecture Notes in Computer Science, vol 4742. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74742-0_86
DOI: https://doi.org/10.1007/978-3-540-74742-0_86
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74741-3
Online ISBN: 978-3-540-74742-0
eBook Packages: Computer Science (R0)