Optimizing the Use of Static Buffers for DMA on a CELL Chip

Chen, Tong; Sura, Zehra; O’Brien, Kathryn; O’Brien, John K.

doi:10.1007/978-3-540-72521-3_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4382)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

The CELL architecture has one Power Processor Element (PPE) core and eight Synergistic Processor Element (SPE) cores, which have a distinct instruction set architecture of their own. The PPE core accesses memory via a traditional caching mechanism, but each SPE core can access memory only via a small (256 KB) software-controlled local store. The PPE cache and the SPE local stores are connected to each other and to main memory via a high-bandwidth bus. Software is responsible for all data transfers to and from the SPE local stores. To hide the high latency of DMA transfers, data may be prefetched into SPE local stores using loop blocking transformations and static buffers. We find that the performance of an application can vary depending on the size of the buffers used and on whether a single-, double-, or triple-buffer scheme is used. Constrained by the limited space available for data buffers in the SPE local store, we want to choose the optimal buffering scheme for a given space budget. We also want to determine the optimal buffer size for a given scheme, that is, the size beyond which a larger buffer yields negligible performance improvement. We develop a model to automatically infer these parameters for static buffering, taking into account the DMA latency and transfer rates and the amount of computation in the application loop being targeted. We test the accuracy of our prediction model using a research prototype compiler developed on top of the IBM XL compiler infrastructure.
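
The trade-off between buffering scheme and buffer size under a fixed space budget can be illustrated with a toy cost model. The sketch below is not the authors' prediction model, which this preview does not show. It assumes that each DMA transfer costs a fixed latency plus size divided by bandwidth, that computation time is proportional to block size, and that with two buffers the transfer of the next block overlaps completely with computation on the current one. The function name and all parameters are hypothetical.

```python
import math


def loop_time(n_bytes, budget, n_buffers, latency, bandwidth, compute_per_byte):
    """Estimate the time to stream n_bytes through a local-store buffer area.

    Toy model (an assumption of this sketch, not the paper's model):
      - the space budget is split evenly among the buffers,
      - one DMA transfer costs latency + block / bandwidth,
      - computing on a block costs compute_per_byte * block,
      - with two buffers, fetching block i+1 overlaps computing on block i.
    """
    if n_buffers not in (1, 2):
        raise ValueError("this sketch models single and double buffering only")

    block = budget // n_buffers            # bytes available per buffer
    n_blocks = math.ceil(n_bytes / block)  # loop iterations after blocking
    dma = latency + block / bandwidth      # time for one transfer
    compute = compute_per_byte * block     # time to process one block

    if n_buffers == 1:
        # Single buffering: every transfer is fully exposed.
        return n_blocks * (dma + compute)

    # Double buffering: the first transfer and the last computation are
    # exposed; in between, the slower of the two stages sets the pace.
    return dma + (n_blocks - 1) * max(dma, compute) + compute


# Same space budget, two schemes (all numbers are made up for illustration).
single = loop_time(8, budget=4, n_buffers=1, latency=1, bandwidth=1, compute_per_byte=1)
double = loop_time(8, budget=4, n_buffers=2, latency=1, bandwidth=1, compute_per_byte=1)
```

In this toy setting double buffering halves the block size, so it pays the fixed DMA latency twice as often, yet it still finishes sooner because most transfers are hidden behind computation. When the fixed latency dominates the per-block work, the same model can favor the single-buffer scheme, which is the kind of crossover a buffer-selection model has to predict.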

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
Tong Chen, Zehra Sura, Kathryn O’Brien & John K. O’Brien

Editor information

George Almási, Călin Caşcaval, Peng Wu

Cite this paper

Chen, T., Sura, Z., O’Brien, K., O’Brien, J.K. (2007). Optimizing the Use of Static Buffers for DMA on a CELL Chip. In: Almási, G., Caşcaval, C., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2006. Lecture Notes in Computer Science, vol 4382. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72521-3_23

DOI: https://doi.org/10.1007/978-3-540-72521-3_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72520-6
Online ISBN: 978-3-540-72521-3
eBook Packages: Computer Science, Computer Science (R0)
