Abstract
Programming difficulty is one of the main impediments to the broad acceptance of multi-core systems that have no hardware support for transparent data transfer between local and global memories. A software cache is a robust way to give the user a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture designed to enable pre-fetch techniques. Memory accesses are classified at compile time into two classes: high-locality and irregular. Our approach then steers each memory reference toward one of two cache structures, each optimized for its access pattern. These structures allow high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. The cache design enables automatic pre-fetch and modulo scheduling transformations. Performance evaluation on a set of parallel NAS applications indicates that the optimized software-cache structures, combined with the proposed pre-fetch techniques, translate into speed-ups of between 10% and 20%.
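The split between the two access classes can be illustrated with a minimal sketch in C. This is not the authors' implementation: the names `dma_get`, `sum_high_locality`, `irregular_cache`, `cache_read`, and `sum_irregular`, the line size, and the direct-mapped organization are all assumptions made for illustration, and `memcpy` stands in for a real DMA transfer. The high-locality path fetches one line per outer iteration, so the innermost loop runs with no cache check, while the irregular path pays a tag check on every access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_ELEMS 32  /* elements per cache line (assumed value) */
#define NUM_LINES  8   /* lines in the irregular cache (assumed value) */

/* Hypothetical stand-in for a DMA transfer from global memory to local store. */
static void dma_get(double *local, const double *global, size_t count) {
    memcpy(local, global, count * sizeof *local);
}

/* High-locality path: one fetch per line, hoisted out of the innermost loop,
 * so the inner loop touches only the local buffer with no lookup overhead. */
double sum_high_locality(const double *global, size_t n) {
    double line[LINE_ELEMS];
    double sum = 0.0;
    for (size_t base = 0; base < n; base += LINE_ELEMS) {
        size_t count = (n - base < LINE_ELEMS) ? n - base : LINE_ELEMS;
        dma_get(line, global + base, count);
        for (size_t i = 0; i < count; i++)
            sum += line[i];
    }
    return sum;
}

/* Irregular path: a direct-mapped cache with a tag check on every access. */
typedef struct {
    size_t tag[NUM_LINES];
    int valid[NUM_LINES];
    double data[NUM_LINES][LINE_ELEMS];
    unsigned long misses;
} irregular_cache;

double cache_read(irregular_cache *c, const double *global, size_t n, size_t idx) {
    assert(idx < n);
    size_t line_no = idx / LINE_ELEMS;
    size_t slot = line_no % NUM_LINES;
    if (!c->valid[slot] || c->tag[slot] != line_no) {
        /* Miss: fetch the whole line that contains idx. */
        size_t base = line_no * LINE_ELEMS;
        size_t count = (n - base < LINE_ELEMS) ? n - base : LINE_ELEMS;
        dma_get(c->data[slot], global + base, count);
        c->tag[slot] = line_no;
        c->valid[slot] = 1;
        c->misses++;
    }
    return c->data[slot][idx % LINE_ELEMS];
}

/* Sums global[index[k]] for k in [0, m) through the irregular cache. */
double sum_irregular(const double *global, size_t n, const size_t *index, size_t m) {
    irregular_cache c;
    memset(&c, 0, sizeof c);
    double sum = 0.0;
    for (size_t k = 0; k < m; k++)
        sum += cache_read(&c, global, n, index[k]);
    return sum;
}
```

In this sketch the choice between the two paths is made by the programmer. The paper describes that classification as being made at compile time.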
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vujić, N., Gonzàlez, M., Martorell, X., Ayguadé, E. (2008). Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture. In: Amaral, J.N. (eds) Languages and Compilers for Parallel Computing. LCPC 2008. Lecture Notes in Computer Science, vol 5335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89740-8_3
DOI: https://doi.org/10.1007/978-3-540-89740-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89739-2
Online ISBN: 978-3-540-89740-8
eBook Packages: Computer Science (R0)