ABSTRACT
We present Sequoia, a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines with different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed-memory clusters, and we demonstrate efficient performance running Sequoia programs on both of these platforms.
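The two ideas named in the abstract, vertical communication between memory levels and computation localized to one level, can be illustrated with a small sketch. This is plain Python, not Sequoia syntax (none appears in this abstract). The `LEVELS` table, the function names, and the choice of blocked matrix multiplication are all assumptions made for illustration; the sketch also assumes square matrices whose size is a multiple of every block size.

```python
# Hypothetical illustration only: plain Python, not Sequoia syntax.
# LEVELS is an assumed machine description: the block edge length (in
# elements) that fits in each memory level, outermost level first.
LEVELS = [8, 4, 2]


def copy_block(M, r0, c0, n):
    """Copy an n x n block of M into a fresh list.

    Models an explicit transfer from a parent memory down to a child memory.
    """
    return [row[c0:c0 + n] for row in M[r0:r0 + n]]


def leaf_matmul(A, B, C):
    """Innermost level: compute C += A * B using only local data."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            acc = C[i][j]
            for k in range(n):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def matmul_task(A, B, C, level=0):
    """Split the working set into blocks sized for the next level down,
    then recurse on private copies of those blocks."""
    if level == len(LEVELS) - 1:
        leaf_matmul(A, B, C)
        return
    n = len(A)
    b = min(LEVELS[level + 1], n)  # block edge that fits the child level
    for i in range(0, n, b):
        for j in range(0, n, b):
            c_blk = copy_block(C, i, j, b)      # transfer down
            for k in range(0, n, b):
                a_blk = copy_block(A, i, k, b)  # transfer down
                b_blk = copy_block(B, k, j, b)  # transfer down
                matmul_task(a_blk, b_blk, c_blk, level + 1)
            for r in range(b):                  # transfer result back up
                C[i + r][j:j + b] = c_blk[r]
```

Each call touches only the blocks copied into it, so the same code could in principle be retargeted by changing `LEVELS` alone; that is the kind of portability the abstract describes, although the actual language mechanisms are not shown here.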
Index Terms
Sequoia: programming the memory hierarchy