Abstract
FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.
Similar content being viewed by others
References
Kapasi UJ, Rixner S, Dally WJ et al (2003) Programmable stream processors. IEEE Comput 54–62
Khailany B (2003) The VLSI implementation and evaluation of area-and energy-efficient streaming media processors. Ph.D. thesis, Stanford University
Taylor M, Kim J, Miller J et al. (2002) The RAW microprocessor: a computational fabric for software circuits and general purpose programs. IEEE Micro 22(2):25–35
Burger D, Keckler SW, McKinley KS et al. (2004) Scaling to the end of silicon with EDGE architectures. Computer 37(7):44–55
Gordon MI, Thies W, Amarasinghe S (2006) Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of ASPLOS’06, California, USA
Andrew AL, Thies W, Amarasinghe S (2003) Linear analysis and optimization of stream programs. In: Proceedings of the SIGPLAN’03 conference on programming language design and implementation, San Diego, CA
Owens JD, Rixner S et al (2002) Media processing applications on the imagine stream processor. In: Proceedings of the 2002 international conference on computer design
Yang X, Yan X, Xing Z et al. (2007) A 64-bit stream processor architecture for scientific applications. In: ISCA’07: Proceedings of the 34th annual international symposium on computer architecture. ACM Press, New York, pp 210–219
Amarasinghe S et al (2003) Stream languages and programming models. In: Proceedings of the international conference on parallel architectures and compilation techniques 2003
Mattson P (2002) A programming system for the imagine media processor. Ph.D. thesis, Dept of Electrical Engineering, Stanford University
Du J, Yang X et al (2007) Architecture-based optimization for mapping scientific applications to imagine. In: ISPA’07: Proceedings of the 2007 international symposium on parallel and distributed processing with applications, Ontario, Canada
Das A, Dally WJ, Mattson P (2006) Compiling for stream processing. In: PACT’06: Proceedings of the 15th international conference on parallel architectures and compilation techniques. ACM Press, New York, pp 33–42
Johnsson O, Stenemo M, ul-Abdin Z (2005) Programming & implementation of streaming applications. Master’s thesis, Computer and Electrical Engineering Halmstad University
Ahn JH, Dally WJ et al (2004). Evaluating the imagine stream architecture. In: Proceedings of the annual international symposium on computer architecture 2004
Jayasena NS (2005) Memory hierarchy design for stream computing. Ph.D. thesis, Stanford University
Wolf ME, Lam M (1991) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst 2(4):452–471
Kuck D, Kuhn R et al (1981) Dependence graphs and compiler optimizations. In: Conference record of the eighth annual ACM symposium on the principles of programming languages, Williamsburg, VA, January 1981
Wolfe MJ (1996) High performance compilers for parallel computing. Addison-Wesley, Reading
Du J, Yang X et al (2006) Scientific computing applications on the imagine stream processor. In: Proceedings of the 11th Asia-pacific computer systems architecture conference, Shanghai, China
Fan Z, Qiu F et al (2004) Gpu cluster for high performance computing. In: Proceedings of supercomputing conference 2004
Harris MJ, Baxter WV et al (2003) Simulation of cloud dynamics on graphics hardware. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, Switzerland, pp 92–101
Bolz J, Farmer I, Grinspun E, Schr Öder P (2003) Sparse matrix solvers on the Gpu: conjugate gradients and multigrid. ACM Trans Graph 22(3):917–924
Dally WJ, Hanrahan P et al (2003) Merrimac: supercomputing with streams. In: Proceedings of supercomputing conference 2003
Erez M, Ahn J et al (2004) Analysis and performance results of a molecular modeling application on Merrimac. In: Proceedings of supercomputing conference 2004
Erez M (2007) Merrimac—high-performance, highly-efficient scientific computing with streams. Ph.D. thesis, Dept of Electrical Engineering, Stanford University
Erez M, Ahn J et al (2007) Executing irregular scientific applications on stream architectures. In: (ICS’07): Proceedings of the 21th ACM international conference on supercomputing
Griem G, Oliker L (2003) Transitive closure on the imagine stream processor. In: Proceedings of the 5th workshop on media and streaming processors, San Diego, CA
Ahn J, Dally WJ, Erez M (2007) Tradeoff between data-, instruction-, and thread-level parallelism in stream processors. In: (ICS’07): Proceedings of the 21th ACM international conference on supercomputing
Sermulins J, Thies W et al (2005) Cache aware optimization of stream programs. In: Proceedings of LCTES’05, Chicago, Illinois, USA
Wolf M, Lam M (1991) A data locality optimizing algorithm. In: Proceedings of ACM SIGPLAN’91 conference on programming language design and implementation, Ontario, Canada, pp 30–44
McKinley K, Carr S, Tseng CW (1996) Improving data locality with loop transformations. ACM Trans Program Lang Syst
Li W (1993) Compiling for NUMA parallel machines. Ph.D. thesis, Cornell University
Kandemir M, Choudhary A et al. (1999) A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans Parallel Distrib Syst 10(2):115–135
Cierniak M, Li W (1995) Unifying Data and control transformations for distributed shared memory machines. In: ACM SIGPLAN IPDPS, pp 205–217
Kandemir M, Choudhary A et al (1998) Improving locality using loop and data transformations in an integrated framework. In: Proceedings of international symposium on microarchitecture, pp 285–297
Kandemir M, Banerjee P et al. (2001) Static and dynamic locality optimizations using integer linear programming. IEEE Trans Parallel Distrib Syst 12(9):922–940
Kandemir M et al (1999) A graph based framework to detect optimal memory layouts for improving data locality. In: Proceedings of the 13th international parallel processing symposium, San Juan, Puerto Rico, pp 738–743
O’Boyle M, Knijnenburg P (1996) Non-singular data transformations: definition, validity, applications. In: Proceedings of 6th workshop on compilers for parallel computers, pp 287–297
Garcia J, Ayguade E et al (1996) Dynamic data distribution with control flow analysis. In: Proceedings of supercomputing conference 1996
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Yang, X., Du, J., Yan, X. et al. Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor. J Supercomput 47, 171–197 (2009). https://doi.org/10.1007/s11227-008-0186-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-008-0186-0