Skip to main content
Log in

Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

FT64 is the first 64-bit stream processor designed for scientific computing. It is critical to exploit optimizing streamization approaches for scientific applications on FT64 due to the inefficiency of direct streamization approach. In this paper, we propose a novel matrix-based streamization approach for improving locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between loops and arrays of the original programs, and it is helpful for formulating the streamization problem. Second, three key techniques for optimizing streamization approach are proposed based on the transformations of the matrix, i.e., coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs equally to or better than the corresponding Fortran programs on Itanium 2 except CG. It is certain that the matrix-based streamization approach is a promising and practical solution to efficiently exploit the tremendous potential of FT64.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Kapasi UJ, Rixner S, Dally WJ et al (2003) Programmable stream processors. IEEE Comput 54–62

  2. Khailany B (2003) The VLSI implementation and evaluation of area-and energy-efficient streaming media processors. Ph.D. thesis, Stanford University

  3. Taylor M, Kim J, Miller J et al. (2002) The RAW microprocessor: a computational fabric for software circuits and general purpose programs. IEEE Micro 22(2):25–35

    Article  Google Scholar 

  4. Burger D, Keckler SW, McKinley KS et al. (2004) Scaling to the end of silicon with EDGE architectures. Computer 37(7):44–55

    Article  Google Scholar 

  5. Gordon MI, Thies W, Amarasinghe S (2006) Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proceedings of ASPLOS’06, California, USA

  6. Andrew AL, Thies W, Amarasinghe S (2003) Linear analysis and optimization of stream programs. In: Proceedings of the SIGPLAN’03 conference on programming language design and implementation, San Diego, CA

  7. Owens JD, Rixner S et al (2002) Media processing applications on the imagine stream processor. In: Proceedings of the 2002 international conference on computer design

  8. Yang X, Yan X, Xing Z et al. (2007) A 64-bit stream processor architecture for scientific applications. In: ISCA’07: Proceedings of the 34th annual international symposium on computer architecture. ACM Press, New York, pp 210–219

    Chapter  Google Scholar 

  9. Amarasinghe S et al (2003) Stream languages and programming models. In: Proceedings of the international conference on parallel architectures and compilation techniques 2003

  10. Mattson P (2002) A programming system for the imagine media processor. Ph.D. thesis, Dept of Electrical Engineering, Stanford University

  11. Du J, Yang X et al (2007) Architecture-based optimization for mapping scientific applications to imagine. In: ISPA’07: Proceedings of the 2007 international symposium on parallel and distributed processing with applications, Ontario, Canada

  12. Das A, Dally WJ, Mattson P (2006) Compiling for stream processing. In: PACT’06: Proceedings of the 15th international conference on parallel architectures and compilation techniques. ACM Press, New York, pp 33–42

    Chapter  Google Scholar 

  13. Johnsson O, Stenemo M, ul-Abdin Z (2005) Programming & implementation of streaming applications. Master’s thesis, Computer and Electrical Engineering Halmstad University

  14. Ahn JH, Dally WJ et al (2004). Evaluating the imagine stream architecture. In: Proceedings of the annual international symposium on computer architecture 2004

  15. Jayasena NS (2005) Memory hierarchy design for stream computing. Ph.D. thesis, Stanford University

  16. Wolf ME, Lam M (1991) A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans Parallel Distrib Syst 2(4):452–471

    Article  Google Scholar 

  17. Kuck D, Kuhn R et al (1981) Dependence graphs and compiler optimizations. In: Conference record of the eighth annual ACM symposium on the principles of programming languages, Williamsburg, VA, January 1981

  18. Wolfe MJ (1996) High performance compilers for parallel computing. Addison-Wesley, Reading

    MATH  Google Scholar 

  19. Du J, Yang X et al (2006) Scientific computing applications on the imagine stream processor. In: Proceedings of the 11th Asia-pacific computer systems architecture conference, Shanghai, China

  20. Fan Z, Qiu F et al (2004) Gpu cluster for high performance computing. In: Proceedings of supercomputing conference 2004

  21. Harris MJ, Baxter WV et al (2003) Simulation of cloud dynamics on graphics hardware. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on graphics hardware, Switzerland, pp 92–101

  22. Bolz J, Farmer I, Grinspun E, Schr Öder P (2003) Sparse matrix solvers on the Gpu: conjugate gradients and multigrid. ACM Trans Graph 22(3):917–924

    Article  Google Scholar 

  23. Dally WJ, Hanrahan P et al (2003) Merrimac: supercomputing with streams. In: Proceedings of supercomputing conference 2003

  24. Erez M, Ahn J et al (2004) Analysis and performance results of a molecular modeling application on Merrimac. In: Proceedings of supercomputing conference 2004

  25. Erez M (2007) Merrimac—high-performance, highly-efficient scientific computing with streams. Ph.D. thesis, Dept of Electrical Engineering, Stanford University

  26. Erez M, Ahn J et al (2007) Executing irregular scientific applications on stream architectures. In: (ICS’07): Proceedings of the 21th ACM international conference on supercomputing

  27. Griem G, Oliker L (2003) Transitive closure on the imagine stream processor. In: Proceedings of the 5th workshop on media and streaming processors, San Diego, CA

  28. Ahn J, Dally WJ, Erez M (2007) Tradeoff between data-, instruction-, and thread-level parallelism in stream processors. In: (ICS’07): Proceedings of the 21th ACM international conference on supercomputing

  29. Sermulins J, Thies W et al (2005) Cache aware optimization of stream programs. In: Proceedings of LCTES’05, Chicago, Illinois, USA

  30. Wolf M, Lam M (1991) A data locality optimizing algorithm. In: Proceedings of ACM SIGPLAN’91 conference on programming language design and implementation, Ontario, Canada, pp 30–44

  31. McKinley K, Carr S, Tseng CW (1996) Improving data locality with loop transformations. ACM Trans Program Lang Syst

  32. Li W (1993) Compiling for NUMA parallel machines. Ph.D. thesis, Cornell University

  33. Kandemir M, Choudhary A et al. (1999) A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans Parallel Distrib Syst 10(2):115–135

    Article  Google Scholar 

  34. Cierniak M, Li W (1995) Unifying Data and control transformations for distributed shared memory machines. In: ACM SIGPLAN IPDPS, pp 205–217

  35. Kandemir M, Choudhary A et al (1998) Improving locality using loop and data transformations in an integrated framework. In: Proceedings of international symposium on microarchitecture, pp 285–297

  36. Kandemir M, Banerjee P et al. (2001) Static and dynamic locality optimizations using integer linear programming. IEEE Trans Parallel Distrib Syst 12(9):922–940

    Article  Google Scholar 

  37. Kandemir M et al (1999) A graph based framework to detect optimal memory layouts for improving data locality. In: Proceedings of the 13th international parallel processing symposium, San Juan, Puerto Rico, pp 738–743

  38. O’Boyle M, Knijnenburg P (1996) Non-singular data transformations: definition, validity, applications. In: Proceedings of 6th workshop on compilers for parallel computers, pp 287–297

  39. Garcia J, Ayguade E et al (1996) Dynamic data distribution with control flow analysis. In: Proceedings of supercomputing conference 1996

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jing Du.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yang, X., Du, J., Yan, X. et al. Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor. J Supercomput 47, 171–197 (2009). https://doi.org/10.1007/s11227-008-0186-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-008-0186-0

Keywords

Navigation