Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

Yang, Xuejun; Du, Jing; Yan, Xiaobo; Deng, Yu

doi:10.1007/s11227-008-0186-0

Published: 16 March 2008

The Journal of Supercomputing, Volume 47, pages 171–197 (2009)


Abstract

FT64 is the first 64-bit stream processor designed for scientific computing. Because direct streamization of scientific applications is inefficient on FT64, optimized streamization approaches are critical. In this paper, we propose a novel matrix-based streamization approach for improving the locality and parallelism of scientific applications on FT64. First, a Data&Computation Matrix is built to abstract the relationship between the loops and arrays of the original program, which helps in formulating the streamization problem. Second, three key optimization techniques are proposed based on transformations of the matrix: coarse-grained program transformations, fine-grained program transformations, and stream organization optimizations. Finally, we apply our approach to ten typical scientific application kernels on FT64. The experimental results show that the matrix-based streamization approach achieves an average speedup of 2.76 over the direct streamization approach, and performs as well as or better than the corresponding Fortran programs on Itanium 2 for every kernel except CG. These results indicate that the matrix-based streamization approach is a promising and practical way to exploit the tremendous potential of FT64 efficiently.

Author information

Authors and Affiliations

PDL, School of Computer, National University of Defense Technology, Changsha, Hunan, 410073, China
Xuejun Yang, Jing Du, Xiaobo Yan & Yu Deng

Corresponding author

Correspondence to Jing Du.

About this article

Cite this article

Yang, X., Du, J., Yan, X. et al. Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor. J Supercomput 47, 171–197 (2009). https://doi.org/10.1007/s11227-008-0186-0

Received: 31 January 2008
Accepted: 11 February 2008
Published: 16 March 2008
Issue Date: February 2009
DOI: https://doi.org/10.1007/s11227-008-0186-0
