Reducing Off-Chip Memory Access via Stream-Conscious Tiling on Multimedia Applications

Zhang, Chunhui; Kurdahi, Fadi

doi:10.1007/s10766-006-0027-9

Published: 10 February 2007

Volume 35, pages 63–98 (2007)

International Journal of Parallel Programming

The iteration space of a loop nest is the set of all loop iterations bounded by the loop limits. Tiling the iteration space can effectively exploit the available parallelism, which is essential to compiling for multiprocessors and to pipelined architecture design. Tiling also improves data locality, which can dramatically reduce memory accesses and, consequently, the energy consumed by those accesses. However, previous studies on tiling were based on data dependences, so arrays without dependences, such as input arrays (data streams), were not considered. In this paper, we extend the tiling exploration to accommodate these dependence-free arrays as well, and propose a stream-conscious tiling scheme for off-chip memory access optimization. We show that input arrays are at least as important as arrays with data dependences when the focus is on memory access optimization rather than parallelism extraction. Our approach is verified on TI’s low-power C55x DSP with popular multimedia applications, reducing off-chip memory accesses by 67% on average compared with traditional iteration space tiling.
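
To make the role of a dependence-free input array concrete, the following is a minimal sketch and not the authors' stream-conscious scheme. It assumes a hypothetical FIR-style loop nest over an input stream `x`, a deliberately simplified cost model in which every read of `x` from outside an on-chip buffer counts as one off-chip access, and coefficients that already reside on chip. The function names, the tile size, and the cost model are all illustrative assumptions.

```python
def fir_untiled(x, coef, n_out):
    """Plain loop nest: the iteration space is every (i, j) pair.

    Assumption: each read of x goes off-chip, so it is counted once per use.
    """
    out = [0] * n_out
    offchip_reads = 0
    for i in range(n_out):
        for j in range(len(coef)):
            out[i] += coef[j] * x[i + j]
            offchip_reads += 1
    return out, offchip_reads


def fir_tiled(x, coef, n_out, tile):
    """Same computation, with the i loop split into tiles of `tile` iterations.

    Assumption: the slice of x touched by one tile fits in an on-chip buffer,
    so it is fetched from off-chip once per tile and then reused.
    """
    out = [0] * n_out
    offchip_reads = 0
    taps = len(coef)
    for i0 in range(0, n_out, tile):
        i1 = min(i0 + tile, n_out)
        # Fetch the part of the input stream that this tile reads.
        buf = x[i0:i1 + taps - 1]
        offchip_reads += len(buf)
        for i in range(i0, i1):
            for j in range(taps):
                out[i] += coef[j] * buf[i - i0 + j]
    return out, offchip_reads


if __name__ == "__main__":
    coef = [1, 2, 3, 4]
    n_out = 16
    x = list(range(n_out + len(coef) - 1))  # 19 input samples
    ref, reads_untiled = fir_untiled(x, coef, n_out)
    res, reads_tiled = fir_tiled(x, coef, n_out, tile=4)
    print(ref == res, reads_untiled, reads_tiled)
```

Under these assumptions, the untiled nest counts 64 reads of `x` and the tiled version counts 28 while producing the same output. The input array carries no data dependence, yet the tile size still determines how often it is fetched, which is the kind of effect the abstract argues a dependence-only tiling analysis overlooks.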

Author information

Authors and Affiliations

Department of EECS, University of California, Irvine, CA, 92697, USA
Chunhui Zhang & Fadi Kurdahi

Corresponding author

Correspondence to Chunhui Zhang.

About this article

Cite this article

Zhang, C., Kurdahi, F. Reducing Off-Chip Memory Access via Stream-Conscious Tiling on Multimedia Applications. Int J Parallel Prog 35, 63–98 (2007). https://doi.org/10.1007/s10766-006-0027-9

Issue Date: February 2007
