The iteration space of a loop nest is the set of all loop iterations bounded by the loop limits. Tiling the iteration space can effectively exploit the available parallelism, which is essential to multiprocessor compilation and pipelined architecture design. Tiling also improves data locality, which can dramatically reduce memory accesses and, consequently, the energy consumed by those accesses. However, previous studies on tiling were based on data dependence, so arrays without dependencies, such as input arrays (data streams), were not considered. In this paper, we extend the tiling exploration to accommodate these dependence-free arrays as well, and propose a stream-conscious tiling scheme for off-chip memory access optimization. We show that input arrays are at least as important as arrays with data dependencies when the focus is on memory access optimization rather than parallelism extraction. Our approach is verified on TI’s low-power C55X DSP with popular multimedia applications, where it reduces off-chip memory accesses by 67% on average compared with traditional iteration space tiling.
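As a rough illustration of the terms used above, the following C sketch shows a two-level loop nest and a rectangularly tiled traversal of the same iteration space. It is a generic textbook-style tiling, not the stream-conscious scheme proposed in the paper. The FIR-style kernel, the function names `fir_untiled` and `fir_tiled`, the sizes `N` and `K`, and the tile parameters `ti` and `tj` are all assumptions made for this example; the array `x` plays the role of a dependence-free input stream.

```c
#include <stddef.h>

#define N 16   /* number of output samples (hypothetical size) */
#define K 4    /* number of filter taps (hypothetical size) */

/* Untiled loop nest.
 * Iteration space: all (i, j) with 0 <= i < N and 0 <= j < K.
 * x is read-only input (no loop-carried dependence on it). */
void fir_untiled(const int x[N + K - 1], const int h[K], int y[N])
{
    for (size_t i = 0; i < N; i++) {
        y[i] = 0;
        for (size_t j = 0; j < K; j++)
            y[i] += h[j] * x[i + j];
    }
}

/* Tiled loop nest over the same iteration space.
 * The outer loops (ii, jj) step from tile to tile; the inner loops
 * (i, j) visit the points inside one ti-by-tj tile, clipped at the
 * loop limits. ti and tj must both be greater than zero. */
void fir_tiled(const int x[N + K - 1], const int h[K], int y[N],
               size_t ti, size_t tj)
{
    for (size_t i = 0; i < N; i++)
        y[i] = 0;

    for (size_t ii = 0; ii < N; ii += ti)
        for (size_t jj = 0; jj < K; jj += tj)
            for (size_t i = ii; i < ii + ti && i < N; i++)
                for (size_t j = jj; j < jj + tj && j < K; j++)
                    y[i] += h[j] * x[i + j];
}
```

Both functions perform the same set of multiply-accumulate operations and differ only in the order in which the iterations are visited. Within one tile, only `ti + tj - 1` consecutive elements of `x` are touched, which is the kind of input-array reuse that a locality-oriented choice of tile shape tries to exploit.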
Cite this article
Zhang, C., Kurdahi, F. Reducing Off-Chip Memory Access via Stream-Conscious Tiling on Multimedia Applications. Int J Parallel Prog 35, 63–98 (2007). https://doi.org/10.1007/s10766-006-0027-9