research-article

Trade-offs in loop transformations

Authors:

Martin Palkovic,

Francky Catthoor,

Henk CorporaalAuthors Info & Claims

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 14, Issue 2

Article No.: 22, Pages 1 - 30

https://doi.org/10.1145/1497561.1497565

Published: 07 April 2009 Publication History

Abstract

Nowadays, multimedia systems deal with huge amounts of memory accesses and large memory footprints. To alleviate the impact of these accesses and reduce the memory footprint, high-level memory exploration and optimization techniques have been proposed. These techniques try to more efficiently utilize the memory hierarchy. An important step in these optimization techniques are loop transformations (LT). They have a crucial effect on later data memory footprint optimization steps and code generation. However, the state-of-the-art work has focused only on individual objectives. The main one in literature involves improving the locality of data accesses, and thus reducing the data memory footprint. It does not consider the trade-offs in the LT step in relation to successive optimization steps. Therefore, it is not globally efficient in mapping the application on the target platform.

In this article we will discuss several trade-offs during the loop transformations. To our knowledge, we are the first ones considering these global trade-offs. Previous work always gave mostly one solution, having the best locality and thus the optimized memory footprint, even though some research in two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint. We show that by sacrificing the footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, namely the QSDPCM video coder. At the end, we show that considering trade-offs for this application leads to 16% energy reduction in a two-layer memory subsystem and 10% cycle reduction on the ARM platform.

References

[1]

Absar, J., Catthoor, F., and Das, K. 2003. Call-instance based function inlining for increasing data access related optimisation opportunities. Tech. rep., IMEC, Leuven, Belgium.

[2]

Allen, R. and Kennedy, K. 1987. Automatic translation of fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4, 491--542.

Digital Library

[3]

Ancourt, C., Barthou, D., Guettier, C., Irigoin, F., Jeannet, B., Jourdan, J., and Mattioli, J. 1997. Automatic data mapping of signal processing applications. In Proceedings of the IEEE International Conference on Application-Specific Processors, 350--362.

Digital Library

[4]

Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., and Marwedel, P. 2002. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proceedings of the International Symposium on Hardware/Software Codesign (CODES), 73--78.

Digital Library

[5]

Banerjee, U., Eigenmann, R., Nicolau, A., and Padua, D. 1993. Automatic program parallelization. Proc. IEEE 81, 2, 211--243.

[6]

Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In IEEE International Conference on Parallel Architectureand Compilation Techniques (PACT'13), 7--16.

Digital Library

[7]

Bastoul, C., Cohen, A., Girbal, S., Sharma, S., and Temam, O. 2003. Putting polyhedral loop transformations to work. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computers (LCPC'16). Lecture Notes in Computer Science, vol. 2958, 209--225.

[8]

Brockmeyer, E., Miranda, M., Catthoor, F., and Corporaal, H. 2003. Layer assignment techniques for low energy in multi-layered memory organisations. In Proceedings of the 6th ACM/IEEE Conference on Design and Test in Europe, 1070--1075.

Digital Library

[9]

Carr, S., McKinley, K. S., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 252--262.

Digital Library

[10]

Catthoor, F. 2005. Meta model for human interaction and decision-making: meta-concepts and balloonist vs road-builder metaphore. slide set.

[11]

Catthoor, F., Danckaert, K., Kulkarni, C., Brockmeyer, E., Kjeldsberg, P. G., Van Achteren, T., and Omnes, T. 2002. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Amsterdam.

[12]

Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'95), ACM, New York, 205--217.

Digital Library

[13]

Clauss, P. and Loechner, V. 1998. Parametric analysis of polyhedral iteration spaces. J. VLSI Signal Processing 19, 179--194.

Digital Library

[14]

Danckaert, K., Catthoor, F., and De Man, H. 2000. A preprocessing step for global loop transformations for data transfer and storage optimization. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 34--40.

Digital Library

[15]

Darte, A. and Robert, Y. 1995. Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. J. Parall. Distrib. Comput. 29, 1, 43--59.

Digital Library

[16]

De Greef, E., Catthoor, F., and De Man, H. 1997. Memory size reduction through storage order optimization for embedded parallel multimedia applications. J. Parall. Comput. 23, 12. (Special issue on Parallel Processing and Multimedia)

Digital Library

[17]

De Man, H., Catthoor, F., Goossens, G., Vanhoof, J., Van Meerbergen, J., Note, S., and Huisken, J. 1990. Architecture-driven synthesis techniques for vlsi implementation of dsp algorithms. Proc. IEEE, 78, 2. (Special issue on The Future of Computer-Aided Design) 319--335.

[18]

Dezan, C., Le Verge, H., Quinton, P., and Saouter, Y. 1992. The Alpha du CENTAUR experiment. In Algorithms and parallel VLSI architectures II, P. Quinton and Y. Robert (eds.), Elsevier, Amsterdam, 325--334.

Digital Library

[19]

Falk, H. and Marwedel, P. 2003. Control flow optimization by loop nest splitting at the source code level. In Proceedingsof the 6th ACM/IEEE Conference on Design and Test in Europe, 410--415.

Digital Library

[20]

Feautrier, P. 1992. Some efficient solutions to the affine scheduling problems. Int. J. Parall. Program. 21, 5, 389--420.

Digital Library

[21]

Fraboulet, A., Huard, G., and Mignotte, A. 1999. Loop alignment for memory access optimisation. In Proceedings of the 12th ACM/IEEE International Symposium on System-Level Synthesis (ISSS). 71--70.

Digital Library

[22]

Franke, B. and O'Boyle, M. 2003. Array recovery and high-level transformations for dsp applications. ACM Trans. Embed. Comput. Syst. 2, 2, 132--162.

Digital Library

[23]

Hu, Q. 2007. Hierarchical memory size estimation for loop transformation and data memory platform exploration. Ph.D. thesis, Norway University of Science and Technology.

[24]

Kandemir, M., Ramanujam, J., Choudhary, A., and Banerjee, P. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Computers 50, 12, 1321--1335.

Digital Library

[25]

Kelly, W. and Pugh, W. 1993. A framework for unifying reordering transformations. Tech. rep. CS-TR-3193, Department of Computer Science, Univ. of Maryland, College Park MD.

[26]

Kjeldsberg, P. G. 2001. Storage requirement estimation and optimization for data intensive applications. Ph.D. thesis, Norwegian University of Science and Technology.

[27]

Li, W. and Pingali, K. 1992. A singular loop transformation framework based on non-singular matrices. In Proceedings of the 5th Annual Workshop on Languages and Compilers for Parallelism.

Digital Library

[28]

Manjiakian, N. and Abdelrahman, T. 1995. Fusion of loops for parallelism and locality. Tech. rep. CSRI-315, Computer Systems Reserch Institute, University of Toronto.

[29]

Meng, T. H., Gordon, B., Tsern, E., and Hung, A. 1995. Portable video-on-demand in wireless communication. Proc. IEEE 83, 4 (Special Issue on Low Power Electronics). 659--680.

[30]

Olsen, R. and Gao, G. 1992. Collective analysis and transformation of loop clusters. Tech. rep. ACAPS Technical Memo 24, McGill University.

[31]

Palkovic, M., Corporaal, H., and Catthoor, F. 2006. Dealing with data dependent conditions to enable general global source code transformations. Int. J. Embed. Syst. To appear.

[32]

Polychronopoulos, C. 1988. Compiler optimizations for enhancing parallelism and their impact on the architecture design. IEEE Trans. Computers 37, 8, 991--1004.

Digital Library

[33]

Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for dependence analysis. Comm. ACM 35, 8.

Digital Library

[34]

Qin, W. and Malik, S. 2003. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of the Conference on Design Automation and Test in Europe (DATE 03). 556--561.

Digital Library

[35]

Quillere, F., Rajopadhye, S., and Wilde, D. 2000. Generation of efficient nested loops from polyhedra. Int. J. Parall. Program. 28, 5.

Digital Library

[36]

Sarkar, V. and Gao, G. R. 1991. Optimization of array accesses by collective loop transformations. In Proceedings of the 5th International Conference on Supercomputing. 194--205.

Digital Library

[37]

Schuster, T., Bougard, B., Raghavan, P., Priewasser, R., Novo, D., der Perre, L. V., and Catthoor, F. 2007. Design of a low power pre-synchronization asip for multimode sdr terminals. In Proceedings of the International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS). 322--332.

Digital Library

[38]

Semeria, L. and De Micheli, G. 1998. Spc: synthesis of pointers in c. In Proceedings of the IEEE International Conference on Computer Aided Design. 340--346.

Digital Library

[39]

Song, Y., Xu, R., Wang, C., and Li, Z. 2001. Data locality enhancement by memory reduction. In International Conference on Supercomputing. 50--64.

Digital Library

[40]

Strobach, P. 1988. Qsdpcm: A new technique in scene adaptive coding. In Proceedings of the 4th European Signal Processing Conference (EUSIPCO-88). 1141--1144.

[41]

Thies, W., Vivien, F., Sheldon, J., and Amarasinghe, S. 2001. A unified framework for schedule and storage optimization. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation. 232--242.

Digital Library

[42]

Van Achteren, T., Deconinck, G., Catthoor, F., and Lauwereins, R. 2002. Data reuse exploration methodology for loop-dominated applications. In Proceedings of the 5th ACM/IEEE Conference on Design and Test in Europe (DATE). 428--435.

Digital Library

[43]

van Swaaij, M., Franssen, F., Catthoor, F., and De Man, H. 1992. Modelling data and control flow for high-level memory management. In Proceedings of the 3rd ACM/IEEE European Design Automation Conference. 8--13.

[44]

Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., and Catthoor, F. 2003. Advanced copy propagation for arrays. In Proceedings of the SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'03). 24--33.

Digital Library

[45]

Vander Aa, T., Corporaal, H., Catthoor, F., and Deconinck, G. 2005. Combining data and instruction memory energy optimizations for embedded applications. In Proceedings of the 3rd workshop on Embedded Systems for Real-Time Multimedia. 121--126.

[46]

Verdoolaege, S., Catthoor, F., Bruynooghe, M., and Janssens, G. 2003. Multidimensional incremental loop fusion for data locality. In Proceedings of the IEEE 14th International Conference on Application-specific Systems, Architectures and Processors.

[47]

Wilde, D. 1993. A library for doing polyhedral operations. M.S. thesis, Oregon State University.

[48]

Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation (SIGPLAN'91). 30--43.

Digital Library

[49]

Yang, P., Wong, C., Marchal, P., Catthoor, F., Desmet, D., Verkest, D., and Lauwereins, R. 2001. Energy-aware runtime scheduling for embedded multi-processor socs. IEEE Design Test of Comput. 18, 5. (Special issue on Application-Specific Multiprocessor Mapping), 46--58.

Digital Library

Cited By

Kelefouras VDjemame K(2019)A methodology correlating code optimizations with data memory accesses, execution time and energy consumptionThe Journal of Supercomputing10.1007/s11227-019-02880-zOnline publication date: 13-May-2019
https://doi.org/10.1007/s11227-019-02880-z
Kelefouras VDjemame KKaeli DPericàs M(2018)A methodology for efficient code optimizations and memory managementProceedings of the 15th ACM International Conference on Computing Frontiers10.1145/3203217.3203274(105-112)Online publication date: 8-May-2018
https://dl.acm.org/doi/10.1145/3203217.3203274
Pouchet LZhang PSadayappan PCong JHutchings BBetz V(2013)Polyhedral-based data reuse optimization for configurable computingProceedings of the ACM/SIGDA international symposium on Field programmable gate arrays10.1145/2435264.2435273(29-38)Online publication date: 11-Feb-2013
https://dl.acm.org/doi/10.1145/2435264.2435273
Show More Cited By

Index Terms

Trade-offs in loop transformations

Recommendations

DRAM Refresh Mechanisms, Penalties, and Trade-Offs
Ever-growing application data footprints demand faster main memory with larger capacity. DRAM has been the technology choice for main memory due to its low latency and high density. However, DRAM cells must be refreshed periodically to preserve their ...
Understanding the trade-offs in multi-level cell ReRAM memory design
DAC '13: Proceedings of the 50th Annual Design Automation Conference

Resistive Random Access Memory (ReRAM) is one of the most promising emerging memory technologies as a potential replacement for DRAM memory and/or NAND Flash. Multi-level cell (MLC) ReRAM, which can store multiple bits in a single ReRAM cell, can ...
Loop transformations for flash memory: cost models and performance effects

Loop optimization, made of a sequence of loop transformations, plays an important role in performance improvement in data centric applications. Programs using flash memory are no exception to this, but, under certain conditions careless applications of ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems

ACM Transactions on Design Automation of Electronic Systems Volume 14, Issue 2

March 2009

384 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1497561

Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 07 April 2009

Accepted: 01 December 2008

Revised: 01 November 2008

Received: 01 December 2006

Published in TODAES Volume 14, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

10
Total Citations
View Citations
419
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Kelefouras VDjemame K(2019)A methodology correlating code optimizations with data memory accesses, execution time and energy consumptionThe Journal of Supercomputing10.1007/s11227-019-02880-zOnline publication date: 13-May-2019
https://doi.org/10.1007/s11227-019-02880-z
Kelefouras VDjemame KKaeli DPericàs M(2018)A methodology for efficient code optimizations and memory managementProceedings of the 15th ACM International Conference on Computing Frontiers10.1145/3203217.3203274(105-112)Online publication date: 8-May-2018
https://dl.acm.org/doi/10.1145/3203217.3203274
Pouchet LZhang PSadayappan PCong JHutchings BBetz V(2013)Polyhedral-based data reuse optimization for configurable computingProceedings of the ACM/SIGDA international symposium on Field programmable gate arrays10.1145/2435264.2435273(29-38)Online publication date: 11-Feb-2013
https://dl.acm.org/doi/10.1145/2435264.2435273
Karfa CBanerjee KSarkar DMandal C(2013)Verification of Loop and Arithmetic Transformations of Array-Intensive BehaviorsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2013.227253632:11(1787-1800)Online publication date: 1-Nov-2013
https://dl.acm.org/doi/10.1109/TCAD.2013.2272536
Ding WKandemir MHu A(2012)Improving last level cache locality by integrating loop and data transformationsProceedings of the International Conference on Computer-Aided Design10.1145/2429384.2429398(65-72)Online publication date: 5-Nov-2012
https://dl.acm.org/doi/10.1145/2429384.2429398
Cong JZhang PZou YGroeneveld PSciuto DHassoun S(2012)Optimizing memory hierarchy allocation with loop transformations for high-level synthesisProceedings of the 49th Annual Design Automation Conference10.1145/2228360.2228586(1233-1238)Online publication date: 3-Jun-2012
https://dl.acm.org/doi/10.1145/2228360.2228586
Cong JZhang PZou YPhillips JHu AGraeb H(2011)Combined loop transformation and hierarchy allocation for data reuse optimizationProceedings of the International Conference on Computer-Aided Design10.5555/2132325.2132368(185-192)Online publication date: 7-Nov-2011
https://dl.acm.org/doi/10.5555/2132325.2132368
Karfa CBanerjee KSarkar DMandal C(2011)Equivalence Checking of Array-Intensive ProgramsProceedings of the 2011 IEEE Computer Society Annual Symposium on VLSI10.1109/ISVLSI.2011.61(156-161)Online publication date: 4-Jul-2011
https://dl.acm.org/doi/10.1109/ISVLSI.2011.61
Cong JZhang PZou Y(2011)Combined loop transformation and hierarchy allocation for data reuse optimizationProceedings of the 2011 IEEE/ACM International Conference on Computer-Aided Design10.1109/ICCAD.2011.6105324(185-192)Online publication date: 7-Nov-2011
https://dl.acm.org/doi/10.1109/ICCAD.2011.6105324
Verdoolaege SPalkovič MBruynooghe MJanssens GCatthoor F(2010)Experience with Widening Based Equivalence Checking in Realistic Multimedia SystemsJournal of Electronic Testing: Theory and Applications10.1007/s10836-009-5140-426:2(279-292)Online publication date: 1-Apr-2010
https://dl.acm.org/doi/10.1007/s10836-009-5140-4

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents