Incremental hierarchical memory size estimation for steering of loop transformations

Published: 01 September 2007

Abstract

Modern embedded multimedia and telecommunications systems need to store and access huge amounts of data. This becomes a critical factor for the overall energy consumption, area, and performance of the systems. Loop transformations are essential to improve the data access locality and regularity in order to optimally design or utilize a memory hierarchy. However, because they rely on abstract high-level cost functions, current loop transformation steering techniques do not take the memory platform sufficiently into account. They also usually produce only one final transformation solution. On the other hand, the loop transformation search space for real-life applications is huge, especially if the memory platform is not yet fully fixed. Using existing loop transformation techniques will therefore typically lead to suboptimal end-products. It is critical to find all interesting loop transformation instances. This can only be achieved by evaluating the effect of later design stages at the early loop transformation stage.
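As a small illustration of why loop transformations change memory requirements, the following sketch (not taken from the article) contrasts a producer/consumer loop pair with its fused form. The function names and the way the peak intermediate storage is counted are assumptions made purely for this example.

```python
def unfused(data):
    """Two separate loops: the whole intermediate array is live between them."""
    tmp = [2 * x for x in data]      # producer loop fills N intermediate values
    out = [t + 1 for t in tmp]       # consumer loop reads them afterwards
    return out, len(tmp)             # peak intermediate storage: N elements


def fused(data):
    """Loop fusion: each intermediate value is consumed right after it is produced."""
    out = []
    peak = 0
    for x in data:
        t = 2 * x                    # a single live intermediate value
        peak = max(peak, 1)
        out.append(t + 1)
    return out, peak                 # peak intermediate storage: 1 element


if __name__ == "__main__":
    print(unfused([1, 2, 3]))
    print(fused([1, 2, 3]))
```

Both versions compute the same result; only the lifetime of the intermediate data, and hence the storage it needs, differs.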
This article presents a fast incremental hierarchical memory-size requirement estimation technique. It estimates the influence of any given sequence of loop transformation instances on the mapping of application data onto a hierarchical memory platform. As the exact memory platform instantiation is often not yet defined at this high-level design stage, a platform-independent estimation is introduced with a Pareto curve output for each loop transformation instance. Comparison among the Pareto curves helps the designer, or a steering tool, to find all interesting loop transformation instances that might later lead to low-power data mapping for any of the many possible memory hierarchy instances. Initially, the source code is used as input for estimation. However, performing the estimation repeatedly from the source code is too slow for large search space exploration. An incremental approach, based on local updating of the previous result, is therefore used to handle sequences of different loop transformations. Experiments show that the initial approach takes a few seconds, which is two orders of magnitude faster than state-of-the-art solutions but still too costly to be performed interactively many times. The incremental approach typically takes just a few milliseconds, which is another two orders of magnitude faster than the initial approach. This speedup makes it possible, for the first time, to handle real-life industrial-size applications and get realistic feedback during loop transformation exploration.
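The comparison of Pareto curves can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's algorithm: each loop transformation instance is assumed to come with a list of hypothetical (memory size, access cost) points where lower is better on both axes, and an instance is treated as "interesting" if no other instance's curve is at least as good everywhere. The names `pareto_front`, `covers`, and `interesting`, and all numbers, are invented for the example.

```python
def pareto_front(points):
    """Return the non-dominated (size, cost) points, sorted by size."""
    front = []
    best_cost = float("inf")
    for size, cost in sorted(set(points)):
        if cost < best_cost:         # strictly better than every smaller-size point
            front.append((size, cost))
            best_cost = cost
    return front


def covers(front_a, front_b):
    """True if every point of front_b is matched or beaten by some point of front_a."""
    return all(
        any(sa <= sb and ca <= cb for sa, ca in front_a)
        for sb, cb in front_b
    )


def interesting(curves):
    """Names of instances whose Pareto curve is not strictly covered by another's."""
    fronts = {name: pareto_front(pts) for name, pts in curves.items()}
    keep = []
    for name, f in fronts.items():
        beaten = any(
            other != name and covers(g, f) and not covers(f, g)
            for other, g in fronts.items()
        )
        if not beaten:
            keep.append(name)
    return keep


if __name__ == "__main__":
    # Hypothetical estimates for three transformation instances.
    curves = {
        "original": [(64, 900), (256, 700), (1024, 650)],
        "fused": [(32, 800), (128, 500), (512, 450)],
        "tiled": [(16, 1000), (2048, 300)],
    }
    print(interesting(curves))
```

In this made-up data set, "original" is covered everywhere by "fused", while "fused" and "tiled" each win in a different size range, so both remain candidates until the memory platform is fixed.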

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 12, Issue 4
September 2007
449 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/1278349

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data optimization
  2. code transformation
  3. high-level synthesis
  4. memory architecture exploration
  5. memory size estimation
