An automatic code overlaying technique for multicores with explicitly-managed memory hierarchies

Authors:

Jaejin Lee

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Pages 219 - 229

https://doi.org/10.1145/2259016.2259045

Published: 31 March 2012

Abstract

Explicitly-managed memory hierarchies, in which a hierarchy of distinct memories is exposed to the programmer and managed explicitly by software, are found not only in typical embedded processors but also in a class of high-performance multicore architectures. Code overlay techniques have been widely used to execute a program whose code is larger than the available code memory in the system. To generate an efficient overlaid executable with maximum storage savings and minimum performance overhead, the overlay structure must be designed carefully. In this paper, we propose an efficient code overlay technique that automatically generates an overlay structure for a given memory size for multicores with explicitly-managed memory hierarchies. We observe that finding an efficient overlay structure with minimum memory copying and run-time check overhead is similar to the problem of finding a code placement with minimum conflict misses in the instruction cache. Our algorithm exploits temporal-ordering information between functions during program execution, which is obtained by profiling the program. Experimental results with 11 parallel applications on the Cell BE processor indicate that our approach is effective and promising.
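To make the analogy with conflict-aware code placement concrete, the following is a minimal Python sketch and not the algorithm from the paper. It assumes a profile in the form of a flat function-call trace, counts how often two functions execute close together, and greedily assigns functions to overlay regions so that functions that alternate frequently do not share a region (and therefore do not keep evicting each other). The names `temporal_conflicts`, `assign_regions`, `footprint`, and the `window` parameter are all hypothetical, and the sketch fixes the number of regions rather than searching for a structure that fits a given memory size.

```python
from collections import defaultdict


def temporal_conflicts(trace, window=4):
    """Weight each function pair by how often both appear within
    `window` consecutive entries of the profiled call trace."""
    weights = defaultdict(int)
    for i, f in enumerate(trace):
        # Distinct functions seen in the window just before position i.
        for g in set(trace[max(0, i - window + 1):i]):
            if g != f:
                weights[frozenset((f, g))] += 1
    return weights


def added_conflict(f, region, weights):
    """Conflict weight incurred by putting f in a region with these members."""
    return sum(weights.get(frozenset((f, g)), 0) for g in region)


def assign_regions(sizes, weights, num_regions):
    """Greedy heuristic (an assumption, not the paper's method): place the
    most conflict-prone functions first, each into the cheapest region."""
    regions = [[] for _ in range(num_regions)]
    total = {f: sum(w for pair, w in weights.items() if f in pair)
             for f in sizes}
    for f in sorted(sizes, key=lambda name: (-total[name], name)):
        best = min(range(num_regions),
                   key=lambda r: (added_conflict(f, regions[r], weights), r))
        regions[best].append(f)
    return regions


def footprint(regions, sizes):
    """Each region must be as large as its largest member."""
    return sum(max((sizes[f] for f in r), default=0) for r in regions)
```

In this sketch, two functions that call each other in a tight loop end up in different regions, while functions that run in separate program phases can share one; the code memory needed is the sum of the largest function in each region rather than the sum of all function sizes.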

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

March 2012

285 pages

ISBN: 9781450312066

DOI: 10.1145/2259016

General Chairs: Carol Eidt (Microsoft), Anne Holler (VMware)

Program Chairs: Uma Srinivasan (Intel), Saman Amarasinghe (MIT)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Research Foundation of Korea

Conference

CGO '12

CGO '12: Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 31 - April 4, 2012

San Jose, California

Acceptance Rates

CGO '12 paper acceptance rate: 26 of 90 submissions (29%)

Overall acceptance rate: 312 of 1,061 submissions (29%)
