DOI: 10.1145/2259016.2259045
research-article

An automatic code overlaying technique for multicores with explicitly-managed memory hierarchies

Published: 31 March 2012

Abstract

Explicitly-managed memory hierarchies, in which a hierarchy of distinct memories is exposed to the programmer and managed explicitly by software, are found not only in typical embedded processors but also in a class of high-performance multicore architectures. Code overlay techniques have been widely used to execute a program whose code is larger than the code memory available in the system. To generate an efficient overlaid executable that maximizes storage savings while minimizing performance overhead, the overlay structure must be designed carefully. In this paper, we propose an efficient code overlay technique that automatically generates an overlay structure for a given memory size on multicores with explicitly-managed memory hierarchies. We observe that finding an efficient overlay structure with minimum memory-copying and run-time check overhead is similar to the problem of finding a code placement with minimum conflict misses in the instruction cache. Our algorithm exploits temporal-ordering information between functions during program execution, which is obtained by profiling the program. Experimental results with 11 parallel applications on the Cell BE processor indicate that our approach is effective and promising.
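
To make the analogy with instruction-cache placement concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a profile is available as a flat list of executed function names, uses a simple sliding-window count as a stand-in for temporal-ordering information, and applies a greedy heuristic to map functions to overlay regions. All names (`temporal_weights`, `assign_regions`, `count_loads`, `footprint`), the window size, and the example trace and sizes are hypothetical.

```python
from collections import defaultdict


def temporal_weights(trace, window=2):
    """Count how often two distinct functions run within `window` steps of
    each other. A simplified stand-in for temporal-ordering information."""
    weights = defaultdict(int)
    for i, f in enumerate(trace):
        for g in set(trace[max(0, i - window):i]):
            if g != f:
                weights[frozenset((f, g))] += 1
    return weights


def assign_regions(funcs, weights, num_regions):
    """Greedy heuristic (an assumption, not the paper's method): place each
    function in the region whose current members interleave with it least."""
    def total(f):
        return sum(w for pair, w in weights.items() if f in pair)

    regions = [[] for _ in range(num_regions)]
    for f in sorted(funcs, key=lambda f: -total(f)):
        def cost(region):
            return sum(weights.get(frozenset((f, g)), 0) for g in region)
        min(regions, key=cost).append(f)
    return regions


def count_loads(trace, regions):
    """Simulate execution: a region holds one function at a time, and calling
    a non-resident function costs one copy into local memory."""
    region_of = {f: i for i, region in enumerate(regions) for f in region}
    resident, loads = {}, 0
    for f in trace:
        r = region_of[f]
        if resident.get(r) != f:
            resident[r] = f
            loads += 1
    return loads


def footprint(regions, sizes):
    """Code memory needed: each region is as large as its largest function."""
    return sum(max(sizes[f] for f in region) for region in regions if region)


# Hypothetical profile: a and b alternate, then c and d alternate.
trace = ["a", "b"] * 5 + ["c", "d"] * 5
sizes = {"a": 4, "b": 6, "c": 3, "d": 5}

regions = assign_regions(sizes, temporal_weights(trace), num_regions=2)
print(regions, count_loads(trace, regions), footprint(regions, sizes))
print(count_loads(trace, [["a", "b"], ["c", "d"]]))  # naive grouping
```

In this toy trace, functions that alternate closely in time end up in different regions, so they do not evict each other, while the footprint stays below the sum of all function sizes. A real overlay generator would also have to account for the run-time check overhead mentioned in the abstract, which this sketch ignores.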

    Published In

    CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
    March 2012
    285 pages
    ISBN:9781450312066
    DOI:10.1145/2259016
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. code overlays
    2. temporal ordering
    3. temporal-relationship graph

    Qualifiers

    • Research-article

    Conference

    CGO '12

    Acceptance Rates

    CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;
    Overall Acceptance Rate 312 of 1,061 submissions, 29%
