ABSTRACT
Recent research in embedded computing indicates that packing multiple processor cores on the same die is an effective way of utilizing the ever-increasing number of transistors. Placing multiple cores on a single die reduces on-chip communication costs (in terms of both execution cycles and power consumption) between the processor cores, costs that are traditionally very high in conventional high-performance parallel architectures such as SMPs. On the negative side, however, this tighter integration puts even greater pressure on off-chip accesses to the memory system, making the minimization of off-chip accesses a critical optimization goal. This paper discusses a compiler-based solution to this problem for embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between intra-processor data reuse and inter-processor data reuse. The first captures data reuse that occurs across loop iterations assigned to the same processor, whereas the second represents data reuse that takes place across loop iterations assigned to different processors. The proposed approach optimizes inter-processor reuse by carefully reorganizing the loop iterations of each processor, taking into account how data elements are shared across processors. The goal is to ensure that different processors access shared data within a short period of time, so that the data can be captured in the on-chip memory space at the time of the reuse. The paper also presents an evaluation of the proposed optimization and compares it to an alternative scheme that optimizes data locality for each processor in isolation. The results obtained by applying our implementation to eight loop-intensive benchmark codes from the embedded computing domain show that our approach improves over this alternative scheme by 15.6% on average.
Index Terms
- Optimizing inter-processor data locality on embedded chip multiprocessors