research-article

When spatial and temporal locality collide: the case of the missing cache hits

Authors:
Mattias De Wael, Vrije Universiteit Brussel, Brussel, Belgium
David Ungar, Watson Research Center, New York, USA
Tom Van Cutsem, Vrije Universiteit Brussel, Brussel, Belgium

ICPE '13: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, April 2013, Pages 63–70, https://doi.org/10.1145/2479871.2479883

Published: 21 April 2013


ABSTRACT

Even the simplest hardware, running the simplest programs, can behave in the strangest of ways. Tracking down the cause of a performance anomaly without the complete hardware reference of a processor is a prime example of black-box architectural exploration. The mystery surfaced when doubling the work of a simple benchmark program, run on a single core of Tilera's TILEPro64 processor, did not double the number of cycles consumed. After we ruled out differences in optimization level between the two programs, a cycle-accurate simulation attributed the sub-optimal performance to an abnormally high number of L1 data cache misses. Further investigation showed that the processor stalled on every Read-After-Write (RAW) instruction sequence that met the following two conditions: 1) there are 0 or 1 instructions between the write and the read instruction, and 2) the read and the write instructions target distinct memory locations that share an L1 cache line. We call this performance pitfall a RAW hiccup. We describe two countermeasures, memory padding and the explicit introduction of pipeline bubbles, that sidestep the RAW hiccup.

This experience paper serves as a useful troubleshooting guide for uncovering anomalous performance issues when the hardware design under study is unavailable.
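
The memory-padding countermeasure named in the abstract can be illustrated with a minimal C sketch. It is not the authors' code: the line size `L1_LINE_BYTES`, the helper `same_l1_line`, and both struct layouts are hypothetical names chosen for illustration, and the value 64 is only a placeholder assumption that should be replaced with the L1 data cache line size from the processor manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder L1 data cache line size in bytes.
 * Replace with the value documented for the target processor. */
#define L1_LINE_BYTES 64u

/* Hypothetical helper: returns 1 when both addresses fall inside the
 * same L1_LINE_BYTES-sized block, i.e. condition 2 of the RAW hiccup. */
static int same_l1_line(const void *a, const void *b)
{
    return ((uintptr_t)a / L1_LINE_BYTES) == ((uintptr_t)b / L1_LINE_BYTES);
}

/* Unpadded layout: the two fields are adjacent, so a write to `written`
 * followed closely by a read of `read` will usually hit the same line. */
struct counters_packed {
    int32_t written;
    int32_t read;
};

/* Padded layout: `read` starts exactly one line after `written`,
 * so the two fields can never share a line. */
struct counters_padded {
    int32_t written;
    char    pad[L1_LINE_BYTES - sizeof(int32_t)];
    int32_t read;
};

/* Compile-time check that the padding does what the comment claims. */
static_assert(offsetof(struct counters_padded, read) == L1_LINE_BYTES,
              "read must start one full line after written");
```

Padding trades memory for the guarantee that the written and the read location sit on different lines. The other countermeasure from the abstract, inserting pipeline bubbles, instead addresses condition 1 by placing more instructions between the write and the read.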



Published in
ICPE '13: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
April 2013
446 pages
ISBN: 9781450316361
DOI: 10.1145/2479871
Editor: Seetharami Seelam (IBM T.J. Watson Research Center, USA)
General Chairs: Petr Tůma (Charles University, Czech Republic), Giuliano Casale (Imperial College London, UK)
Program Chairs: Tony Field (Imperial College London, UK), José Nelson Amaral (University of Alberta, Canada)
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
l1 data cache
padding
pipeline bubble
tilepro64

Acceptance Rates
ICPE '13 Paper Acceptance Rate: 28 of 64 submissions, 44%
Overall Acceptance Rate: 252 of 851 submissions, 30%