ABSTRACT
Even the simplest hardware, running the simplest programs, can behave in the strangest of ways. Tracking down the cause of a performance anomaly without a processor's complete hardware reference is a prime example of black-box architectural exploration. Our mystery began when doubling the work of a simple benchmark program, run on a single core of Tilera's TILEPro64 processor, did not double the number of cycles consumed. After we ruled out differing optimization levels between the two programs as the cause, a cycle-accurate simulation attributed the sub-optimal performance to an abnormally high number of L1 data cache misses. Further investigation showed that the processor stalled on every Read-After-Write (RAW) instruction sequence that met two conditions: 1) zero or one instructions separated the write from the read, and 2) the read and the write targeted distinct memory locations that share an L1 cache line. We call this performance pitfall a RAW hiccup. We describe two countermeasures that sidestep it: memory padding and the explicit introduction of pipeline bubbles.
This experience paper serves as a useful troubleshooting guide for uncovering anomalous performance issues when the hardware design under study is unavailable.
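The two trigger conditions and the two countermeasures can be made concrete with a small model. The sketch below is illustrative only and is not the paper's tooling: `MemOp`, `find_raw_hiccups`, and `padded_stride` are hypothetical names, the 16-byte `LINE_SIZE` is an assumed value that should be checked against the processor manual, and comparing byte addresses is a simplification of "distinct memory locations".

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed L1 data cache line size in bytes (not taken from the abstract).
LINE_SIZE = 16


@dataclass(frozen=True)
class MemOp:
    """One instruction in a trace (hypothetical model)."""
    kind: str      # "W" = store, "R" = load, anything else = non-memory op
    addr: int = 0  # byte address touched by a load or store


def same_line(a: int, b: int, line_size: int = LINE_SIZE) -> bool:
    """True if both byte addresses fall in the same cache line."""
    return a // line_size == b // line_size


def find_raw_hiccups(ops: List[MemOp],
                     line_size: int = LINE_SIZE) -> List[Tuple[int, int]]:
    """Return (write_index, read_index) pairs matching both conditions.

    Condition 1: 0 or 1 instructions sit between the write and the read,
                 so the read is at index i + 1 or i + 2.
    Condition 2: the two accesses use distinct addresses in one cache line.
    """
    hits = []
    for i, w in enumerate(ops):
        if w.kind != "W":
            continue
        for j in (i + 1, i + 2):
            if j >= len(ops):
                break
            r = ops[j]
            if (r.kind == "R" and r.addr != w.addr
                    and same_line(w.addr, r.addr, line_size)):
                hits.append((i, j))
    return hits


def padded_stride(elem_size: int, line_size: int = LINE_SIZE) -> int:
    """Memory padding: smallest multiple of line_size that is >= elem_size,
    so consecutive elements start in different cache lines."""
    return -(-elem_size // line_size) * line_size
```

In this model, a store to address 0 followed directly by a load from address 4 is flagged. Placing two unrelated instructions between them (a pipeline bubble), or spacing the elements `padded_stride(4)` bytes apart (memory padding), removes the match.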
Index Terms
- When spatial and temporal locality collide: the case of the missing cache hits