ABSTRACT
This paper exploits locality among load instructions that are being processed contemporaneously within a processor to reduce the number of accesses to the memory hierarchy. A simple technique learns and predicts the number of contemporaneous accesses to a region of memory and classifies each dynamic load as either a normal or a fat load. A fat load brings additional data into Contemporaneous Load Access Registers (CLARs), from which other contemporaneous loads can be serviced without accessing the L1 cache. Experimental results indicate that with fat loads and CLARs totaling 4 or 8 cache lines (256 or 512 bytes), the number of L1 cache accesses can be reduced by 50-60%, resulting in significant energy savings for L1 cache operations. Further, in several cases the reduced latency of loads serviced from a CLAR leads to earlier resolution of some mispredicted branches and a reduction in the number of wrong-path instructions, especially loads.
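The mechanism described above can be illustrated with a minimal behavioral sketch. This is not the paper's implementation: the predictor organization, counter widths, region size, and replacement policy below are all illustrative assumptions. A per-PC saturating counter builds confidence that loads at a given instruction touch a region repeatedly; once confident, the next miss at that PC is treated as a fat load that fills a CLAR with the whole region, and subsequent loads to that region hit in the CLAR instead of accessing the L1.

```python
# Hypothetical sketch of fat loads and CLARs, as described in the abstract.
# Region size, CLAR count, the per-PC saturating-counter predictor, and FIFO
# CLAR replacement are illustrative assumptions, not the paper's design.

CLAR_SIZE = 256        # bytes per CLAR (4 cache lines of 64 B)
NUM_CLARS = 2          # number of CLARs
FAT_THRESHOLD = 2      # predictor confidence required to issue a fat load

class CLARSim:
    def __init__(self):
        self.clars = {}         # region base address -> resident flag
        self.predictor = {}     # load PC -> 2-bit saturating counter
        self.l1_accesses = 0
        self.clar_hits = 0

    def load(self, pc, addr):
        region = addr - (addr % CLAR_SIZE)
        if region in self.clars:
            self.clar_hits += 1          # serviced from a CLAR; no L1 access
            return
        self.l1_accesses += 1            # normal load: access the L1 cache
        ctr = self.predictor.get(pc, 0)
        if ctr >= FAT_THRESHOLD:
            # "Fat" load: bring the entire region into a CLAR.
            if len(self.clars) >= NUM_CLARS:
                self.clars.pop(next(iter(self.clars)))  # evict oldest (FIFO)
            self.clars[region] = True
        # Train the predictor: repeated L1 accesses from the same load PC
        # build confidence that a fat load would be worthwhile.
        self.predictor[pc] = min(ctr + 1, 3)

# A loop streaming through a 1 KB array with 8-byte loads (128 loads total)
# from a single load instruction exhibits the contemporaneous region locality
# the technique targets.
sim = CLARSim()
for i in range(0, 1024, 8):
    sim.load(pc=0x400123, addr=i)
```

Running the sketch on this trace, only the first few loads (and the first load of each new region) reach the L1; the remaining loads are serviced from the CLARs, mirroring the L1 access reduction the abstract reports.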