ABSTRACT
This paper exploits locality among load instructions that are being processed contemporaneously within a processor to reduce the number of accesses to the memory hierarchy. A simple technique learns and predicts the number of contemporaneous accesses to a region of memory and classifies each dynamic load as either a normal or a fat load. A fat load brings additional data into Contemporaneous Load Access Registers (CLARs), from which other contemporaneous loads can be serviced without accessing the L1 cache. Experimental results indicate that with fat loads and CLARs totaling 4 or 8 cache lines (256 or 512 bytes), the number of L1 cache accesses can be reduced by 50-60%, resulting in significant energy savings for L1 cache operations. Further, in several cases the reduced latency of loads serviced from a CLAR leads to earlier resolution of some mispredicted branches and a reduction in the number of wrong-path instructions, especially loads.
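The mechanism described above can be illustrated with a minimal behavioral sketch. This is not the paper's implementation: the predictor organization, counter widths, region size, and replacement policy below are all illustrative assumptions. A per-PC saturating counter builds confidence that loads at a given instruction touch a region repeatedly; once confident, the next miss at that PC is treated as a fat load that fills a CLAR with the whole region, and subsequent loads to that region hit in the CLAR instead of accessing the L1.

```python
# Hypothetical sketch of fat loads and CLARs, as described in the abstract.
# Region size, CLAR count, the per-PC saturating-counter predictor, and FIFO
# CLAR replacement are illustrative assumptions, not the paper's design.

CLAR_SIZE = 256        # bytes per CLAR (4 cache lines of 64 B)
NUM_CLARS = 2          # number of CLARs
FAT_THRESHOLD = 2      # predictor confidence required to issue a fat load

class CLARSim:
    def __init__(self):
        self.clars = {}         # region base address -> resident flag
        self.predictor = {}     # load PC -> 2-bit saturating counter
        self.l1_accesses = 0
        self.clar_hits = 0

    def load(self, pc, addr):
        region = addr - (addr % CLAR_SIZE)
        if region in self.clars:
            self.clar_hits += 1          # serviced from a CLAR; no L1 access
            return
        self.l1_accesses += 1            # normal load: access the L1 cache
        ctr = self.predictor.get(pc, 0)
        if ctr >= FAT_THRESHOLD:
            # "Fat" load: bring the entire region into a CLAR.
            if len(self.clars) >= NUM_CLARS:
                self.clars.pop(next(iter(self.clars)))  # evict oldest (FIFO)
            self.clars[region] = True
        # Train the predictor: repeated L1 accesses from the same load PC
        # build confidence that a fat load would be worthwhile.
        self.predictor[pc] = min(ctr + 1, 3)

# A loop streaming through a 1 KB array with 8-byte loads (128 loads total)
# from a single load instruction exhibits the contemporaneous region locality
# the technique targets.
sim = CLARSim()
for i in range(0, 1024, 8):
    sim.load(pc=0x400123, addr=i)
```

Running the sketch on this trace, only the first few loads (and the first load of each new region) reach the L1; the remaining loads are serviced from the CLARs, mirroring the L1 access reduction the abstract reports.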