
Fat Loads: Exploiting Locality Amongst Contemporaneous Load Operations to Optimize Cache Accesses

Published: 17 October 2021

ABSTRACT

This paper exploits locality among load instructions that are in flight contemporaneously within a processor to reduce the number of accesses to the memory hierarchy. A simple technique learns and predicts the number of contemporaneous accesses to a region of memory and classifies each dynamic load as either a normal load or a fat load. A fat load brings additional data into Contemporaneous Load Access Registers (CLARs), from which other contemporaneous loads can be serviced without accessing the L1 cache. Experimental results indicate that with fat loads, along with 4 or 8 cache-line-sized CLARs (256 or 512 bytes), the number of L1 cache accesses can be reduced by 50-60%, resulting in significant energy savings for L1 cache operations. Further, in several cases the reduced latency of loads serviced from a CLAR leads to earlier resolution of some mispredicted branches and a reduction in the number of wrong-path instructions, especially loads.
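The mechanism described above can be illustrated with a minimal sketch. The class name, the FIFO replacement policy, and the counter fields below are illustrative assumptions, not the paper's actual design: a load predicted "fat" copies its whole cache line into a CLAR, and later contemporaneous loads to that line are serviced from the CLAR instead of the L1 cache.

```python
# Hypothetical sketch of the fat-load / CLAR idea from the abstract.
# A "fat" load fills a cache-line-sized CLAR on its way back from the
# L1; subsequent loads to the same line hit in a CLAR and skip the L1
# access entirely. Names and FIFO replacement are assumptions.

LINE_SIZE = 64      # bytes per cache line (and per CLAR entry)
NUM_CLARS = 4       # 4 cache-line-sized CLARs, i.e. 256 bytes total

class ClarFile:
    def __init__(self):
        self.lines = []          # FIFO of line addresses resident in CLARs
        self.l1_accesses = 0     # loads that had to access the L1 cache
        self.clar_hits = 0       # loads serviced from a CLAR instead

    def load(self, addr, fat=False):
        line = addr // LINE_SIZE
        if line in self.lines:
            self.clar_hits += 1      # serviced without touching the L1
            return
        self.l1_accesses += 1        # normal or fat load accesses the L1
        if fat:                      # a fat load also fills a CLAR
            if len(self.lines) == NUM_CLARS:
                self.lines.pop(0)    # simple FIFO replacement
            self.lines.append(line)

clars = ClarFile()
# First load to the region is predicted fat; the next seven 8-byte
# loads fall in the same 64-byte line and hit in the CLAR.
for i, off in enumerate(range(0, LINE_SIZE, 8)):
    clars.load(0x1000 + off, fat=(i == 0))
print(clars.l1_accesses, clars.clar_hits)  # prints "1 7"
```

In this toy scenario one fat load replaces eight L1 accesses with one, which is the kind of reduction the abstract's 50-60% figure reflects across whole benchmarks.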


  • Published in

    MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
    October 2021
    1322 pages
    ISBN:9781450385572
    DOI:10.1145/3466752

    Copyright © 2021 ACM


    Publisher: Association for Computing Machinery, New York, NY, United States



    Acceptance rate: 484 of 2,242 submissions (22%)
