DOI: 10.1145/3330345.3330363
research-article

Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior

Published: 26 June 2019

ABSTRACT

Modern workloads such as graph analytics, sparse matrix multiplication, and in-memory key-value stores use very large datasets and typically have non-uniform memory access patterns which defy traditional concepts of locality. Moreover, many of these algorithms simultaneously use multiple data structures that have very distinct access patterns to the corresponding pages, leading to heterogeneity in TLB behavior. Our intuition suggests that these two factors make it important to architect a heterogeneity-aware TLB hierarchy.

Our results confirm the existence of heterogeneity in TLB behavior, where a few pages have high reuse but poor temporal locality. These pages are responsible for a significant percentage of the TLB misses (e.g., for the Canneal kernel, over 15% of the TLB misses result from only 17 pages, which is 0.04% of the total number of pages). In this paper, we propose Diligent TLBs (Di-TLBs), a novel hardware-software co-design for TLBs that identifies such delinquent page mappings by tracking their reuse behavior and pins them in the TLBs to reduce misses. We show that Di-TLBs reduce TLB misses by up to 24.93% on average while improving performance by up to 9.13% on average for a collection of memory-intensive workloads.
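The effect described above can be illustrated with a toy simulation. The sketch below is not the Di-TLB design: it assumes a small fully associative LRU TLB, a synthetic trace, and a software-chosen pinned set, and every name in it (`simulate_tlb`, `pinned`, the trace layout) is hypothetical. It only shows how a page with high reuse but a reuse distance larger than the TLB capacity can dominate misses, and how reserving a slot for it removes those misses.

```python
from collections import Counter, OrderedDict


def simulate_tlb(trace, capacity, pinned=frozenset()):
    """Return per-page miss counts for a toy fully associative LRU TLB.

    Assumption: each page in `pinned` gets a reserved slot and is never
    evicted once loaded; the remaining slots are managed with plain LRU.
    """
    pinned = frozenset(pinned)
    lru_slots = capacity - len(pinned)
    if lru_slots < 1:
        raise ValueError("pinned set must leave at least one LRU slot")

    loaded_pins = set()   # pinned pages that have already been brought in
    lru = OrderedDict()   # unpinned entries, least recently used first
    misses = Counter()

    for page in trace:
        if page in loaded_pins:
            continue                      # hit on a pinned entry
        if page in lru:
            lru.move_to_end(page)         # hit: refresh recency
            continue
        misses[page] += 1                 # miss: page must be loaded
        if page in pinned:
            loaded_pins.add(page)
        else:
            lru[page] = None
            if len(lru) > lru_slots:
                lru.popitem(last=False)   # evict the least recently used entry
    return misses


# Synthetic trace: page 0 is reused 100 times, but 8 distinct streaming
# pages are touched between consecutive reuses, more than a 4-entry TLB holds.
trace = []
for i in range(100):
    trace.append(0)
    trace.extend(range(1 + 8 * i, 9 + 8 * i))

baseline = simulate_tlb(trace, capacity=4)
hot_page, hot_misses = baseline.most_common(1)[0]   # the most delinquent page
with_pin = simulate_tlb(trace, capacity=4, pinned={hot_page})

print(sum(baseline.values()), hot_page, hot_misses)  # 900 0 100
print(sum(with_pin.values()))                         # 801
```

In this trace one page out of 801 causes 100 of the 900 baseline misses; pinning it leaves only its single cold miss. A real mechanism would have to detect such pages online and bound how many entries are pinned, which this sketch does not attempt.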

Published in

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345

Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 584 of 2,055 submissions (28%)