DOI: 10.1145/3330345.3330363
research-article

Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior

Published: 26 June 2019

Abstract

Modern workloads such as graph analytics, sparse matrix multiplication, and in-memory key-value stores operate on very large datasets and typically exhibit non-uniform memory access patterns that defy traditional notions of locality. Moreover, many of these algorithms simultaneously use multiple data structures whose pages have very distinct access patterns, leading to heterogeneity in TLB behavior. Our intuition is that these two factors make it important to architect a heterogeneity-aware TLB hierarchy.
Our results confirm the existence of heterogeneity in TLB behavior: a few pages have high reuse but poor temporal locality, and these pages are responsible for a significant percentage of the TLB misses (e.g., for the Canneal kernel, over 15% of the TLB misses result from only 17 pages, which is 0.04% of the total number of pages). In this paper, we propose Diligent TLBs (Di-TLBs), a novel hardware-software co-design for TLBs that identifies such delinquent page mappings by tracking their reuse behavior and pins them in the TLBs to reduce misses. We show that Di-TLBs reduce TLB misses by up to 24.93% on average while improving performance by up to 9.13% on average for a collection of memory-intensive workloads.
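
To make the idea of pinning delinquent mappings concrete, the following is a minimal toy sketch, not the Di-TLB design itself. The abstract does not give the mechanism's details, so everything here is an assumption made for illustration: a fully associative TLB with LRU replacement, a per-page miss counter standing in for reuse tracking, and the hypothetical names `PinningTLB`, `miss_threshold`, `max_pinned`, and `run_trace`.

```python
from collections import Counter, OrderedDict


class PinningTLB:
    """Toy fully associative LRU TLB that can pin a few frequently missing pages.

    Assumption for illustration only: a page counts as "delinquent" once it
    has missed `miss_threshold` times. The real Di-TLB policy may differ.
    """

    def __init__(self, entries, max_pinned, miss_threshold):
        if not 0 <= max_pinned < entries:
            raise ValueError("max_pinned must be in [0, entries)")
        self.entries = entries
        self.max_pinned = max_pinned
        self.miss_threshold = miss_threshold
        self.lru = OrderedDict()      # unpinned mappings, oldest first
        self.pinned = set()           # mappings exempt from replacement
        self.miss_counts = Counter()  # per-page miss history
        self.hits = 0
        self.misses = 0

    def access(self, vpn):
        """Look up a virtual page number; return True on a TLB hit."""
        if vpn in self.pinned:
            self.hits += 1
            return True
        if vpn in self.lru:
            self.lru.move_to_end(vpn)  # refresh recency
            self.hits += 1
            return True

        self.misses += 1
        self.miss_counts[vpn] += 1
        delinquent = self.miss_counts[vpn] >= self.miss_threshold
        if delinquent and len(self.pinned) < self.max_pinned:
            # Pin the mapping; pinned entries still consume TLB capacity.
            self.pinned.add(vpn)
            while len(self.lru) + len(self.pinned) > self.entries:
                self.lru.popitem(last=False)
            return False

        if len(self.lru) + len(self.pinned) >= self.entries:
            self.lru.popitem(last=False)  # evict least recently used
        self.lru[vpn] = True
        return False


def run_trace(max_pinned):
    """Replay a synthetic trace: one hot page separated by streams of cold pages."""
    tlb = PinningTLB(entries=4, max_pinned=max_pinned, miss_threshold=2)
    cold = 1
    for _ in range(100):
        tlb.access(0)           # hot page: high reuse, poor temporal locality
        for _ in range(4):      # four never-reused pages flush the LRU state
            tlb.access(cold)
            cold += 1
    return tlb


baseline = run_trace(max_pinned=0)  # plain LRU, pinning disabled
pinning = run_trace(max_pinned=1)   # allow one pinned mapping
print("LRU misses:", baseline.misses)
print("Pinning misses:", pinning.misses)
```

In this synthetic trace the hot page is reused 100 times, but each reuse is separated by enough distinct pages to evict it under LRU, so the baseline misses on all 500 accesses. With one pinned entry, the hot page is pinned on its second miss and hits thereafter, leaving 402 misses. The trace is constructed to show the effect and says nothing about the gains reported for real workloads.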


Cited By

  • (2021) Morrigan: A Composite Instruction TLB Prefetcher. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1138-1153. DOI: 10.1145/3466752.3480049. Online publication date: 18-Oct-2021
  • (2020) Enhancing and exploiting contiguity for fast memory virtualization. Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, pp. 515-528. DOI: 10.1109/ISCA45697.2020.00050. Online publication date: 30-May-2020
  • (2020) Unified-TP: A Unified TLB and Page Table Cache Structure for Efficient Address Translation. 2020 IEEE 38th International Conference on Computer Design (ICCD), pp. 255-262. DOI: 10.1109/ICCD50377.2020.00052. Online publication date: Oct-2020

    Published In

    ICS '19: Proceedings of the ACM International Conference on Supercomputing
    June 2019
    533 pages
    ISBN:9781450360791
    DOI:10.1145/3330345
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. translation lookaside buffer
    2. virtual memory

    Qualifiers

    • Research-article

    Conference

    ICS '19

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
