Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior

Authors:

Hussein Elnawawy,

Rangeen Basu Roy Chowdhury,

Gregory T. Byrd

ICS '19: Proceedings of the ACM International Conference on Supercomputing

Pages 195 - 205

https://doi.org/10.1145/3330345.3330363

Published: 26 June 2019

Abstract

Modern workloads such as graph analytics, sparse matrix multiplication, and in-memory key-value stores use very large datasets and typically have non-uniform memory access patterns which defy traditional concepts of locality. Moreover, many of these algorithms simultaneously use multiple data structures that have very distinct access patterns to the corresponding pages, leading to heterogeneity in TLB behavior. Our intuition suggests that these two factors make it important to architect a heterogeneity-aware TLB hierarchy.

Our results confirm the existence of heterogeneity in TLB behavior, where a few pages have high reuse but poor temporal locality. These pages are responsible for a significant percentage of the TLB misses (e.g., for the Canneal kernel, over 15% of the TLB misses result from only 17 pages, which is 0.04% of the total number of pages). In this paper, we propose Diligent TLBs (Di-TLBs), a novel hardware-software co-design for TLBs that identifies such delinquent page mappings by tracking their reuse behavior and pins them in the TLBs to reduce misses. We show that Di-TLBs reduce TLB misses by up to 24.93% on average while improving performance by up to 9.13% on average for a collection of memory-intensive workloads.
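To make the idea of pinning delinquent mappings concrete, the following toy simulation may help. It is only a sketch under stated assumptions and is not the Di-TLB design: the abstract does not describe the tracking hardware or the software interface, so the class `ToyTLB`, the per-page miss counter, the `miss_threshold` and `max_pinned` parameters, and the synthetic trace are all hypothetical. The model is a fully associative LRU TLB in which a page that misses repeatedly is pinned so that streaming accesses to other pages cannot evict it.

```python
from collections import Counter, OrderedDict


class ToyTLB:
    """Toy fully associative LRU TLB with optional pinning (illustrative only)."""

    def __init__(self, entries, max_pinned=0, miss_threshold=2):
        # Hypothetical parameters; at least one entry must stay unpinned.
        assert 0 <= max_pinned < entries
        self.entries = entries
        self.max_pinned = max_pinned
        self.miss_threshold = miss_threshold
        self.lru = OrderedDict()      # unpinned pages, least recently used first
        self.pinned = set()           # pages exempt from replacement
        self.miss_counts = Counter()  # assumed stand-in for reuse tracking
        self.misses = 0

    def access(self, page):
        """Return True on a hit, False on a miss."""
        if page in self.pinned:
            return True
        if page in self.lru:
            self.lru.move_to_end(page)
            return True
        self.misses += 1
        self.miss_counts[page] += 1
        if (len(self.pinned) < self.max_pinned
                and self.miss_counts[page] >= self.miss_threshold):
            self.pinned.add(page)     # treat the page as delinquent and pin it
        else:
            self.lru[page] = None
        # Pinned and unpinned pages share the same capacity.
        while len(self.pinned) + len(self.lru) > self.entries:
            self.lru.popitem(last=False)
        return False


def make_trace(rounds=100, cold_per_round=8):
    """One hot page (0) revisited only after a burst of never-reused pages."""
    trace = []
    next_cold = 1
    for _ in range(rounds):
        trace.append(0)
        for _ in range(cold_per_round):
            trace.append(next_cold)
            next_cold += 1
    return trace


def count_misses(trace, entries, max_pinned=0, miss_threshold=2):
    tlb = ToyTLB(entries, max_pinned, miss_threshold)
    for page in trace:
        tlb.access(page)
    return tlb.misses


if __name__ == "__main__":
    trace = make_trace()
    print("LRU only:   ", count_misses(trace, entries=4))
    print("with 1 pin: ", count_misses(trace, entries=4, max_pinned=1))
```

In this synthetic trace, page 0 has high reuse but poor temporal locality: eight other pages are touched between its consecutive uses, so a 4-entry LRU TLB evicts it every time and all 900 accesses miss. Allowing one pinned entry brings the count down to 802, because page 0 is pinned after its second miss and hits from then on. The numbers describe this toy trace only and say nothing about the workloads evaluated in the paper.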

Cited By

Vavouliotis G, Alvarez L, Grot B, Jiménez D, Casas M. (2021). Morrigan: A Composite Instruction TLB Prefetcher. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 1138-1153. Online publication date: 18-Oct-2021. https://dl.acm.org/doi/10.1145/3466752.3480049

Alverti C, Psomadakis S, Karakostas V, Gandhi J, Nikas K, Goumas G, Koziris N, Martínez J, Duato J, Eeckhout L. (2020). Enhancing and exploiting contiguity for fast memory virtualization. Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, 515-528. Online publication date: 30-May-2020. https://dl.acm.org/doi/10.1109/ISCA45697.2020.00050

Ma Z, Tan Y, Jiang H, Yan Z, Liu D, Chen X, Zhuge Q, Sha E, Wang C. (2020). Unified-TP: A Unified TLB and Page Table Cache Structure for Efficient Address Translation. 2020 IEEE 38th International Conference on Computer Design (ICCD), 255-262. Online publication date: Oct-2020. https://doi.org/10.1109/ICCD50377.2020.00052

Index Terms

Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing

June 2019

533 pages

ISBN: 9781450360791

DOI: 10.1145/3330345

General Chair: Rudolf Eigenmann (University of Delaware)

Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS '19

Sponsor:

SIGARCH

ICS '19: 2019 International Conference on Supercomputing

June 26 - 28, 2019

Phoenix, Arizona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 3
Total Downloads: 254
Downloads (last 12 months): 32
Downloads (last 6 weeks): 2

Reflects downloads up to 13 Jan 2025
