skip to main content
10.1145/3173162.3173198acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article
Public Access

LATR: Lazy Translation Coherence

Published: 19 March 2018 Publication History

Abstract

We propose LATR-lazy TLB coherence-a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling the TLB coherence in a lazy fashion, LATR can avoid expensive IPIs which are required for delivering a shootdown signal to remote cores, and the performance overhead of associated interrupt handlers. Therefore, virtual memory operations, such as free and page migration operations, can benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap() by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from LATR: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.

References

[1]
Lluc Alvarez, Llu'ıs Vilanova, Miquel Moreto, Marc Casas, Marc Gonzàlez, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, and Mateo Valero. Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures. In Proceedings of the 42nd ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 720--732, Portland, OR, June 2015.
[2]
Nadav Amit. Optimizing the TLB Shootdown Algorithm with Page Access Tracking. In Proceedings of the 2017 USENIX Annual Technical Conference (ATC), pages 27--39, Santa Clara, CA, July 2017.
[3]
Lukasz Anaczkowski. Linux VM workaround for Knights Landing A/D leak, 2016. https://lkml.org/lkml/2016/6/14/505.
[4]
Apache. Apache HTTP Server Project, 2017. https://httpd.apache.org/.
[5]
Ravi Arimilli, Guy Guthrie, and Kirk Livingston. Multiprocessor system supporting multiple outstanding TLBI operations per partition, October 2004. US Patent App. 10/425,425.
[6]
ARM. ARM Compiler Reference Guide: TLBI, 2014. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/TLBI_SYS.html.
[7]
Amro Awad, Arkaprava Basu, Sergey Blagodurov, Yan Solihin, and Gabriel H. Loh. Avoiding TLB Shootdowns through Self-invalidating TLB Entries. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 273--287, Portland, OR, September 2017.
[8]
Ramesh Balan and Kurt Gollhard. A Scalable Implementation of Virtual Memory HAT Layer for Shared Memory Multiprocessor Machine. In Proceedings of the Summer 1992 USENIX Annual Technical Conference (ATC), pages 107--115, San Antonio, TX, June 1992.
[9]
Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. Attack of the Killer Microseconds. Communications of the ACM, 60(4):48--54, March 2017.
[10]
Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), pages 29--44, Big Sky, MT, October 2009.
[11]
Abhishek Bhattacharjee. Translation-Triggered Prefetching. In Proceedings of the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 63--76, Xi'an, China, April 2017.
[12]
Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. Shared Last-Level TLBs for Chip Multiprocessors. In Proceedings of the 17th IEEE Symposium on High Performance Computer Architecture (HPCA), pages 62--73, San Antonio, TX, February 2011.
[13]
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 72--81, Toronto, Canada, October 2008.
[14]
Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H. Loh, Don McCaule, Pat Morrow, Donald W. Nelson, Daniel Pantuso, Paul Reed, Jeff Rupley, Sadasivan Shankar, John Shen, and Clair Webb. Die Stacking (3D) Microarchitecture. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 469--479, Orlando, FL, December 2006.
[15]
David L. Black, Richard F. Rashid, David B. Golub, Charles R. Hill, and Robert V. Baron. Translation Lookaside Buffer Consistency: A Software Approach. In Proceedings of the 3rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 113--122, Boston, MA, April 1989.
[16]
Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 43--57, San Diego, CA, December 2008.
[17]
Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 1--16, Vancouver, Canada, October 2010.
[18]
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable Address Spaces for Multithreaded Applications. In Proceedings of the 8th European Conference on Computer Systems (EuroSys), pages 211--224, Prague, Czech Republic, April 2013.
[19]
Jonathan Corbet. Memory compaction, 2010. https://lwn.net/Articles/368869/.
[20]
Jonathan Corbet. AutoNUMA: the other approach to NUMA scheduling, 2012. https://lwn.net/Articles/488709/.
[21]
Jonathan Corbet. (Nearly) full tickless operation in 3.10, 2013. https://lwn.net/Articles/549580/.
[22]
Christopher Covington. arm64: Work around Falkor erratum 1003, 2016. https://lkml.org/lkml/2016/12/29/267.
[23]
Guilherme Cox and Abhishek Bhattacharjee. Efficient Address Translation for Architectures with Multiple Page Sizes. In Proceedings of the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 435--448, Xi'an, China, April 2017.
[24]
Linux Kernel Driver Database. CONFIG_ARM_ERRATA_720789, 2017. http://cateee.net/lkddb/web-lkddb/ARM_ERRATA_720789.html.
[25]
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137--150, San Francisco, CA, December 2004.
[26]
FreeBSD. FreeBSD - PCID implementation, 2015. https://reviews.freebsd.org/rS282684.
[27]
Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 249--264, Savannah, GA, November 2016.
[28]
Jeff Gilchrist. Parallel Compression with BZIP2. In Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), pages 559--564, Cambridge, MA, November 2004.
[29]
Will Glozer. wrk - a HTTP benchmarking tool, 2015. https://github.com/wg/wrk.
[30]
Mel Gorman. TLB flush multiple pages per IPI, 2015. https://lkml.org/lkml/2015/4/25/125.
[31]
Graph500 Reference Implementations, 2017. http://graph500.org/?page_id=47.
[32]
Intel. Multiprocessor Specification, 1997.
[33]
Intel Xeon Processor E5--4610 v2, 2014. http://ark.intel.com/products/75285/Intel-Xeon-Processor-E5--4610-v2--16M-Cache-2_30-GHz.
[34]
Introduction to Cache Allocation Technology in the Intel Xeon Processor E5 v4 Family, 2016. https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology.
[35]
Intel Xeon Processor E7--8894 v4, 2017. http://ark.intel.com/products/96900/Intel-Xeon-Processor-E7--8894-v4--60M-Cache-2_40-GHz.
[36]
Gu Juncheng, Lee Youngmoon, Zhang Yiwen, Chowdhury Mosharaf, and Shin Kang. Efficient Memory Disaggregation with Infiniswap. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, April 2017.
[37]
Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman S. Ünsal. Energy-Efficient Address Translation. In Proceedings of the 22nd IEEE Symposium on High Performance Computer Architecture (HPCA), pages 631--643, Barcelona, Spain, March 2016.
[38]
Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 705--721, Savannah, GA, November 2016.
[39]
Daniel Lustig, Abhishek Bhattacharjee, and Margaret Martonosi. TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs. ACM Transactions on Architecture and Code Optimization (TACO), 10(1):2:1--2:38, April 2013.
[40]
Daniel Lustig, Geet Sethi, Margaret Martonosi, and Abhishek Bhattacharjee. COATCheck: Verifying Memory Ordering at the Hardware-OS Interface. In Proceedings of the 21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 233--247, Atlanta, GA, April 2016.
[41]
Yandong Mao, Robert Morris, and Frans Kaashoek. Optimizing MapReduce for Multicore Architectures. Technical Report MIT-CSAIL-TR-2010-020, MIT, May 2010.
[42]
Mitesh R. Meswani, Sergey Blagodurov, David Roberts, John Slice, Mike Ignatowski, and Gabriel H. Loh. Heterogeneous Memory Architectures: A HW/SW Approach for Mixing Die-Stacked and Off-Package Memories. In Proceedings of the 21st IEEE Symposium on High Performance Computer Architecture (HPCA), pages 126--136, San Francisco, CA, February 2015.
[43]
Timothy Prickett Morgan. AMD Disrupts The Two-Socket Server Status Quo, 2017. https://www.nextplatform.com/2017/05/17/amd-disrupts-two-socket-server-status-quo/.
[44]
Mark Oskin and Gabriel H. Loh. A Software-Managed Approach to Die-Stacked DRAM. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 188--200, San Francisco, CA, September 2015.
[45]
J. Kent Peacock, Sunil Saxena, Dean Thomas, Fred Yang, and Wilfred Yu. Experiences from Multithreading System V Release 4. In Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, SEDMS III, pages 77--91, 1992.
[46]
Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. Increasing TLB Reach by Exploiting Clustering in Page Translations. In Proceedings of the 20th IEEE Symposium on High Performance Computer Architecture (HPCA), pages 558--567, Orlando, FL, USA, February 2014.
[47]
Binh Pham, Derek Hower, Abhishek Bhattacharjee, and Trey Cain. TLB Shootdown Mitigation for Low-Power, Many-Core Servers with L1 Virtual Caches. IEEE Computer Architecture Letters, PP(99), June 2017.
[48]
Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 258--269, Vancouver, Canada, December 2012.
[49]
Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee. Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have It Both Ways? In Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1--12, Waikiki, Hawaii, December 2015.
[50]
Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the 19th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 743--758, Salt Lake City, UT, March 2014.
[51]
Jason Power, Mark D. Hill, and David A. Wood. Supporting x86--64 Address Translation for 100s of GPU Lanes. In Proceedings of the 20th IEEE Symposium on High Performance Computer Architecture (HPCA), pages 568--578, Orlando, FL, USA, February 2014.
[52]
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 13th IEEE Symposium on High Performance Computer Architecture (HPCA), pages 13--24, Phoenix, AZ, February 2007.
[53]
Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. Specifying and Dynamically Verifying Address Translation-aware Memory Consistency. In Proceedings of the 15th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 323--334, Pittsburgh, PA, March 2010.
[54]
Bogdan F. Romanescu, Alvin R. Lebeck, Daniel J. Sorin, and Anne Bracy. UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All. In Proceedings of the 16th IEEE Symposium on High Performance Computer Architecture (HPCA), pages 1--12, Bangalore, India, January 2010.
[55]
Anand Lal Shimpi. AMD's B3 stepping Phenom previewed, TLB hardware fix tested., 2008. http://www.anandtech.com/show/2477/2.
[56]
Patricia Teller. Translation-Lookaside Buffer Consistency. Computer, 23(6):26--36, June 1990.
[57]
Patricia J. Teller, Richard Kenner, and Marc Snir. TLB Consistency on Highly-Parallel Shared-Memory Multiprocessors. In Proceedings of the 21st Annual Hawaii International Conference on System Sciences. Volume I: Architecture Track, volume 1, pages 184--193, 1988.
[58]
Scott Rixner Thomas Barr, Alan Cox. SpecTLB: a Mechanism for Speculative Address Translation. In Proceedings of the 38th ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 307--318, San Jose, California, USA, June 2011.
[59]
Michael Y Thompson, JM Barton, TA Jermoluk, and JC Wagner. Translation Lookaside Buffer Synchronization in a Multiprocessor System. In Proceedings of the Winter 1988 USENIX Annual Technical Conference (ATC), Dallas, TX, 1988.
[60]
Linus Torvalds. Linux Kernel, 2017. https://github.com/torvalds/linux.
[61]
Theo Valich. Intel explains the Core 2 CPU errata., 2007. http://www.theinquirer.net/inquirer/news/1031406/intel-explains-core-cpu-errata.
[62]
Carlos Villavieja, Vasileios Karakostas, Lluis Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrián Cristal, and Osman S. Ünsal. DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 340--349, Galveston Island, TX, October 2011.
[63]
Carl A. Waldspurger. Memory Resource Management in VMware ESX Server. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 181--194, Boston, MA, December 2002.
[64]
Zi Yan, Ján Veselý, Guilherme Cox, and Abhishek Bhattacharjee. Hardware Translation Coherence for Virtualized Systems. In Proceedings of the 44th ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 430--443, Toronto, Canada, June 2017.

Cited By

View all
  • (2024)PMAlloc: A Holistic Approach to Improving Persistent Memory AllocationACM Transactions on Computer Systems10.1145/364388642:3-4(1-52)Online publication date: 20-Sep-2024
  • (2024)Genie Cache: Non-Blocking Miss Handling and Replacement in Page-Table-Based DRAM Cache2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00076(983-996)Online publication date: 2-Nov-2024
  • (2024)STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00031(309-323)Online publication date: 2-Nov-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN:9781450349116
DOI:10.1145/3173162
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 53, Issue 2
    ASPLOS '18
    February 2018
    809 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3296957
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. TLB
  2. asynchrony
  3. translation coherence

Qualifiers

  • Research-article

Funding Sources

  • Defense Advanced Research Projects Agency
  • Office of Naval Research
  • Electronics and Telecommunications Research Institute
  • National Science Foundation

Conference

ASPLOS '18

Acceptance Rates

ASPLOS '18 Paper Acceptance Rate 56 of 319 submissions, 18%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)307
  • Downloads (Last 6 weeks)51
Reflects downloads up to 22 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)PMAlloc: A Holistic Approach to Improving Persistent Memory AllocationACM Transactions on Computer Systems10.1145/364388642:3-4(1-52)Online publication date: 20-Sep-2024
  • (2024)Genie Cache: Non-Blocking Miss Handling and Replacement in Page-Table-Based DRAM Cache2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00076(983-996)Online publication date: 2-Nov-2024
  • (2024)STAR: Sub-Entry Sharing-Aware TLB for Multi-Instance GPU2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00031(309-323)Online publication date: 2-Nov-2024
  • (2024)Distributed Page Table: Harnessing Physical Memory as an Unbounded Hashed Page Table2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00013(36-49)Online publication date: 2-Nov-2024
  • (2024)Design and Implementation of Efficient and Transparent Zero Copy ReadIEEE Access10.1109/ACCESS.2024.350268812(174078-174093)Online publication date: 2024
  • (2023)IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE InvalidationsProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3614269(1163-1177)Online publication date: 28-Oct-2023
  • (2023)Virtual-Memory Assisted Buffer ManagementProceedings of the ACM on Management of Data10.1145/35886871:1(1-25)Online publication date: 30-May-2023
  • (2023)Contiguitas: The Pursuit of Physical Memory Contiguity in DatacentersProceedings of the 50th Annual International Symposium on Computer Architecture10.1145/3579371.3589079(1-15)Online publication date: 17-Jun-2023
  • (2023)Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071054(456-470)Online publication date: Feb-2023
  • (2023)AstriFlash A Flash-Based System for Online Services2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10070955(81-93)Online publication date: Feb-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media