DOI: 10.1145/3123939.3123947
research-article
Open access

TMI: thread memory isolation for false sharing repair

Published: 14 October 2017

Abstract

Cache contention in the form of false sharing and true sharing arises when threads overshare cache lines at high frequency. Such oversharing can reduce or negate the performance benefits of parallel execution. Prior systems for detecting and repairing cache contention lack efficiency in detection or repair, contain subtle memory consistency flaws, or require invasive changes to the program environment.
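As an illustration of the false sharing described above, here is a minimal C sketch. It is not taken from the paper. Two threads each increment only their own counter, so there is no true sharing. In the unpadded struct, however, both counters almost certainly sit on the same cache line, and every write forces that line to bounce between cores. The padded variant gives each counter its own line, which is a typical example of a manual source code fix. The 64-byte line size, the iteration count, and the names `shared_counters`, `padded_counters`, `bump`, `run_pair`, and `report` are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define ITERS 5000000L
#define CACHE_LINE 64 /* assumption: 64-byte cache lines, typical on x86-64 */

/* Adjacent counters: almost certainly on one cache line, so two writer
 * threads falsely share it even though they never touch the same data. */
struct shared_counters {
    volatile long a;
    volatile long b;
};

/* Manual fix: align each counter so it starts its own cache line. */
struct padded_counters {
    alignas(CACHE_LINE) volatile long a;
    alignas(CACHE_LINE) volatile long b;
};

static struct shared_counters shared;
static struct padded_counters padded;

/* Each thread increments only its own counter (no true sharing). */
static void *bump(void *arg) {
    volatile long *counter = arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

/* Run one thread per counter and return elapsed wall-clock seconds. */
static double run_pair(volatile long *x, volatile long *y) {
    struct timespec start, end;
    pthread_t t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Time the falsely shared layout against the padded one. */
static void report(void) {
    double t_shared = run_pair(&shared.a, &shared.b);
    double t_padded = run_pair(&padded.a, &padded.b);
    printf("unpadded: %.3f s, padded: %.3f s\n", t_shared, t_padded);
}
```

Calling `report()` from `main` (compile with `-pthread`) prints both timings. The numbers depend on the machine; on a multicore system the padded run is usually faster, but the sketch does not promise any particular ratio.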
In this paper, we introduce a new way to combat cache line oversharing via the Thread Memory Isolation (TMI) system. TMI operates entirely in userspace, leveraging performance counters and the Linux ptrace mechanism to tread lightly on monitored applications, intervening only when necessary. TMI's compatible-by-default design allows it to scale to real-world workloads, unlike previous proposals. TMI introduces a novel code-centric consistency model to handle cross-language memory consistency issues, and exploits the flexibility of this model to efficiently repair false sharing while preserving strong consistency model semantics when necessary.
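The abstract does not spell out how a repair keeps threads' writes apart. One common approach in page-isolation systems such as Sheriff is to give each thread a private copy of a contended page, keep a "twin" snapshot taken when isolation begins, and at a synchronization point write back only the bytes that differ from the twin. The sketch below simulates that twin-and-diff merge sequentially for two threads whose writes fall on the same line. It assumes, rather than confirms, that TMI's repair works along these lines, and it is an illustration, not TMI's implementation. The tiny 64-byte "page" and the names `commit_diff` and `demo` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 64 /* illustrative "page"; real pages are typically 4 KiB */

/* Commit one thread's private copy to the shared page. Only bytes that
 * differ from the twin are written, so other threads' writes to other
 * bytes of the same page are preserved. Returns the bytes written. */
static size_t commit_diff(unsigned char *shared, const unsigned char *twin,
                          const unsigned char *priv, size_t len) {
    size_t written = 0;
    for (size_t i = 0; i < len; i++) {
        if (priv[i] != twin[i]) {
            shared[i] = priv[i];
            written++;
        }
    }
    return written;
}

/* Simulate two isolated threads that write disjoint bytes of one page. */
static size_t demo(unsigned char *shared) {
    unsigned char twin[PAGE_SIZE], priv0[PAGE_SIZE], priv1[PAGE_SIZE];
    memset(shared, 0, PAGE_SIZE);
    memcpy(twin, shared, PAGE_SIZE);  /* snapshot when isolation begins */
    memcpy(priv0, shared, PAGE_SIZE); /* thread 0's private copy */
    memcpy(priv1, shared, PAGE_SIZE); /* thread 1's private copy */

    memset(priv0, 0xAA, 8);           /* thread 0 writes bytes 0-7 */
    memset(priv1 + 8, 0xBB, 8);       /* thread 1 writes bytes 8-15 */

    /* At a synchronization point, each thread commits its diff. */
    return commit_diff(shared, twin, priv0, PAGE_SIZE) +
           commit_diff(shared, twin, priv1, PAGE_SIZE);
}
```

After `demo`, the shared page holds both threads' updates even though neither thread touched the shared copy while running. A byte-level diff like this merges only disjoint writes; if two threads wrote the same byte, the later commit would win, which is one reason a repair system must distinguish false sharing from true sharing.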
TMI has minimal impact on programs without oversharing, slowing their execution by just 2% on average. We also evaluate TMI on benchmarks with known false sharing, and on the LevelDB key-value store from Google, into which we manually inject a false sharing bug. For these programs, TMI provides an average speedup of 5.2x and achieves 88% of the speedup possible with manual source code fixes.


Cited By

  • (2019) Huron: hybrid false sharing detection and repair. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 453-468. DOI: 10.1145/3314221.3314644. Online publication date: 8 June 2019.
  • (2019) Development of a Random Test Generator for Multi-Core Processor Design Verification. 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), 1200-1204. DOI: 10.1109/ICECA.2019.8822103. Online publication date: June 2019.

Information

Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN:9781450349529
DOI:10.1145/3123939
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. C
  2. C++
  3. false sharing
  4. memory consistency
  5. performance counters

Conference

MICRO-50

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 46
  • Downloads (last 6 weeks): 11
Reflects downloads up to 17 Jan 2025
