Research article. DOI: 10.1145/3404397.3404403

DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms

Published: 17 August 2020

Abstract

The scalability of a dynamic binary translation (DBT) system has become important due to the prevalence of multicore systems and large multi-threaded applications. Several recent efforts have addressed some critical issues in extending a DBT system to run on multicore platforms for better scalability. In this paper, we present a distributed DBT framework, called DQEMU, that goes beyond a single-node multicore processor and can be scaled up to a cluster of multi-node servers.
In such a distributed DBT system, we integrate a page-level directory-based data coherence protocol, a hierarchical locking mechanism, a delegation scheme for system calls, and a remote thread migration approach, all of which are effective in reducing its overheads. We also propose several performance optimization strategies: page splitting to mitigate false data sharing among nodes, data forwarding for latency hiding, and a hint-based locality-aware scheduling scheme. We conducted comprehensive experiments on DQEMU with micro-benchmarks and the PARSEC benchmark suite. The results show that DQEMU can scale beyond a single-node machine with reasonable overheads. For "embarrassingly parallel" benchmark programs, DQEMU achieves near-linear speedup as the number of nodes increases, whereas the current single-node, multicore version of QEMU flattens out once it runs out of computing resources.
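To make the terms "page-level directory-based coherence" and "false data sharing" concrete, the following is a minimal sketch of a generic page-granularity directory with MSI-style ownership. It is an illustration under stated assumptions, not DQEMU's implementation: the names `PageEntry`, `PageDirectory`, and the `messages` counter are hypothetical, and the abstract does not describe the protocol's states or message types.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class PageEntry:
    """Directory state for one page (hypothetical layout, not from the paper)."""
    owner: Optional[int] = None                      # node with the writable copy
    sharers: Set[int] = field(default_factory=set)   # nodes with read-only copies


class PageDirectory:
    """Toy directory that tracks, per page, which nodes hold a copy."""

    def __init__(self) -> None:
        self.pages: Dict[int, PageEntry] = {}
        self.messages = 0  # simulated invalidation/downgrade messages

    def read(self, node: int, page: int) -> None:
        entry = self.pages.setdefault(page, PageEntry())
        if entry.owner is not None and entry.owner != node:
            # Downgrade the current writer to a sharer before granting the read.
            entry.sharers.add(entry.owner)
            entry.owner = None
            self.messages += 1
        if entry.owner != node:
            entry.sharers.add(node)

    def write(self, node: int, page: int) -> None:
        entry = self.pages.setdefault(page, PageEntry())
        if entry.owner == node:
            return  # already exclusive: no coherence traffic
        # Every other copy must be invalidated before the write proceeds.
        others = set(entry.sharers)
        if entry.owner is not None:
            others.add(entry.owner)
        others.discard(node)
        self.messages += len(others)
        entry.sharers.clear()
        entry.owner = node


def ping_pong(pages_for_nodes: Dict[int, int], rounds: int) -> int:
    """Each node repeatedly writes its own variable; return messages sent."""
    directory = PageDirectory()
    for _ in range(rounds):
        for node, page in pages_for_nodes.items():
            directory.write(node, page)
    return directory.messages
```

In this toy model, calling `ping_pong({0: 7, 1: 7}, rounds)` places two nodes' unrelated variables on the same page, so every write after the first triggers an invalidation. Calling `ping_pong({0: 7, 1: 8}, rounds)` models the effect of splitting the page so that each node's data lives on its own page, which removes that traffic. Real systems pay a network round trip per message, which is why page-level false sharing is costly in a distributed setting.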

Cited By

  • FADATest. Proceedings of the 44th International Conference on Software Engineering (2022), 896-908. DOI: 10.1145/3510003.3510169. Online publication date: 21-May-2022.
  • WDBT. Journal of Systems and Software 187:C (2022). DOI: 10.1016/j.jss.2022.111247. Online publication date: 1-May-2022.
  • WDBT: Wear Characterization, Reduction, and Leveling of DBT Systems for Non-Volatile Memory. Proceedings of the International Symposium on Memory Systems (2021), 1-13. DOI: 10.1145/3488423.3519337. Online publication date: 27-Sep-2021.

Published In

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020
844 pages
ISBN:9781450388160
DOI:10.1145/3404397

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Dynamic binary translator
  2. distributed emulator
  3. distributed system

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '20

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

