Improving SIMD Parallelism via Dynamic Binary Translation

Published: 12 February 2018

Abstract

Recent SIMD architectures have trended toward longer vectors, and newer vector instruction sets have introduced richer SIMD features. However, legacy or proprietary applications compiled for a short-SIMD ISA cannot benefit from a long-SIMD architecture that offers greater parallelism and enhanced vector primitives, and so achieve only a small fraction of the potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to a machine-independent IR, performs SIMD loop transformation and optimization at the IR level, and finally translates the result to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment: one for the problem caused by the binary translator’s internal translation condition, and one general approach using dynamic loop peeling. Benchmark results show average speedups of 1.51× and 2.48× for loop transformation from ARM NEON to x86 AVX2 and to x86 AVX-512, respectively.
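
The abstract names dynamic loop peeling as the general way to enforce load/store alignment. The sketch below illustrates only that idea in portable C and is not the article's translator: a few leading iterations run as scalar code until the destination pointer reaches a vector-width boundary, the main body then works on aligned chunks, and a scalar epilogue handles the remainder. The names `peel_count` and `add_peeled`, the 32-byte width (chosen to match an AVX2 register), and the decision to align only the destination are assumptions made for illustration. The inner chunk loop stands in for a single long-SIMD instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32u                         /* assumed vector width (AVX2) */
#define LANES (VEC_BYTES / sizeof(float))     /* 8 floats per vector */

/* Hypothetical helper: how many scalar iterations must run before `p`
 * reaches a vec_bytes boundary. vec_bytes must be a power of two and
 * `p` must be float-aligned. The result is capped at n. */
static size_t peel_count(const float *p, size_t vec_bytes, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p & (vec_bytes - 1));
    if (misalign == 0)
        return 0;
    size_t peel = (vec_bytes - misalign) / sizeof(float);
    return peel < n ? peel : n;
}

/* Computes dst[i] = a[i] + b[i] and returns the number of peeled
 * iterations, which is decided at run time from the address of dst. */
size_t add_peeled(float *dst, const float *a, const float *b, size_t n)
{
    size_t peel = peel_count(dst, VEC_BYTES, n);
    size_t i = 0;

    /* Prologue: scalar iterations until dst + i is aligned. */
    for (; i < peel; ++i)
        dst[i] = a[i] + b[i];

    /* Main body: dst + i is VEC_BYTES-aligned here. Each chunk of LANES
     * elements is what a translator could emit as one aligned vector op. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            dst[i + l] = a[i + l] + b[i + l];

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];

    return peel;
}
```

Because the peel count depends on a run-time address, it cannot be fixed when the original binary is compiled, which is why the peeling has to be dynamic. In this sketch the source arrays `a` and `b` may still be unaligned; a real implementation would need to choose which access to align or fall back to unaligned loads.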

Published In

ACM Transactions on Embedded Computing Systems, Volume 17, Issue 3
May 2018
309 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3185335
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 February 2018
Accepted: 01 December 2017
Revised: 01 September 2017
Received: 01 April 2017
Published in TECS Volume 17, Issue 3

Author Tags

  1. Dynamic binary translation
  2. SIMD
  3. compiler annotation
  4. dynamic loop peeling
  5. vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Ministry of Science and Technology of Taiwan

Cited By

  • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access 12 (34354-34377). DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024
  • (2022) Algorithm-Oriented SIMD Computer Mathematical Model and Its Application. International Journal of Information and Communication Technology Education 18:3 (1-18). DOI: 10.4018/IJICTE.315743. Online publication date: 28-Oct-2022
  • (2022) Reinforcement Learning assisted Loop Distribution for Locality and Vectorization. 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) (1-12). DOI: 10.1109/LLVM-HPC56686.2022.00006. Online publication date: Nov-2022
  • (2020) A Retargetable System-level DBT Hypervisor. ACM Transactions on Computer Systems 36:4 (1-24). DOI: 10.1145/3386161. Online publication date: 30-May-2020
  • (2020) Multi-Target Adaptive Reconfigurable Acceleration for Low-Power IoT Processing. IEEE Transactions on Computers (1-1). DOI: 10.1109/TC.2020.2984736. Online publication date: 2020
  • (2020) More with Less – Deriving More Translation Rules with Less Training Data for DBTs Using Parameterization. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (415-426). DOI: 10.1109/MICRO50266.2020.00043. Online publication date: Oct-2020
  • (2020) Application of Data Mining Methods in Internet of Things Technology for the Translation Systems in Traditional Ethnic Books. IEEE Access 8 (93398-93407). DOI: 10.1109/ACCESS.2020.2994551. Online publication date: 2020
  • (2020) Measurement system with real time data converter for conversion of I2S data stream to UDP protocol data. Heliyon 6:4 (e03760). DOI: 10.1016/j.heliyon.2020.e03760. Online publication date: Apr-2020
  • (2019) A retargetable system-level DBT hypervisor. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (505-520). DOI: 10.5555/3358807.3358850. Online publication date: 10-Jul-2019
  • (2019) Unleashing the power of learning. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (77-89). DOI: 10.5555/3358807.3358815. Online publication date: 10-Jul-2019
