research-article

Thread Warping: Dynamic and Transparent Synthesis of Thread Accelerators

Authors: Frank Vahid

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 16, Issue 3

Article No.: 32, Pages 1 - 21

https://doi.org/10.1145/1970353.1970365

Published: 01 June 2011

Abstract

We introduce thread warping, a dynamic optimization technique that customizes multicore architectures to a given application by dynamically synthesizing threads into custom accelerator circuits on FPGAs (Field-Programmable Gate Arrays). Thread warping builds upon previous dynamic synthesis techniques for single-threaded applications, enabling dynamic architectural adaptation to different amounts of thread-level parallelism, while also exploiting parallelism within each thread to further improve performance. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGAs, an advantage not shared by static compilation/synthesis approaches. We introduce an approach consisting of CAD tools and operating system support that enables thread warping on potentially any microprocessor/FPGA architecture. We evaluate thread warping using a simulator for high-performance computing systems with different interconnections in addition to multicore embedded systems having between 4 and 64 ARM11 microprocessors. On average, thread warping achieved approximately 3x speedup compared to a high-performance quad-core Intel Xeon and 109x compared to an embedded system consisting of 4 ARM11 cores, with a size cost approximately equal to 36 ARM11 cores.
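To put the reported embedded-system figures in perspective, the following back-of-the-envelope sketch compares the 109x speedup against a hypothetical multicore of equal area. It is only an illustration under stated assumptions, not a calculation taken from the article: it assumes the accelerator area (about 36 ARM11 cores) is added on top of the 4 baseline cores, and that a purely software multicore would scale ideally (linearly) with core count. The function name `area_normalized_gain` is made up for this example.

```python
def area_normalized_gain(speedup, baseline_cores, accel_area_cores):
    """Speedup relative to an ideally scaling multicore of equal area.

    Assumptions (not from the article): accelerator area is added to the
    baseline cores, and software scales linearly with the number of cores.
    """
    total_area_cores = baseline_cores + accel_area_cores
    # Ideal speedup of an all-core system occupying the same area.
    ideal_multicore_speedup = total_area_cores / baseline_cores
    return speedup / ideal_multicore_speedup


# Figures from the abstract: 109x over 4 ARM11 cores, area cost of ~36 ARM11 cores.
gain = area_normalized_gain(speedup=109.0, baseline_cores=4, accel_area_cores=36)
print(f"Gain over an ideal 40-core system: {gain:.1f}x")
```

Under these assumptions, spending the same area on 36 extra ARM11 cores would give at most a 10x improvement, so the reported speedup would be roughly an order of magnitude beyond what ideal thread-level scaling alone could provide.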

Index Terms

Thread Warping: Dynamic and Transparent Synthesis of Thread Accelerators
1. Computer systems organization
  1. Embedded and cyber-physical systems
  2. Real-time systems

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 16, Issue 3

June 2011

330 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1970353

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 01 June 2011

Accepted: 01 January 2011

Revised: 01 August 2009

Received: 01 February 2009

Published in TODAES Volume 16, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics & Citations

Article Metrics

Total Citations: 4
Total Downloads: 343
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 1

Reflects downloads up to 30 Dec 2024

Cited By

Paulino N, Ferreira J, Bispo J, Cardoso J, Nebel W, Atienza D (2015) Transparent acceleration of program execution using reconfigurable hardware. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, 1066-1071. Online publication date: 9-Mar-2015. https://dl.acm.org/doi/10.5555/2755753.2757061

Belwal M, Purnaprajna M, Sudarshan TSB (2015) Enabling seamless execution on hybrid CPU/FPGA systems: Challenges & directions. 2015 25th International Conference on Field Programmable Logic and Applications (FPL), 1-8. Online publication date: Sep-2015. https://doi.org/10.1109/FPL.2015.7294022

Paulino N, Ferreira J, Cardoso J (2014) A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses. ACM Transactions on Reconfigurable Technology and Systems 7:4, 1-20. Online publication date: 29-Dec-2014. https://dl.acm.org/doi/10.1145/2629468

Paulino N, Ferreira J, Cardoso J (2014) Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support. Proceedings of the 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications, 158-165. Online publication date: 26-Aug-2014. https://dl.acm.org/doi/10.1109/ISPA.2014.29
