Programming and Synthesis for Software-defined FPGA Acceleration: Status and Future Prospects

Authors:

Zhiru Zhang

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 14, Issue 4

Article No.: 17, Pages 1 - 39

https://doi.org/10.1145/3469660

Published: 13 September 2021

Abstract

FPGA-based accelerators are increasingly popular across a broad range of applications, because they offer massive parallelism, high energy efficiency, and great flexibility for customization. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid-2000s, there has been extensive research and development toward making FPGAs accessible to software-inclined developers as well as hardware specialists. Many programming models and automated synthesis tools, such as high-level synthesis, have been proposed to tackle this grand challenge. In this survey, we describe the progression and future prospects of the ongoing effort to significantly improve the software programmability of FPGAs. We first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customizations of the compute engines, memory hierarchy, and data representations. We then summarize a rich spectrum of work on programming abstractions and optimizing compilers that provide different trade-offs between performance and productivity. Finally, we highlight several additional challenges and opportunities that deserve extra attention from the community to bring FPGA-based computing to the masses.


[136]

Nachiket Kapre and Deheng Ye. 2016. GPU-Accelerated high-level synthesis for bitwidth optimization of FPGA datapaths. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'16).

Digital Library

[137]

Soguy Mak karé Gueye, Gwenaël Delaval, Eric Rutten, Dominique Heller, and Jean-Philippe Diguet. 2018. A domain-specific language for autonomic managers in FPGA reconfigurable architectures. In Proceedings of the International Conference on Autonomic Computing (ICAC'18).

[138]

Ryan Kastner, Janarbek Matai, and Stephen Neuendorffer. 2018. Parallel programming for FPGAs. Retrieved from https://arXiv:1805.03648.

[139]

Keras. 2020. Keras. Simple. Flexible. Powerful.Retrieved from https://keras.io/.

[140]

Ronan Keryell and Lin-Ya Yu. 2018. Early experiments using SYCL single-source modern C++ on Xilinx FPGA: Extended Abstract of technical presentation. In Proceedings of the International Workshop on OpenCL.

Digital Library

[141]

Ahmed Khawaja, Joshua Landgraf, Rohith Prakash, Michael Wei, Eric Schkufza, and Christopher J. Rossbach. 2018. Sharing, protection, and compatibility for reconfigurable fabric with amorphos. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI'18).

Digital Library

[142]

Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerating graph analytics by co-optimizing storage and access on an FPGA-HMC platform. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'18).

Digital Library

[143]

Jeffrey Kingyens and J. Gregory Steffan. 2011. The potential for a GPU-Like overlay architecture for FPGAs. Intl. J. Reconfig. Comput. (2011).

[144]

Adam B. Kinsman and Nicola Nicolici. 2009. Finite precision bit-width allocation using SAT-Modulo theory. In Design, Automation, and Test in Europe (DATE'09).

Digital Library

[145]

Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. Proc. ACM Program. Lang. (2017).

Digital Library

[146]

Ana Klimovic and Jason H. Anderson. 2013. Bitwidth-optimized hardware accelerators with software fallback. In Proceedings of the International Conference on Field Programmable Technology (FPT'13).

[147]

David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis et al. 2018. Spatial: A language and compiler for application accelerators. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'18).

Digital Library

[148]

David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. 2016. Automatic generation of efficient accelerators for reconfigurable hardware. In Proceedings of the International Symposium on Computer Architecture (ISCA'16).

Digital Library

[149]

Maciej Kurek, Tobias Becker, Thomas C. P. Chau, and Wayne Luk. 2014. Automating optimization of reconfigurable designs. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'14).

Digital Library

[150]

Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'19).

Digital Library

[151]

Yi-Hsiang Lai, Hongbo Rong, Size Zheng, Weihao Zhang, Xiuping Cui, Yunshan Jia, Jie Wang, Brendan Sullivan, Zhiru Zhang, Yun Liang, et al. 2020. SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'20).

Digital Library

[152]

Chris Lavin and Alireza Kaviani. 2018. RapidWright: Enabling custom crafted implementations for FPGAs. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'18).

[153]

D.-U. Lee, Altaf Abdul Gaffar, Ray C. C. Cheung, Oskar Mencer, Wayne Luk, and George A. Constantinides. 2006. Accuracy-guaranteed bit-width optimization. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 25, 10 (2006), 1990–2000.

Digital Library

[154]

David M. Lewis, Marcus H. van Ierssel, and Daniel H. Wong. 1993. A field programmable accelerator for compiled-code applications. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines.

[155]

Peng Li, Louis-Noël Pouchet, and Jason Cong. 2014. Throughput optimization for high-level synthesis using resource constraints. In Proceedings of the International Workshop on Polyhedral Compilation Techniques (IMPACT'14).

[156]

Wuxi Li, Meng Li, Jiajun Wang, and David Z. Pan. 2017. UTPlaceF 3.0: A parallelization framework for modern FPGA global placement. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'17).

Digital Library

[157]

Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun Wei. 2018. FP-BNN: Binarized neural network on FPGA. Neurocomputing (2018).

[158]

Yibo Lin, Zixuan Jiang, Jiaqi Gu, Wuxi Li, Shounak Dhar, Haoxing Ren, Brucek Khailany, and David Z. Pan. 2020. DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 40, 4 (2020), 748–761.

[159]

Junyi Liu, Samuel Bayliss, and George A. Constantinides. 2015. Offline synthesis of online dependence testing: Parametric loop pipelining for HLS. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'15).

Digital Library

[160]

Ji Liu, Abdullah-Al Kafi, Xipeng Shen, and Huiyang Zhou. 2020. MKPipe: A compiler framework for optimizing multi-kernel workloads in OpenCL for FPGA. In Proceedings of the International Symposium on Supercomputing (ICS'20).

Digital Library

[161]

Junyi Liu, John Wickerson, Samuel Bayliss, and George A. Constantinides. 2017. Polyhedral-based dynamic loop pipelining for high-level synthesis. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 37, 9 (2017), 1802–1815.

Digital Library

[162]

Leo Liu, Jay Weng, and Nachiket Kapre. 2019. RapidRoute: Fast assembly of communication structures for FPGA overlays. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'19).

[163]

Qiang Liu, George A. Constantinides, Konstantinos Masselos, and Peter Y. K. Cheung. 2007. Automatic on-chip memory minimization for data reuse. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'07).

Digital Library

[164]

Charles Lo and Paul Chow. 2016. Model-based optimization of high level synthesis directives. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'16).

[165]

Charles Lo and Paul Chow. 2018. Multi-fidelity optimization for high-level synthesis directives. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'18).

[166]

Charles Lo and Paul Chow. 2020. Hierarchical modelling of generators in design-space exploration. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'20).

[167]

Alec Lu, Zhenman Fang, Weihua Liu, and Lesley Shannon. 2021. Demystifying the memory system of modern datacenter FPGAs for software programmers through microbenchmarking. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'21).

Digital Library

[168]

Adrian Ludwin and Vaughn Betz. 2011. Efficient and deterministic parallel placement for FPGAs. ACM Trans. Design Autom. Electron. Syst. 16, 3 (2011), 1–23.

Digital Library

[169]

Adrian Ludwin, Vaughn Betz, and Ketan Padalia. 2008. High-quality, deterministic parallel placement for FPGAs on commodity hardware. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'08).

Digital Library

[170]

Rui Ma, Jia-Ching Hsu, Tian Tan, Eriko Nurvitadhi, David Sheffield, Rob Pelt, Martin Langhammer, Jaewoong Sim, Aravind Dasu, and Derek Chiou. 2019. Specializing FGPU for persistent deep learning. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'19). 326–333.

[171]

Xiaoyin Ma, Walid A. Najjar, and Amit K. Roy-Chowdhury. 2015. Evaluation and acceleration of high-throughput fixed-point object detection on FPGAs. IEEE Trans. Circ. Syst. Video Technol. 25, 6 (2015), 1051–1062.

Digital Library

[172]

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. 2016. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA'16).

[173]

Hosein Mohammadi Makrani, Farnoud Farahmand, Hossein Sayadi, Sara Bondi, Sai Manoj Pudukotai Dinakarrao, Houman Homayoun, and Setareh Rafatirad. 2019. Pyramid: Machine learning framework to estimate the optimal timing and resource usage of a high-level synthesis design. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'19).

[174]

Maxeler. 2020. Maxeler High-performance Dataflow Computing Systems. Retrieved from https://www.maxeler.com/products/software/maxcompiler/.

[175]

Séamas McGettrick, Kunjan Patel, and Chris Bleakley. 2011. High performance programmable FPGA overlay for digital signal processing. In Proceedings of the International Conference on Reconfigurable Computing: Architectures, Tools and Applications (ARC'11).

Digital Library

[176]

Atefeh Mehrabi, Aninda Manocha, Benjamin C. Lee, and Daniel J. Sorin. 2020. Prospector: Synthesizing efficient accelerators via statistical learning. In Proceedings of the Design, Automation, and Test in Europe (DATE'20).

Digital Library

[177]

Richard Membarth, Oliver Reiche, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. 2016. Hipacc: A domain-specific language and compiler for image processing. IEEE Trans. Parallel Distrib. Syst. 27, 1 (2016), 210–224.

Digital Library

[178]

Mentor. 2020. Catapult High-Level Synthesis. Retrieved from https://s3.amazonaws.com/s3.mentor.com/public_documents/datasheet/hls-lp/catapult-high-level-synthesis.pdf.

[179]

Microchip. 2020. LegUp 9.1 Documentation. Retrieved from https://download-soc.microsemi.com/FPGA/HLS-EAP/docs/legup-9.1-docs/index.html.

[180]

Microchip. 2020. Microchip Acquires High-Level Synthesis Tool Provider LegUp to Simplify Development of PolarFire FPGA-based Edge Compute Solutions. Retrieved from https://www.microchip.com/en-us/about/news-releases/products/microchip-acquires-high-level-synthesis-tool-provider-legup.

[181]

Microsoft. 2020. A Microsoft Custom Data Type for Efficient Inference. Retrieved from https://www.microsoft.com/en-us/research/blog/a-microsoft-custom-data-type-for-efficient-inference/.

[182]

Peter Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. 2012. Computer generation of hardware for linear digital signal processing transforms. ACM Trans. Design Autom. Electron. Syst. 16, 3 (2012), 1–23.

Digital Library

[183]

Yehdhih Moctar, Mirjana Stojilović, and Philip Brisk. 2018. Deterministic parallel routing for FPGAs based on galois parallel execution model. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'18).

[184]

Joshua S. Monson and Brad Hutchings. 2014. New approaches for in-system debug of behaviorally synthesized FPGA circuits. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'14).

[185]

Joshua S. Monson and Brad L. Hutchings. 2015. Using source-level transformations to improve high-level synthesis debug and validation on FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'15).

Digital Library

[186]

Joshua S. Monson and Brad L. Hutchings. 2018. Enhancing debug observability for HLS-based FPGA circuits through source-to-source compilation. J. Parallel Distrib. Comput. 117 (2018), 148–160.

[187]

Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. VTA: An open hardware-software stack for deep learning. Retrieved from https://arXiv:1807.04188.

[188]

Antoine Morvan, Steven Derrien, and Patrice Quinton. 2013. Polyhedral bubble insertion: A method to improve nested loop pipelining for high-level synthesis. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 32, 3 (2013), 339–352.

Digital Library

[189]

Kevin E. Murray, Mohamed A. Elgammal, Vaughn Betz, Tim Ansell, Keith Rothman, and Alessandro Comodi. 2020. SymbiFlow and VPR: An open-source design flow for commercial and novel FPGAs. IEEE Micro (2020).

[190]

Kevin E. Murray, Oleg Petelin, Sheng Zhong, Jia Min Wang, Mohamed Eldafrawy, Jean-Philippe Legault, Eugene Sha, Aaron G. Graham, Jean Wu, Matthew J. P. Walker, et al. 2020. VTR 8: High-performance CAD and customizable FPGA architecture modelling. ACM Trans. Reconfig. Technol. Syst. 13, 2 (2020), 339–352.

Digital Library

[191]

Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al. 2016. A survey and evaluation of FPGA high-level synthesis tools. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 35, 10 (2016), 1591–1604.

Digital Library

[192]

Rachit Nigam, Sachille Atapattu, Samuel Thomas, Zhijing Li, Theodore Bauer, Yuwei Ye, Apurva Koti, Adrian Sampson, and Zhiru Zhang. 2020. Predictable accelerator design with time-sensitive affine types. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'20).

Digital Library

[193]

Mostafa W. Numan, Braden J. Phillips, Gavin S. Puddy, and Katrina Falkner. 2020. Towards automatic high-level code deployment on reconfigurable platforms: A survey of high-level synthesis tools and toolchains. IEEE Access 8 (2020), 174692–174722.

[194]

Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C. Hoe, José F Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA framework for vertex-centric graph computation. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'14).

Digital Library

[195]

William George Osborne, Ray C. C. Cheung, José Gabriel F. Coutinho, Wayne Luk, and Oskar Mencer. 2007. Automatic accuracy-guaranteed bit-width optimization for fixed and floating-point systems. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'07).

[196]

Ganda Stephane Ouedraogo, Matthieu Gautier, and Olivier Sentieys. 2014. A frame-based domain-specific language for rapid prototyping of FPGA-based software-defined radios. EURASIP J. Adv. Signal Process. 1 (2014), 1–15.

[197]

M. Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2016. FPGA-based accelerator design from a domain-specific language. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'16).

[198]

Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-Mei W. Hwu. 2009. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the Symposium on Application Specific Processors (SASP'09).

[199]

Philippos Papaphilippou, Jiuxi Meng, and Wayne Luk. 2020. High-performance FPGA network switch architecture. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'20).

Digital Library

[200]

Dongjoon Park, Yuanlong Xiao, Nevo Magnezi, and André DeHon. 2018. Case for fast FPGA compilation using partial reconfiguration. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'18).

[201]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Retrieved from https://arXiv:1912.01703.

Digital Library

[202]

Ryan Pattison, Christian Fobel, Gary Grewal, and Shawki Areibi. 2015. Scalable analytic placement for FPGA on GPGPU. In Proceedings of the International Conference on Reconfigruable Computing and FPGAs (ReConFig'15).

[203]

Francesco Peverelli, Marco Rabozzi, Emanuele Del Sozzo, and Marco D. Santambrogio. 2018. OXiGen: A tool for automatic acceleration of C functions into dataflow FPGA-based kernels. In Proceedings of the International Parallel and Distributed Processing Symposium Workshops (IPDPSW'18).

[204]

Christian Pilato and Fabrizio Ferrandi. 2013. Bambu: A modular framework for the high level synthesis of memory-intensive applications. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'13).

[205]

Christian Pilato, Daniele Loiacono, Antonino Tumeo, Fabrizio Ferrandi, Pier Luca Lanzi, and Donatella Sciuto. 2010. Speeding-Up expensive evaluations in high-level synthesis using solution modeling and fitness inheritance. Comput. Intell. Exp. Optimiz. Problems (2010).

[206]

Jose P. Pinilla and Steven J. E. Wilton. 2016. Enhanced source-level instrumentation for FPGA in-system debug of high-level synthesis designs. In Proceedings of the International Conference on Field Programmable Technology (FPT'16).

[207]

Louis-Noel Pouchet, Peng Zhang, Ponnuswamy Sadayappan, and Jason Cong. 2013. Polyhedral-based data reuse optimization for configurable computing. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'13).

Digital Library

[208]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming heterogeneous systems from an image processing DSL. ACM Trans. Architect. Code Optimiz. 14, 3 (2017), 1–25.

Digital Library

[209]

Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'16). 26–35.

Digital Library

[210]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices (2013).

Digital Library

[211]

B. Ramakrishna Rau. 1994. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the International Symposium on Microarchitecture (MICRO'94).

Digital Library

[212]

Oliver Reiche, M. Akif Özkan, Richard Membarth, Jürgen Teich, and Frank Hannig. 2017. Generating FPGA-Based image processing accelerators with hipacc. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'17).

Digital Library

[213]

Hongbo Rong. 2017. Programmatic control of a compiler for generating high-performance spatial hardware. Retrieved from https://arXiv:1711.07606.

[214]

Zhenyuan Ruan, Tong He, Bojie Li, Peipei Zhou, and Jason Cong. 2018. ST-Accel: A high-level programming platform for streaming applications on FPGA. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'18).

[215]

Sahand Salamat, Mohsen Imani, Behnam Khaleghi, and Tajana Rosing. 2019. F5-HD: Fast flexible FPGA-based framework for refreshing hyperdimensional computing. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'19).

Digital Library

[216]

Andrew G. Schmidt, Neil Steiner, Matthew French, and Ron Sass. 2012. HwPMI: An extensible performance monitoring infrastructure for improving hardware design and productivity on FPGAs. Int. J. Reconfig. Comput. (2012).

Digital Library

[217]

Robert Schreiber, Shail Aditya, Scott Mahlke, Vinod Kathail, B. Ramakrishna Rau, Darren Cronquist, and Mukund Sivaraman. 2002. PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators. J. VLSI Signal Process. Syst. Signal Image Video Technol. 31, 2 (2002), 127–142.

Digital Library

[218]

Jocelyn Sérot, François Berry, and Sameer Ahmed. 2013. CAPH: A language for implementing stream-processing applications on FPGAs. Embed. Syst. Design FPGAs (2013).

[219]

Aaron Severance, Joe Edwards, Hossein Omidian, and Guy Lemieux. 2014. Soft vector processors with streaming pipelines. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'14).

Digital Library

[220]

Aaron Severance and Guy Lemieux. 2012. VENICE: A compact vector processor for FPGA applications. In Proceedings of the International Conference on Field Programmable Technology (FPT'12).

Digital Library

[221]

Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Proceedings of the International Symposium on Microarchitecture (MICRO'16).

Digital Library

[222]

Minghua Shen and Guojie Luo. 2015. Accelerate FPGA routing with parallel recursive partitioning. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'15).

Digital Library

[223]

Minghua Shen and Guojie Luo. 2017. Corolla: GPU-accelerated FPGA routing based on subgraph dynamic expansion. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'17).

Digital Library

[224]

Minghua Shen, Guojie Luo, and Nong Xiao. 2020. Coarse-grained parallel routing with recursive partitioning for FPGAs. IEEE Trans. Parallel Distrib. Syst. 32, 4 (2020), 884–899.

Digital Library

[225]

Sam Skalicky, Joshua Monson, Andrew Schmidt, and Matthew French. 2018. Hot & spicy: Improving productivity with python and HLS for FPGAs. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'18).

[226]

Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-end optimization of deep learning applications. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'20).

Digital Library

[227]

Roman A. Solovyev, Alexandr A. Kalinin, Alexander G. Kustov, Dmitry V. Telpukhov, and Vladimir S. Ruhlov. 2018. FPGA implementation of convolutional neural networks with fixed-point calculations. Retrieved from https://arXiv:1808.09945.

[228]

Lukas Sommer, Lukas Weber, Martin Kumm, and Andreas Koch. 2020. Comparison of arithmetic number formats for inference in sum-product networks on FPGAs. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'20).

[229]

Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, David Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, et al. 2019. T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computations. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'19).

[230]

Robert Stewart, Kirsty Duncan, Greg Michaelson, Paulo Garcia, Deepayan Bhowmik, and Andrew Wallace. 2018. RIPL: A parallel image processing language for FPGAs. ACM Trans. Reconfig. Technol. Syst. 11, 1 (2018), 1–24.

Digital Library

[231]

Ian Swarbrick, Dinesh Gaitonde, Sagheer Ahmad, Bala Jayadev, Jeff Cuppett, Abbas Morshed, Brian Gaide, and Ygal Arbel. 2019. Versal network-on-chip (NoC). In Proceedings of the Symposium on High-Performance Interconnects (Hot Interconnects'19).

[232]

Synthesijer. 2020. Synthesijer GitHub. Retrieved from https://github.com/synthesijer/synthesijer.

[233]

Mingxing Tan, Steve Dai, Udit Gupta, and Zhiru Zhang. 2015. Mapping-aware constrained scheduling for LUT-Based FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'15).

Digital Library

[234]

Mingxing Tan, Gai Liu, Ritchie Zhao, Steve Dai, and Zhiru Zhang. 2015. Elasticflow: A complexity-effective approach for pipelining irregular loop nests. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'15).

Digital Library

[235]

James Thomas, Pat Hanrahan, and Matei Zaharia. 2020. Fleet: A framework for massively parallel streaming on FPGAs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20).

Digital Library

[236]

Tim Todman and Wayne Luk. 2013. Runtime assertions and exceptions for streaming systems. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'13).

[237]

Stephen M. Steve Trimberger. 2018. Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology: This paper reflects on how Moore's law has driven the design of FPGAs through three epochs: The age of invention, the age of expansion, and the age of accumulation. IEEE Solid-State Circ. Mag. 10, 2 (2018), 16–29.

[238]

Ecenur Ustun, Chenhui Deng, Debjit Pal, Zhijing Li, and Zhiru Zhang. 2020. Accurate operation delay prediction for FPGA HLS using graph neural networks. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'20).

Digital Library

[239]

Ecenur Ustun, Shaojie Xiang, Jinny Gui, Cunxi Yu, and Zhiru Zhang. 2019. LAMDA: Learning-assisted multi-stage autotuning for FPGA design closure. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'19).

[240]

Shervin Vakili, J. M. Pierre Langlois, and Guy Bois. 2013. Enhanced precision analysis for accuracy-aware bit-width optimization using affine arithmetic. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 32, 12 (2013), 1853–1865.

Digital Library

[241]

Anshuman Verma, Huiyang Zhou, Skip Booth, Robbie King, James Coole, Andy Keep, John Marshall, and Wu-chun Feng. 2017. Developing dynamic profiling and debugging support in OpenCL for FPGAs. In Proceedings of the Design Automation Conference (DAC'17).

Digital Library

[242]

Chris C. Wang and Guy G. F. Lemieux. 2011. Scalable and deterministic timing-driven parallel placement for FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'11).

Digital Library

[243]

Dekui Wang, Zhenhua Duan, Cong Tian, Bohu Huang, and Nan Zhang. 2020. ParRA: A shared memory parallel FPGA router using hybrid partitioning approach. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 39, 4 (2020), 830–842.

[244]

Han Wang, Robert Soulé, Huynh Tu Dang, Ki Suh Lee, Vishal Shrivastav, Nate Foster, and Hakim Weatherspoon. 2017. P4FPGA: A rapid prototyping framework for p4. In Proceedings of the Symposium on SDN Research.

Digital Library

[245]

Jie Wang, Licheng Guo, and Jason Cong. 2021. AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'21).

Digital Library

[246]

Shibo Wang and Pankaj Kanwar. 2019. BFloat16: The secret to high performance on cloud TPUs. Google Cloud Blog (2019).

[247]

Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'18).

Digital Library

[248]

Xiaojun Wang and Miriam Leeser. 2010. VFloat: A variable precision fixed- and floating-point library for reconfigurable hardware. ACM Trans. Reconfig. Technol. Syst. 16, 3 (2010), 1–23.

Digital Library

[249]

Yu Wang, James C. Hoe, and Eriko Nurvitadhi. 2019. Processor assisted worklist scheduling for FPGA accelerated graph processing on a shared-memory platform. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'19).

[250]

Yuxin Wang, Peng Li, and Jason Cong. 2014. Theory and algorithm for generalized memory partitioning in high-level synthesis. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'14).

Digital Library

[251]

Yuxin Wang, Peng Li, Peng Zhang, Chen Zhang, and Jason Cong. 2013. Memory partitioning for multidimensional arrays in high-level synthesis. In Proceedings of the Design Automation Conference (DAC'13).

Digital Library

[252]

Saud Wasly, Rodolfo Pellizzoni, and Nachiket Kapre. 2017. HopliteRT: An efficient FPGA NoC for real-time applications. In Proceedings of the International Conference on Field Programmable Technology (FPT'17).

[253]

Richard Wei, Lane Schwartz, and Vikram Adve. 2017. DLVM: A modern compiler infrastructure for deep learning systems. Retrieved from https://arXiv:1711.03016.

[254]

Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the Design Automation Conference (DAC'17).

Digital Library

[255]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.

Digital Library

[256]

Yuanlong Xiao, Dongjoon Park, Andrew Butt, Hans Giesen, Zhaoyang Han, Rui Ding, Nevo Magnezi, Raphael Rubin, and André DeHon. 2019. Reducing FPGA compile time with separate compilation for FPGA building blocks. In Proceedings of the International Conference on Field Programmable Technology (FPT'19).

[257]

Xilinx. 2012. ChipScope Pro Software and Cores (UG029). Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/chipscope_pro_sw_cores_ug029.pdf.

[258]

Xilinx. 2020. SDNet Packet Processor User Guide. Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug1012-sdnet-packet-processor.pdf.

[259]

Xilinx. 2020. Vitis High-Level Synthesis User Guide. Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_2/ug1399-vitis-hls.pdf.

[260]

Xilinx. 2020. Zynq UltraScale+ MPSoC. Retrieved from https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html.

[261]

Yu Xing, Shuang Liang, Lingzhi Sui, Xijie Jia, Jiantao Qiu, Xin Liu, Yushun Wang, Yi Shan, and Yu Wang. 2019. DNNVM: End-to-end compiler leveraging heterogeneous optimizations on FPGA-based CNN accelerators. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 39, 10, 2668–2681.

[262]

Li Yang, Zhezhi He, and Deliang Fan. 2018. A fully onchip binarized convolutional neural network FPGA implementation with accurate inference. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED'18).

[263]

Yifan Yang, Qijing Huang, Bichen Wu, Tianjun Zhang, Liang Ma, Giulio Gambardella, Michaela Blott, Luciano Lavagno, Kees Vissers, John Wawrzynek, and Kurt Keutzer. 2019. Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'19).

[264]

Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. 2007. Exploration and customization of FPGA-based soft processors. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 26, 2 (2007), 266–277.

[265]

Cody Hao Yu, Peng Wei, Max Grossman, Peng Zhang, Vivek Sarkar, and Jason Cong. 2018. S2FA: An accelerator automation framework for heterogeneous computing in datacenters. In Proceedings of the Design Automation Conference (DAC'18).

[266]

Jason Yu, Guy Lemieux, and Christopher Eagleston. 2008. Vector processing as a soft-core CPU accelerator. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'08). 222–232.

[267]

David C. Zaretsky, Gaurav Mittal, Robert P. Dick, and Prith Banerjee. 2007. Balanced scheduling and operation chaining in high-level synthesis for FPGA designs. In Proceedings of the International Symposium on Quality Electronic Design (ISQED'07).

[268]

Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'20).

[269]

Yue Zha and Jing Li. 2020. Virtualizing FPGAs in the cloud. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20). 845–858.

[270]

Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'15).

[271]

Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2019. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 38, 11 (2019), 2072–2085.

[272]

Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. 2018. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'18).

[273]

Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, and Zhiru Zhang. 2021. FracBNN: Accurate and FPGA-efficient binary neural networks with fractional activations. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'21).

[274]

Zhiru Zhang and Bin Liu. 2013. SDC-based modulo scheduling for pipeline synthesis. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'13).

[275]

Jieru Zhao, Liang Feng, Sharad Sinha, Wei Zhang, Yun Liang, and Bingsheng He. 2017. COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications. In Proceedings of the International Conference on Computer-Aided Design (ICCAD'17).

[276]

Jieru Zhao, Tingyuan Liang, Sharad Sinha, and Wei Zhang. 2019. Machine learning based routing congestion prediction in FPGA high-level synthesis. In Proceedings of the Design, Automation, and Test in Europe (DATE'19).

[277]

Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. 2017. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'17).

[278]

Ritchie Zhao, Mingxing Tan, Steve Dai, and Zhiru Zhang. 2015. Area-efficient pipelining for FPGA-targeted high-level synthesis. In Proceedings of the Design Automation Conference (DAC'15).

[279]

Zhipeng Zhao and James C. Hoe. 2017. Using Vivado-HLS for structural design: A NoC case study. Retrieved from https://arxiv.org/abs/1710.10290.

[280]

Guanwen Zhong, Alok Prakash, Yun Liang, Tulika Mitra, and Smail Niar. 2016. Lin-analyzer: A high-level performance analysis tool for FPGA-based accelerators. In Proceedings of the Design Automation Conference (DAC'16).

[281]

Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-throughput graph processing framework on FPGA. IEEE Trans. Parallel Distrib. Syst. 30, 10, 2249–2264.

[282]

Yuan Zhou, Khalid Musa Al-Hawaj, and Zhiru Zhang. 2017. A new approach to automatic memory banking using trace-based address mining. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA'17).

[283]

Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason Cong. 2013. Improving polyhedral code generation for high-level synthesis. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'13).

Index Terms

Programming and Synthesis for Software-defined FPGA Acceleration: Status and Future Prospects
1. Computer systems organization
  1. Architectures
    1. Other architectures
2. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
      1. Hardware-software codesign
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators
      2. Reconfigurable logic applications

Information

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 14, Issue 4

December 2021

165 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3483341

Editor:
Deming Chen
University of Illinois, Urbana-Champaign Urbana, USA

Copyright © 2021 Association for Computing Machinery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2021

Accepted: 01 May 2021

Revised: 01 May 2021

Received: 01 March 2021

Published in TRETS Volume 14, Issue 4

Qualifiers

Research-article
Refereed

Funding Sources

NSF/Intel CAPA
NSERC Discovery
Canada Foundation for Innovation John R. Evans Leaders Fund

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 21
Total Downloads: 1,314
Downloads (Last 12 months): 260
Downloads (Last 6 weeks): 13

Reflects downloads up to 18 Feb 2025

Citations

Cited By

Lahti S, Hämäläinen T. (2025). High-Level Synthesis for FPGAs—A Hardware Engineer’s Perspective. IEEE Access 13, 28574-28593. Online publication date: 2025.
https://doi.org/10.1109/ACCESS.2025.3540320
Chen H, Zhang N, Xiang S, Zeng Z, Dai M, Zhang Z. (2024). Allo: A Programming Model for Composable Accelerator Design. Proceedings of the ACM on Programming Languages 8:PLDI, 593-620. Online publication date: 20-Jun-2024.
https://dl.acm.org/doi/10.1145/3656401
Huang B, Lyubomirsky S, Li Y, He M, Smith G, Tambe T, Gaonkar A, Canumalla V, Cheung A, Wei G, Gupta A, Tatlock Z, Malik S. (2024). Application-level Validation of Accelerator Designs Using a Formal Software/Hardware Interface. ACM Transactions on Design Automation of Electronic Systems 29:2, 1-25. Online publication date: 14-Feb-2024.
https://dl.acm.org/doi/10.1145/3639051
Pouchet L, Tucker E, Zhang N, Chen H, Pal D, Rodríguez G, Zhang Z, Zhang Z, Putnam A. (2024). Formal Verification of Source-to-Source Transformations for HLS. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 97-107. Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.1145/3626202.3637563
Ling M, Feng Z, Chen R, Shao Y, Tang S, Zhu Y. (2024). Vina-FPGA-Cluster: Multi-FPGA Based Molecular Docking Tool With High-Accuracy and Multi-Level Parallelism. IEEE Transactions on Biomedical Circuits and Systems 18:6, 1321-1337. Online publication date: Dec-2024.
https://doi.org/10.1109/TBCAS.2024.3388323
Karlos G, Bal H, Wang L. (2024). NetCL: A Unified Programming Framework for In-Network Computing. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-20. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SC41406.2024.00051
Bisulli L, Scazzoli D, Linsalata F, Magarini M, Mizmizi M, Mazzucco C, Spagnolini U. (2024). Real-time Beamforming Testbed and Tracking Relay for mmWave Applications. 2024 IEEE 30th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 114-119. Online publication date: 21-Aug-2024.
https://doi.org/10.1109/RTCSA62462.2024.00025
Wang Y. (2024). Low-Power Design of Advanced Image Processing Algorithms under FPGA in Real-Time Applications. 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), 1080-1084. Online publication date: 26-Jan-2024.
https://doi.org/10.1109/ICPECA60615.2024.10471036
Garcia P. (2024). Preserving Power Optimizations Across the High Level Synthesis of Distinct Application-Specific Circuits. 2024 Tenth International Conference on Communications and Electronics (ICCE), 125-130. Online publication date: 31-Jul-2024.
https://doi.org/10.1109/ICCE62051.2024.10634722
Tu K, Tang X, Yu C, Josipović L, Chu Z. (2024). High-Level Synthesis. FPGA EDA, 113-134. Online publication date: 1-Feb-2024.
https://doi.org/10.1007/978-981-99-7755-0_8