
PipeArch: Generic and Context-Switch Capable Data Processing on FPGAs

Published: 05 November 2020

Abstract

Data processing systems based on FPGAs offer high performance and energy efficiency for a variety of applications. However, these advantages are achieved through highly specialized designs. The high degree of specialization leads to accelerators with narrow functionality and designs adhering to a rigid execution flow. For multi-tenant systems, this limits the scope of applicability of FPGA-based accelerators, because, first, supporting a single operation is unlikely to have any significant impact on the overall performance of the system, and, second, serving multiple users satisfactorily is difficult due to simplistic scheduling policies enforced when using the accelerator. Standard operating system and database management system features that would help address these limitations, such as context-switching, preemptive scheduling, and thread migration, are practically non-existent in current FPGA accelerator efforts.
In this work, we propose PipeArch, an open-source project for developing FPGA-based accelerators that combine the high efficiency of specialized hardware designs with the generality and functionality known from conventional CPU threads. PipeArch provides programmability and extensibility in the accelerator without losing the advantages of SIMD-parallelism and deep pipelining. PipeArch supports context-switching and thread migration, thereby enabling, for the first time, new capabilities such as preemptive scheduling in FPGA accelerators within a high-performance data processing setting. We have used PipeArch to implement a variety of machine learning methods for generalized linear model training and recommender systems, showing empirically their advantages over a high-end CPU and even over fully specialized FPGA designs.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 1
March 2021
138 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3418746
Editor: Deming Chen
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 November 2020
Accepted: 01 August 2020
Revised: 01 June 2020
Received: 01 April 2020
Published in TRETS Volume 14, Issue 1

Author Tags

  1. FPGA
  2. context-switch
  3. data processing
  4. generalized linear models
  5. generic architecture
  6. high-performance
  7. machine learning
  8. matrix factorization
  9. programmable
  10. training

Qualifiers

  • Research-article
  • Research
  • Refereed
