DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks

Published: 19 December 2019

Abstract

We present a hardware generator for computations with regular structure, including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize the design at different levels of abstraction, producing a RAM- and area-efficient hardware implementation. Two of these layers and their DSLs are novel. The first allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator, including the DSLs, is implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. In particular, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo-generated IEEE floating-point operators, while ensuring type safety. We show benchmarks of generated FFTs, sorting networks, and Walsh-Hadamard transforms that outperform prior generators.
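To make "regular structure" concrete, the following is a minimal software sketch of one of the transforms named above, the Walsh-Hadamard transform, written as log2(n) identical stages of independent two-input butterflies. It is an illustration only, not the generator's input language or its Verilog output; the function name `wht` and the unnormalized, natural-order convention are assumptions made for this example.

```python
def wht(values):
    """Unnormalized Walsh-Hadamard transform of a power-of-two-length list.

    Hypothetical illustration: a plain software model of the butterfly
    structure, not code taken from the generator described above.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    span = 1
    while span < n:
        # One stage: n/2 independent butterflies on index pairs (i, i + span).
        for block in range(0, n, 2 * span):
            for i in range(block, block + span):
                a, b = out[i], out[i + span]
                out[i], out[i + span] = a + b, a - b
        span *= 2
    return out


if __name__ == "__main__":
    print(wht([1, 2, 3, 4]))
```

Every stage applies the same add/subtract pair and differs only in which elements are paired, which is the kind of regularity that lets such a computation be described as arithmetic blocks connected by permutations.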

Published In

ACM Transactions on Reconfigurable Technology and Systems  Volume 13, Issue 1
March 2020
135 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3377289
Editor: Deming Chen
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 December 2019
Accepted: 01 August 2019
Revised: 01 July 2019
Received: 01 December 2018
Published in TRETS Volume 13, Issue 1

Author Tags

  1. Fast Fourier transform
  2. IP core
  3. Scala
  4. Walsh-Hadamard transform
  5. hardware generation
  6. sorting network
  7. streaming datapaths

Qualifiers

  • Research-article
  • Research
  • Refereed
