DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks

Authors: François Serre, Markus Püschel

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 13, Issue 1

Article No.: 1, Pages 1 - 23

https://doi.org/10.1145/3359754

Published: 19 December 2019

Abstract

We present a hardware generator for computations with regular structure, including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL-Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator, including the DSLs, is implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. In particular, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo-generated IEEE floating-point operators, while ensuring type safety. We show benchmarks of generated FFTs, sorting networks, and Walsh-Hadamard transforms that outperform prior generators.
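
To illustrate the kind of regular, data-independent structure the abstract refers to, the following minimal Python sketch enumerates the compare-exchange stages of a bitonic sorting network for a power-of-two number of inputs and applies them in software. It is not the generator described in the article and emits no hardware; the function names `bitonic_network` and `apply_network` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence, Tuple

# One comparator: (index_a, index_b, ascending).
# ascending=True places the smaller value at index_a.
Comparator = Tuple[int, int, bool]


def bitonic_network(n: int) -> List[List[Comparator]]:
    """Return the stages of a bitonic sorting network on n = 2^k wires.

    The comparator positions depend only on n, never on the data,
    which is the property that makes such networks easy to lay out
    as a fixed dataflow. (Illustrative helper, not from the article.)
    """
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    stages: List[List[Comparator]] = []
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # comparator distance within one merge step
            stage: List[Comparator] = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    stage.append((i, partner, (i & k) == 0))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages: List[List[Comparator]], data: Sequence[int]) -> List[int]:
    """Run the compare-exchange stages over a copy of data."""
    out = list(data)
    for stage in stages:
        for a, b, ascending in stage:
            # Swap when the pair is out of order for the requested direction.
            if (out[a] > out[b]) == ascending:
                out[a], out[b] = out[b], out[a]
    return out


if __name__ == "__main__":
    net = bitonic_network(8)
    print(len(net), "stages,", sum(len(s) for s in net), "comparators")
    print(apply_network(net, [5, 1, 7, 3, 2, 8, 6, 4]))
```

Each stage holds n/2 independent comparators, so a stage can be evaluated in parallel; for n = 2^k inputs the network has k(k+1)/2 stages.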

Index Terms

Hardware

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 13, Issue 1

March 2020

135 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/3377289

Editor: Deming Chen, University of Illinois, Urbana-Champaign

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 December 2019

Accepted: 01 August 2019

Revised: 01 July 2019

Received: 01 December 2018

Published in TRETS Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Cited By

Farabegoli N, Pianini D, Casadei R, Viroli M (2024). Scalability through Pulverisation. Future Generation Computer Systems 161:C, 545-558. Online publication date: 1-Dec-2024.
https://dl.acm.org/doi/10.1016/j.future.2024.07.042

Van Beirendonck M, D'Anvers J, Turan F, Verbauwhede I, Meng W, Jensen C, Cremers C, Kirda E (2023). FPT: A Fixed-Point Accelerator for Torus Fully Homomorphic Encryption. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 741-755. Online publication date: 15-Nov-2023.
https://dl.acm.org/doi/10.1145/3576915.3623159

Schlaak C, Juang T, Dubach C, Grosser T, Lee K (2022). Optimizing data reshaping operations in functional IRs for high-level synthesis. Proceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 61-72. Online publication date: 14-Jun-2022.
https://dl.acm.org/doi/10.1145/3519941.3535069

Schlaak C, Juang T, Dubach C (2022). Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. Online publication date: 30-Jun-2022.
https://doi.org/10.1145/3501768
