research-article

Permuting streaming data using RAMs

Authors:

Markus Püschel,

Peter A. Milder,

James C. Hoe

Journal of the ACM (JACM), Volume 56, Issue 2

Article No.: 10, Pages 1 - 34

https://doi.org/10.1145/1502793.1502799

Published: 17 April 2009

Abstract

This article presents a method for constructing hardware structures that perform a fixed permutation on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations. This subclass includes many important permutations such as stride permutations (corner turn, perfect shuffle, etc.), the bit reversal, the Hadamard reordering, and the Gray code reordering.
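
As a minimal sketch of what "linear on the bit-level representation" can mean, the Python snippet below treats an n-bit location x as a vector over GF(2) and multiplies it by an n-by-n bit matrix. The function names, the row-bitmask encoding, and the convention out[P(x)] = data[x] are assumptions made for illustration and are not taken from the article. The bit rotation is shown as one common way to express the perfect shuffle, and the Gray-code map used here is x XOR (x >> 1).

```python
def parity(v):
    """Return 1 if v has an odd number of set bits, else 0."""
    return bin(v).count("1") & 1


def apply_bit_matrix(rows, x):
    """Multiply the bit vector x by a bit matrix over GF(2).

    rows[i] is an integer bitmask; output bit i is the XOR of the
    input bits selected by rows[i]. (Hypothetical encoding.)
    """
    y = 0
    for i, mask in enumerate(rows):
        y |= parity(mask & x) << i
    return y


def bit_reversal_rows(n):
    # Output bit i copies input bit n-1-i.
    return [1 << (n - 1 - i) for i in range(n)]


def rotation_rows(n, s):
    # Output bit i copies input bit (i - s) mod n: rotate the address left by s.
    return [1 << ((i - s) % n) for i in range(n)]


def gray_rows(n):
    # Output bit i is input bit i XOR input bit i+1 (top bit is copied).
    full = (1 << n) - 1
    return [((1 << i) | (1 << (i + 1))) & full for i in range(n)]


def permute(data, rows):
    """Move the element at location x to location apply_bit_matrix(rows, x)."""
    out = [None] * len(data)
    for x, v in enumerate(data):
        out[apply_bit_matrix(rows, x)] = v
    return out


if __name__ == "__main__":
    data = list(range(8))
    print(permute(data, bit_reversal_rows(3)))  # [0, 4, 2, 6, 1, 5, 3, 7]
    print(permute(data, rotation_rows(3, 1)))   # [0, 4, 1, 5, 2, 6, 3, 7]
    print(permute(data, gray_rows(3)))          # [0, 1, 3, 2, 7, 6, 4, 5]
```

The first two matrices have exactly one set bit per row, so they only reorder address bits. The Gray-code matrix has rows with two set bits, which shows that the linear class is larger than pure bit permutations.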

The datapath for performing the streaming permutation consists of several independent banks of memory and two interconnection networks. These structures are built for a given streaming width (i.e., number of inputs and outputs per cycle) and operate at full throughput for this streaming width.
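
To illustrate why several memory banks at a fixed streaming width are a nontrivial constraint, the sketch below counts cycles in which two of the w words handled in the same cycle would need the same bank. It is only an illustration under stated assumptions: one word per bank per cycle, 16 locations, streaming width 2, and two hypothetical bank assignments (`naive_bank` and `xor_bank`) that are hand-picked for this small case. It does not reproduce the article's algorithm or its interconnection networks.

```python
N_BITS = 4   # 2**4 = 16 data locations (assumed size for the example)
WIDTH = 2    # streaming width: words written and read per cycle


def bit_reverse(x, n=N_BITS):
    """Reverse the n-bit binary representation of x."""
    y = 0
    for i in range(n):
        y |= ((x >> i) & 1) << (n - 1 - i)
    return y


def conflicting_cycles(location_of, bank_of, n=N_BITS, w=WIDTH):
    """Count cycles in which two of the w accessed words share a bank.

    location_of(k) is the data location touched by the k-th streamed word;
    words k = c*w .. c*w + w - 1 are accessed together in cycle c.
    """
    bad = 0
    for start in range(0, 1 << n, w):
        banks = [bank_of(location_of(start + j)) for j in range(w)]
        if len(set(banks)) < w:
            bad += 1
    return bad


def naive_bank(x):
    # Hypothetical assignment: bank index is the low address bit.
    return x % WIDTH


def xor_bank(x):
    # Hypothetical assignment: XOR of the lowest and highest address bits.
    return (x ^ (x >> (N_BITS - 1))) & 1


def identity(k):
    return k


if __name__ == "__main__":
    for name, bank in (("naive", naive_bank), ("xor", xor_bank)):
        writes = conflicting_cycles(identity, bank)     # input arrives in order
        reads = conflicting_cycles(bit_reverse, bank)   # output in bit-reversed order
        print(name, "write conflicts:", writes, "read conflicts:", reads)
```

With the low-bit assignment, in-order writes never collide, but every bit-reversed read cycle asks one bank for two words. The XOR assignment avoids collisions on both sides in this particular case, which hints at why the bank assignment has to be chosen together with the permutation and the streaming width.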

We provide an algorithm that completely specifies the datapath and control logic given the desired permutation and streaming width. Further, we provide lower bounds on the achievable cost of a solution and show that for an important subclass of permutations our solution is optimal.

We apply our algorithm to derive datapaths for several important permutations, including a detailed example that carefully illustrates each aspect of the design process. Lastly, we compare our permutation structures to those of Järvinen et al. [2004], which are specialized for stride permutations.

References

[1] Astola, J., and Akopian, D. 1999. Architecture-oriented regular algorithms for discrete sine and cosine transforms. IEEE Trans. Sig. Proc. 47, 4, 1109--1124.

[2] Beauchamp, K. G. 1984. Applications of Walsh and Related Functions. Academic Press, Orlando, FL.

[3] Benes, V. E. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, Orlando, FL.

[4] Bernstein, D. S. 2005. Matrix Mathematics. Princeton University Press, Princeton, NJ.

[5] Bilardi, G. 1989. Merging and sorting networks with the topology of the omega network. IEEE Trans. Comput. 38, 10, 1396--1403.

[6] Bürgisser, P., Clausen, M., and Shokrollahi, M. A. 1997. Algebraic Complexity Theory. Springer-Verlag, Berlin, Germany.

[7] Duhamel, P. 1990. A connection between bit reversal and matrix transposition: Hardware and software consequences. IEEE Trans. Acous., Speech, Signal Proc. 38, 11, 1893--1418.

[8] Gorman, S. F., and Wills, J. M. 1995. Partial column FFT pipelines. IEEE Trans. Circ. Syst. II: Analog Digital Signal Proc. 42, 6, 414--423.

[9] Järvinen, T. S., Salmela, P., Sorokin, H., and Takala, J. H. 2004. Stride permutation networks for array processors. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors. IEEE Computer Society Press, Los Alamitos, CA, 326--386.

[10] Láng, T. 1976. Interconnections between processors and memory modules using the shuffle-exchange network. IEEE Trans. Comput. 25, 5, 496--503.

[11] Lawrie, D. H. 1975. Access and alignment of data in an array processor. IEEE Trans. Comput. 24, 12, 1145--1155.

[12] Lee, K. Y. 1985. On the rearrangeability of 2(log₂ N) − 1 stage permutation networks. IEEE Trans. Comput. 34, 5, 412--425.

[13] Milder, P. A., Franchetti, F., Hoe, J. C., and Püschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings of the 45th Annual ACM/IEEE Conference on Design Automation (DAC). ACM, New York, 385--390.

[14] Milder, P. A., Hoe, J. C., and Püschel, M. 2009. Automatic generation of streaming datapaths for arbitrary fixed permutations. In Proceedings of Design, Automation and Test in Europe.

[15] Nordin, G., Milder, P. A., Hoe, J. C., and Püschel, M. 2005. Automatic generation of customized discrete Fourier transform IPs. In Proceedings of the 42nd Annual ACM/IEEE Conference on Design Automation (DAC). ACM, New York, 471--474.

[16] Parhi, K. K. 1992. Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation. IEEE Trans. Circ. Syst. II: Analog Digital Signal Proc. 39, 7, 423--440.

[17] Parker, D. S. 1980. Notes on shuffle/exchange-type switching networks. IEEE Trans. Comput. 29, 3, 213--222.

[18] Pease, M. C. 1977. The indirect binary N-cube microprocessor array. IEEE Trans. Comput. 26, 5, 458--473.

[19] Püschel, M., and Moura, J. M. F. 2008. Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs. IEEE Trans. Signal Proc. 56, 4, 1502--1521.

[20] Takala, J. H., Järvinen, T. S., and Sorokin, H. T. 2003. Conflict-free parallel memory access scheme for FFT processors. In Proceedings of the 2003 International Symposium on Circuits and Systems.

[21] Van Loan, C. 1992. Computational Frameworks for the Fast Fourier Transform. SIAM, Philadelphia, PA.

[22] Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 2, 260--269.

[23] Waksman, A. 1968. A permutation network. J. ACM 15, 1, 159--163.

Cited By

Oh, H., Park, J., and Lee, S. 2024. DL-Sort: A Hybrid Approach to Scalable Hardware-Accelerated Fully-Streaming Sorting. IEEE Transactions on Circuits and Systems II: Express Briefs 71, 5, 2549--2553. Online publication date: May 2024.
https://doi.org/10.1109/TCSII.2024.3377255
Vega, M., Yang, X., Shalf, J., and Popovici, D. 2023. Towards a Flexible Hardware Implementation for Mixed-Radix Fourier Transforms. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), 1--7. Online publication date: 25 Sep 2023.
https://doi.org/10.1109/HPEC58863.2023.10363540
Prakash Reddy, B., Kumar, N., Kandpal, K., and Goswami, M. 2023. Multiplexer & Memory Efficient Bit-Reversal Algorithms. In 2023 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 236--240. Online publication date: 19 Nov 2023.
https://doi.org/10.1109/APCCAS60141.2023.00061

Index Terms

Permuting streaming data using RAMs
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis

Published In

Journal of the ACM Volume 56, Issue 2

April 2009

190 pages

ISSN:0004-5411

EISSN:1557-735X

DOI:10.1145/1502793

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2009

Accepted: 01 December 2008

Revised: 01 November 2008

Received: 01 March 2008

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 47

Total Downloads: 1,109

Downloads (Last 12 months): 13

Downloads (Last 6 weeks): 1

Reflects downloads up to 02 Mar 2025

Citations

Cited By

Serre, F., and Püschel, M. 2019. DSL-Based Hardware Generation with Scala. ACM Transactions on Reconfigurable Technology and Systems 13, 1, 1--23. Online publication date: 19 Dec 2019.
https://dl.acm.org/doi/10.1145/3359754
Garrido, M., Grajal, J., and Gustafsson, O. 2019. Optimum Circuits for Bit-Dimension Permutations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 5, 1148--1160. Online publication date: May 2019.
https://doi.org/10.1109/TVLSI.2019.2892322
Nazmy, M., Nasr, O., and Fahmy, H. 2019. A Novel Generic Low Latency Hybrid Architecture for Parallel Pipelined Radix-2^k Feed Forward FFT. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 1--5. Online publication date: May 2019.
https://doi.org/10.1109/ISCAS.2019.8702144
Serre, F., and Püschel, M. 2019. In Search of the Optimal Walsh-Hadamard Transform for Streamed Parallel Processing. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1532--1536. Online publication date: May 2019.
https://doi.org/10.1109/ICASSP.2019.8682213
Koehn, T., and Athanas, P. 2019. Data staging for efficient high throughput stream processing. Parallel Computing 90, 102566. Online publication date: Dec 2019.
https://doi.org/10.1016/j.parco.2019.102566
Serre, F., Püschel, M., Anderson, J., and Bazargan, K. 2018. Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 219--228. Online publication date: 15 Feb 2018.
https://dl.acm.org/doi/10.1145/3174243.3174263
Garrido, M. 2018. Multiplexer and Memory-Efficient Circuits for Parallel Bit Reversal. IEEE Transactions on Circuits and Systems II: Express Briefs, 1--1. Online publication date: 2018.
https://doi.org/10.1109/TCSII.2018.2880921
