Permuting streaming data using RAMs

Published: 17 April 2009

Abstract

This article presents a method for constructing hardware structures that perform a fixed permutation on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations. This subclass includes many important permutations such as stride permutations (corner turn, perfect shuffle, etc.), the bit reversal, the Hadamard reordering, and the Gray code reordering.
The datapath for performing the streaming permutation consists of several independent banks of memory and two interconnection networks. These structures are built for a given streaming width (i.e., number of inputs and outputs per cycle) and operate at full throughput for this streaming width.
We provide an algorithm that completely specifies the datapath and control logic given the desired permutation and streaming width. Further, we provide lower bounds on the achievable cost of a solution and show that for an important subclass of permutations our solution is optimal.
We apply our algorithm to derive datapaths for several important permutations, including a detailed example that carefully illustrates each aspect of the design process. Lastly, we compare our permutation structures to those of Järvinen et al. [2004], which are specialized for stride permutations.
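
To make the phrase "linear mappings on the bit-level representation of the data locations" concrete, the following is a minimal Python sketch and not the article's construction. It assumes a permutation on 2^n points is given by an invertible n x n bit matrix P over GF(2), so that the element at address x moves to address P·x. The function names, the row-bitmask encoding of P, and the direction conventions (for example, treating a one-bit cyclic rotation of the address as the perfect shuffle) are illustrative assumptions.

```python
def apply_bit_matrix(rows, addr):
    """Return P*addr over GF(2).

    rows[r] is a bitmask holding row r of P: output bit r is the XOR
    (parity) of the input address bits selected by rows[r].
    """
    out = 0
    for r, mask in enumerate(rows):
        parity = bin(mask & addr).count("1") & 1
        out |= parity << r
    return out


def bit_reversal_rows(n):
    # Output bit r takes input bit n-1-r (anti-diagonal matrix).
    return [1 << (n - 1 - r) for r in range(n)]


def rotation_rows(n, k):
    # Output bit r takes input bit (r - k) mod n: a cyclic rotation of
    # the address bits. With k = 1 this is assumed to model the perfect shuffle.
    return [1 << ((r - k) % n) for r in range(n)]


def gray_code_rows(n):
    # Output bit r = input bit r XOR input bit r+1 (top bit passes through).
    return [(1 << r) | ((1 << (r + 1)) if r + 1 < n else 0) for r in range(n)]


def as_permutation(rows):
    """List the destination address of every source address 0 .. 2^n - 1."""
    n = len(rows)
    return [apply_bit_matrix(rows, x) for x in range(1 << n)]


if __name__ == "__main__":
    n = 3
    for name, rows in [
        ("bit reversal", bit_reversal_rows(n)),
        ("perfect shuffle", rotation_rows(n, 1)),
        ("Gray code", gray_code_rows(n)),
    ]:
        perm = as_permutation(rows)
        # An invertible matrix must yield a bijection on the addresses.
        assert sorted(perm) == list(range(1 << n))
        print(name, perm)
```

Each of these matrices is invertible, so the resulting address map is a bijection. A permutation that cannot be written as such a matrix product falls outside the class the abstract describes.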

References

[1]
Astola, J., and Akopian, D. 1999. Architecture-oriented regular algorithms for discrete sine and cosine transforms. IEEE Trans. Sig. Proc. 47, 4, 1109--1124.
[2]
Beauchamp, K. G. 1984. Applications of Walsh and Related Functions. Academic Press, Orlando, FL.
[3]
Benes, V. E. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, Orlando, FL.
[4]
Bernstein, D. S. 2005. Matrix Mathematics. Princeton University Press, Princeton, NJ.
[5]
Bilardi, G. 1989. Merging and sorting networks with the topology of the omega network. IEEE Trans. Comput. 38, 10, 1396--1403.
[6]
Bürgisser, P., Clausen, M., and Shokrollahi, M. A. 1997. Algebraic Complexity Theory. Springer-Verlag, Berlin, Germany.
[7]
Duhamel, P. 1990. A connection between bit reversal and matrix transposition: Hardware and software consequences. IEEE Trans. Acous., Speech, Signal Proc. 38, 11, 1893--1896.
[8]
Gorman, S. F., and Wills, J. M. 1995. Partial column FFT pipelines. IEEE Trans. Circ. Syst. II: Analog Digital Signal Proc. 42, 6, 414--423.
[9]
Järvinen, T. S., Salmela, P., Sorokin, H., and Takala, J. H. 2004. Stride permutation networks for array processors. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors. IEEE Computer Society Press, Los Alamitos, CA, 326--386.
[10]
Láng, T. 1976. Interconnections between processors and memory modules using the shuffle-exchange network. IEEE Trans. Comput. 25, 5, 496--503.
[11]
Lawrie, D. H. 1975. Access and alignment of data in an array processor. IEEE Trans. Comput. 24, 12, 1145--1155.
[12]
Lee, K. Y. 1985. On the rearrangeability of 2(log₂ N) − 1 stage permutation networks. IEEE Trans. Comput. 34, 5, 412--425.
[13]
Milder, P. A., Franchetti, F., Hoe, J. C., and Püschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings of the 45th Annual ACM/IEEE Conference on Design Automation (DAC). ACM, New York, 385--390.
[14]
Milder, P. A., Hoe, J. C., and Püschel, M. 2009. Automatic generation of streaming datapaths for arbitrary fixed permutations. In Proceedings of Design, Automation and Test in Europe.
[15]
Nordin, G., Milder, P. A., Hoe, J. C., and Püschel, M. 2005. Automatic generation of customized discrete Fourier transform IPs. In Proceedings of the 42nd Annual ACM/IEEE Conference on Design Automation (DAC). ACM, New York, 471--474.
[16]
Parhi, K. K. 1992. Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation. IEEE Trans. Circ. Syst. II: Analog Digital Signal Proc. 39, 7, 423--440.
[17]
Parker, D. S. 1980. Notes on shuffle/exchange-type switching networks. IEEE Trans. Comput. 29, 3, 213--222.
[18]
Pease, M. C. 1977. The indirect binary N-cube microprocessor array. IEEE Trans. Comput. 26, 5, 458--473.
[19]
Püschel, M., and Moura, J. M. F. 2008. Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs. IEEE Trans. Signal Proc. 56, 4, 1502--1521.
[20]
Takala, J. H., Järvinen, T. S., and Sorokin, H. T. 2003. Conflict-free parallel memory access scheme for FFT processors. In Proceedings of the 2003 International Symposium on Circuits and Systems.
[21]
Van Loan, C. 1992. Computational Frameworks for the Fast Fourier Transform. SIAM, Philadelphia, PA.
[22]
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13, 2, 260--269.
[23]
Waksman, A. 1968. A permutation network. J. ACM 15, 1, 159--163.

Published In

Journal of the ACM  Volume 56, Issue 2
April 2009
190 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1502793
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2009
Accepted: 01 December 2008
Revised: 01 November 2008
Received: 01 March 2008
Published in JACM Volume 56, Issue 2

Author Tags

  1. Permutation
  2. RAM
  3. connection network
  4. data reordering
  5. linear bit mapping
  6. matrix transposition
  7. streaming datapath
  8. stride permutation
  9. switch

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) DL-Sort: A Hybrid Approach to Scalable Hardware-Accelerated Fully-Streaming Sorting. IEEE Transactions on Circuits and Systems II: Express Briefs 71:5 (2549-2553). DOI: 10.1109/TCSII.2024.3377255. Online publication date: May-2024
  • (2023) Towards a Flexible Hardware Implementation for Mixed-Radix Fourier Transforms. 2023 IEEE High Performance Extreme Computing Conference (HPEC) (1-7). DOI: 10.1109/HPEC58863.2023.10363540. Online publication date: 25-Sep-2023
  • (2023) Multiplexer & Memory Efficient Bit-Reversal Algorithms. 2023 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) (236-240). DOI: 10.1109/APCCAS60141.2023.00061. Online publication date: 19-Nov-2023
  • (2019) DSL-Based Hardware Generation with Scala. ACM Transactions on Reconfigurable Technology and Systems 13:1 (1-23). DOI: 10.1145/3359754. Online publication date: 19-Dec-2019
  • (2019) Optimum Circuits for Bit-Dimension Permutations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27:5 (1148-1160). DOI: 10.1109/TVLSI.2019.2892322. Online publication date: May-2019
  • (2019) A Novel Generic Low Latency Hybrid Architecture for Parallel Pipelined Radix-2^k Feed Forward FFT. 2019 IEEE International Symposium on Circuits and Systems (ISCAS) (1-5). DOI: 10.1109/ISCAS.2019.8702144. Online publication date: May-2019
  • (2019) In Search of the Optimal Walsh-Hadamard Transform for Streamed Parallel Processing. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1532-1536). DOI: 10.1109/ICASSP.2019.8682213. Online publication date: May-2019
  • (2019) Data staging for efficient high throughput stream processing. Parallel Computing 90 (102566). DOI: 10.1016/j.parco.2019.102566. Online publication date: Dec-2019
  • (2018) Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (219-228). DOI: 10.1145/3174243.3174263. Online publication date: 15-Feb-2018
  • (2018) Multiplexer and Memory-Efficient Circuits for Parallel Bit Reversal. IEEE Transactions on Circuits and Systems II: Express Briefs (1-1). DOI: 10.1109/TCSII.2018.2880921. Online publication date: 2018
