Hybrid Binary-Unary Hardware Accelerator


Abstract:

Stream-based computing, such as stochastic computing, has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the area saving comes at an exponential price in latency, making the area × delay cost unattractive. In this article, we present a novel method that uses a hybrid binary/unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back into compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or close-to-optimal design in the design space. To the best of our knowledge, we are the first to show a scalable method based on parallel bit-stream data representation that can beat conventional binary in terms of a real cost, i.e., area × delay and energy consumption, in almost all functions that we tried at resolutions of 8, 10, and 12 bits. Our method outperforms the binary, stochastic, and fully unary methods on a number of functions, especially low-cost binary CORDIC-based functions, and on a common edge detection algorithm, both on FPGA and in ASIC implementation. In terms of area × delay cost, our {on FPGA, in ASIC} cost is on average only {4.72%, 24.36%} and {20.16%, 60.12%} of the parallel binary pipeline implementation at 8- and 10-bit resolution, respectively. These numbers are 2-3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the parallel CORDIC-based pipeline binary method for high-resolution (12-bit), highly oscillating functions such as sin(15x). However, for complex functions like the gamma function, the proposed method can beat all other methods in terms of area × delay, throughput, latency, and energy per sample costs.
To implement the Roberts cross edge detection algorithm, the proposed method takes 5.7 and 39.45 percent of t...
Published in: IEEE Transactions on Computers (Volume: 69, Issue: 9, 01 September 2020)
Page(s): 1308 - 1319
Date of Publication: 04 February 2020
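
The three steps named in the abstract (divide the input range into sub-regions, compute in unary within each sub-region, pack the result back to binary) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' circuit or synthesis flow: the function names (`thermometer`, `build_regions`, `hybrid_eval`), the use of the top `k_bits` of the input to select a sub-region, the per-region bias, and the restriction to functions that are non-decreasing inside every sub-region are all hypothetical choices made for illustration.

```python
def thermometer(value, length):
    """Unary (thermometer) code: the first `value` bits are 1, the rest are 0."""
    return [1 if i < value else 0 for i in range(length)]


def build_regions(f, n_bits, k_bits):
    """Precompute a bias and a wiring list for each of the 2**k_bits sub-regions.

    Assumption for this sketch: f is non-decreasing inside every sub-region,
    so each output thermometer bit is a plain wire from one input bit.
    """
    low_bits = n_bits - k_bits
    size = 1 << low_bits
    regions = []
    for r in range(1 << k_bits):
        ys = [f((r << low_bits) | lo) for lo in range(size)]
        assert all(a <= b for a, b in zip(ys, ys[1:])), "sketch needs monotone pieces"
        bias = ys[0]  # binary offset added back after the unary core
        # wires[j] = smallest low-order input value at which output bit j turns on
        wires = [next(lo for lo in range(size) if ys[lo] - bias > j)
                 for j in range(ys[-1] - bias)]
        regions.append((bias, wires))
    return regions


def hybrid_eval(x, regions, n_bits, k_bits):
    """Evaluate f(x) through the binary -> unary -> binary path."""
    low_bits = n_bits - k_bits
    r = x >> low_bits                       # high bits pick the sub-region
    lo = x & ((1 << low_bits) - 1)          # low bits go to the unary core
    t_in = thermometer(lo, (1 << low_bits) - 1)
    bias, wires = regions[r]
    t_out = [t_in[w - 1] for w in wires]    # unary core: wiring only
    return bias + sum(t_out)                # pack back to binary


def scaled_square(x):
    """Example 8-bit function: x^2 scaled back into the 8-bit range."""
    return (x * x) >> 8


regions = build_regions(scaled_square, n_bits=8, k_bits=2)
```

With four sub-regions, each thermometer code is 63 bits long rather than the 255 bits a fully unary 8-bit input would need, which is the kind of saving the sub-region split is meant to provide. Functions that are not monotone within a sub-region would need extra logic in the unary core that this sketch does not model.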
