
A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming

Published in: Journal of Signal Processing Systems

Abstract

The increasing adoption of DNN applications has driven the emergence of dedicated tensor computing units that accelerate multi-dimensional matrix multiplication operations. Although such units deploy highly efficient computing architectures, they often lack support for more general-purpose application domains. This limitation stems both from their fixed computation scheme (restricted to matrix multiplication) and from their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) that deploys an array of variable-precision Vector Multiply-Accumulate (VMA) units. Each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU exploits the Posit format's support for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms, to fuse and combine multiple VMAs and map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data transfer scheme, allowing it to accelerate most data-parallel patterns commonly found in vectorizable applications. The proposed RTU was shown to outperform state-of-the-art tensor and SIMD units, both those present in off-the-shelf platforms and those of dedicated FPGA-based accelerators, in turn resulting in significant energy-efficiency improvements.
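The posit format at the core of each VMA unit encodes a real number with a sign bit, a variable-length regime field, up to es exponent bits, and a fraction; for positive values it decodes to useed^k × 2^e × (1 + f), with useed = 2^(2^es). As a rough illustration of this arithmetic, the following is a minimal software sketch of the classic posit<nbits, es> decoding of Gustafson and Yonemoto, followed by a vector multiply-accumulate over decoded operands. It is not the RTU's actual hardware design, and the function names (decode_posit, vma) are hypothetical:

```python
# Minimal, illustrative sketch (not the paper's hardware): software decoding
# of the classic posit<nbits, es> format, plus a vector multiply-accumulate
# over decoded operands. All function names here are hypothetical.

def decode_posit(bits: int, nbits: int = 16, es: int = 1) -> float:
    """Decode an nbits-wide posit with es exponent bits into a Python float."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):                # the single Not-a-Real pattern
        return float("nan")
    sign = bits >> (nbits - 1)
    if sign:                                    # negatives are two's complement
        bits = (-bits) & mask
    # Regime: a run of identical bits after the sign, ended by its complement.
    regime_bit = (bits >> (nbits - 2)) & 1
    run, i = 0, nbits - 2
    while i >= 0 and ((bits >> i) & 1) == regime_bit:
        run, i = run + 1, i - 1
    k = run - 1 if regime_bit else -run         # regime value
    i -= 1                                      # skip the terminating bit
    # Exponent: up to es bits (truncated when the regime is long).
    exp = 0
    for _ in range(es):
        exp <<= 1
        if i >= 0:
            exp |= (bits >> i) & 1
            i -= 1
    # Fraction: the remaining bits, with an implicit leading 1.
    frac_bits = max(0, i + 1)
    frac = bits & ((1 << frac_bits) - 1)
    fraction = 1.0 + (frac / (1 << frac_bits) if frac_bits else 0.0)
    value = fraction * 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value

def vma(acc, xs, ws, nbits=16, es=1):
    """Vector multiply-accumulate, acc + dot(xs, ws), over posit operands."""
    return acc + sum(decode_posit(x, nbits, es) * decode_posit(w, nbits, es)
                     for x, w in zip(xs, ws))

# Example: 0x4000 encodes 1.0 and 0x4800 encodes 2.0 in posit<16,1>.
assert decode_posit(0x4000) == 1.0 and decode_posit(0x4800) == 2.0
assert vma(0.5, [0x4000, 0x4800], [0x4800, 0x4800]) == 6.5
```

Note that a hardware fused accumulator (a quire) keeps the dot product exact and rounds only once at the end; the sketch above merely mimics that dataflow with double-precision floats, which round every operation.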



Notes

  1. Although it is outside the scope of this work, deploying a VMA (or the RTU) either as a CPU functional unit or as a dedicated accelerator only requires connecting each controller to a centralized mechanism that facilitates its programming.

  2. The NVIDIA tensor core was adopted as a representative platform in the domain of tensor accelerators not only due to its accessibility, but also because it constitutes a fair and valid comparison basis, since its topology is close to that of the RTU's base architecture.


Author information


Corresponding author

Correspondence to Nuno Neves.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects UIDB/50021/2020 and PTDC/EEI-HAC/30485/2017.


About this article


Cite this article

Neves, N., Tomás, P. & Roma, N. A Reconfigurable Posit Tensor Unit with Variable-Precision Arithmetic and Automatic Data Streaming. J Sign Process Syst 93, 1365–1385 (2021). https://doi.org/10.1007/s11265-021-01687-7

