
Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific

Chapter in the book Approximate Computing

Abstract

Low-precision floating-point (FP) formats enhance performance by providing only the precision an application actually needs. General-purpose formats, such as the IEEE half-precision FP format, are not suitable for every application: at small bit widths, precision and dynamic range must be balanced carefully, and many applications call for a bespoke number format. This chapter provides a comprehensive review of low-precision FP formats, identifying (1) the numerical features of each format (e.g., dynamic range and precision), (2) their usage in the target applications, and (3) their accuracy and performance. Finally, guidelines are given for designing high-performance and efficient applications with customized FP formats.
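
The dynamic range and precision mentioned above follow directly from how a format splits its bits between exponent and mantissa. As a minimal sketch (not taken from the chapter), the Python snippet below computes these properties for an IEEE-754-style binary format from its exponent and mantissa widths and compares FP16 with bfloat16; the helper name format_properties and the assumption of a standard bias with the all-ones exponent reserved for infinities/NaNs are illustrative choices.

    # Properties of an IEEE-754-style format with 1 sign bit,
    # `exp_bits` exponent bits, and `man_bits` mantissa bits.
    def format_properties(exp_bits: int, man_bits: int):
        bias = 2 ** (exp_bits - 1) - 1                        # standard IEEE-754 bias
        max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias     # all-ones exponent reserved for inf/NaN
        min_normal = 2.0 ** (1 - bias)                        # smallest normalized magnitude
        epsilon = 2.0 ** -man_bits                            # spacing of values just above 1.0
        return max_finite, min_normal, epsilon

    for name, e, m in [("FP16 (IEEE half)", 5, 10), ("bfloat16", 8, 7)]:
        max_f, min_n, eps = format_properties(e, m)
        print(f"{name}: max = {max_f:.3e}, min normal = {min_n:.3e}, epsilon = {eps:.3e}")

Running this shows FP16 topping out around 6.5e4 with an epsilon near 1e-3, whereas bfloat16 reaches roughly 3.4e38 at the cost of a coarser epsilon near 8e-3, which is exactly the precision versus dynamic-range trade-off the chapter examines.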



Acknowledgements

This work is supported by FCT (Fundação para a Ciência e a Tecnologia, Portugal) through Project UIDB/50021/2020, by the DiPET project (grant agreement EP/T022345/1 and CHIST-ERA Consortium of European Funding Agencies project no. CHIST-ERA-18-SDCDN-002), by the OPRECOMP project (European Union's H2020-EU.1.2.2. FET Proactive research and innovation programme, grant agreement no. 732631), and by the Entrans project (EU Marie Curie Fellowship, grant agreement no. 798209).

Author information


Corresponding author

Correspondence to Leonel Sousa.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sabbagh Molahosseini, A., Sousa, L., Emrani Zarandi, A.A., Vandierendonck, H. (2022). Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific. In: Liu, W., Lombardi, F. (eds) Approximate Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-98347-5_4


  • DOI: https://doi.org/10.1007/978-3-030-98347-5_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98346-8

  • Online ISBN: 978-3-030-98347-5

  • eBook Packages: Computer Science, Computer Science (R0)
