
Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific

Chapter in the book Approximate Computing

Abstract

Low-precision floating-point (FP) formats enhance performance by providing only the precision an application actually needs. General-purpose formats, such as the IEEE half-precision FP format, are not suitable for every application: at small bit widths, precision and dynamic range must be balanced carefully, and many applications call for a bespoke number format. This chapter provides a comprehensive review of low-precision FP formats, identifying (1) the numerical features of each format (e.g., dynamic range and precision), (2) their usage in the target applications, and (3) their accuracy and performance. Finally, guidelines are given for designing high-performance and efficient applications with customized FP formats.
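
The dynamic range and precision mentioned above follow directly from how a format splits its bits between exponent and mantissa. As a minimal sketch (not taken from the chapter), the Python snippet below computes these properties for an IEEE-754-style binary format from its exponent and mantissa widths and compares FP16 with bfloat16; the helper name format_properties and the assumption of a standard bias with the all-ones exponent reserved for infinities/NaNs are illustrative choices.

    # Properties of an IEEE-754-style format with 1 sign bit,
    # `exp_bits` exponent bits, and `man_bits` mantissa bits.
    def format_properties(exp_bits: int, man_bits: int):
        bias = 2 ** (exp_bits - 1) - 1                        # standard IEEE-754 bias
        max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias     # all-ones exponent reserved for inf/NaN
        min_normal = 2.0 ** (1 - bias)                        # smallest normalized magnitude
        epsilon = 2.0 ** -man_bits                            # spacing of values just above 1.0
        return max_finite, min_normal, epsilon

    for name, e, m in [("FP16 (IEEE half)", 5, 10), ("bfloat16", 8, 7)]:
        max_f, min_n, eps = format_properties(e, m)
        print(f"{name}: max = {max_f:.3e}, min normal = {min_n:.3e}, epsilon = {eps:.3e}")

Running this shows FP16 topping out around 6.5e4 with an epsilon near 1e-3, whereas bfloat16 reaches roughly 3.4e38 at the cost of a coarser epsilon near 8e-3, which is exactly the precision versus dynamic-range trade-off the chapter examines.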



Acknowledgements

This work is supported by FCT (Fundação para a Ciência e a Tecnologia, Portugal) through Project UIDB/50021/2020, by the DiPET project (grant agreement EP/T022345/1 and CHIST-ERA Consortium of European Funding Agencies project no. CHIST-ERA-18-SDCDN-002), by the OPRECOMP project (European Union's H2020-EU.1.2.2. FET Proactive research and innovation programme, grant agreement no. 732631), and by the Entrans project (EU Marie Curie Fellowship, grant agreement no. 798209).

Author information


Corresponding author

Correspondence to Leonel Sousa.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sabbagh Molahosseini, A., Sousa, L., Emrani Zarandi, A.A., Vandierendonck, H. (2022). Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific. In: Liu, W., Lombardi, F. (eds) Approximate Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-98347-5_4


  • DOI: https://doi.org/10.1007/978-3-030-98347-5_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98346-8

  • Online ISBN: 978-3-030-98347-5

  • eBook Packages: Computer Science, Computer Science (R0)
