Abstract
Low-precision floating-point formats enhance performance by providing only the precision an application actually requires. General-purpose formats, such as the IEEE half-precision floating-point (FP) format, are not practical for every application: at small bit widths, precision and dynamic range must be balanced carefully, and applications call for bespoke number formats. This chapter provides a comprehensive review of low-precision FP formats, identifying (1) the numerical features of each format (e.g., dynamic range and precision), (2) their usage in target applications, and (3) their accuracy and performance. Finally, guidelines are provided for designing high-performance, efficient applications via customized FP formats.
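As an illustration of the precision-versus-range adjustment described above, the short Python sketch below (not taken from the chapter; it assumes standard IEEE-754-style encodings with a sign bit, a biased exponent, and a hidden leading one) derives the key numerical features of IEEE half-precision (FP16) and bfloat16 directly from their exponent and fraction bit widths.

```python
# Minimal sketch: numerical features of IEEE-754-style binary formats,
# computed from their exponent and fraction (mantissa) bit widths.

def format_features(exp_bits: int, frac_bits: int):
    """Return (max_normal, min_normal, epsilon) for an IEEE-754-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2.0 - 2.0 ** -frac_bits) * 2.0 ** bias   # largest finite value
    min_normal = 2.0 ** (1 - bias)                          # smallest normalized value
    epsilon = 2.0 ** -frac_bits                             # spacing just above 1.0
    return max_normal, min_normal, epsilon

# FP16 uses 5 exponent and 10 fraction bits; bfloat16 uses 8 and 7.
for name, e, m in [("FP16", 5, 10), ("bfloat16", 8, 7)]:
    mx, mn, eps = format_features(e, m)
    print(f"{name:9s} max={mx:.3e}  min_normal={mn:.3e}  eps={eps:.3e}")
```

Running the sketch shows the trade-off within the same 16-bit budget: FP16 tops out near 6.55e4 with a machine epsilon of about 9.8e-4, whereas bfloat16 covers roughly the FP32 range (about 3.4e38) but with a coarser epsilon of 7.8e-3.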
Acknowledgements
This work is supported by FCT (Fundação para a Ciência e a Tecnologia, Portugal) through project UIDB/50021/2020, the DiPET project (grant agreement EP/T022345/1; CHIST-ERA Consortium of European Funding Agencies project no. CHIST-ERA-18-SDCDN-002), the OPRECOMP project (European Union's H2020-EU.1.2.2 FET Proactive research and innovation programme, grant agreement no. 732631), and the Entrans project (EU Marie Curie Fellowship, grant agreement no. 798209).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sabbagh Molahosseini, A., Sousa, L., Emrani Zarandi, A.A., Vandierendonck, H. (2022). Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific. In: Liu, W., Lombardi, F. (eds) Approximate Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-98347-5_4
DOI: https://doi.org/10.1007/978-3-030-98347-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98346-8
Online ISBN: 978-3-030-98347-5
eBook Packages: Computer Science (R0)