
Towards a Better 16-Bit Number Representation for Training Neural Networks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13851)

Abstract

Error resilience in neural networks has allowed low-precision floating-point representations to be adopted for mixed-precision training to improve efficiency. Although the IEEE 754 standard has long defined a 16-bit float representation, several alternatives targeting mixed-precision training have also emerged. However, their varying numerical properties and differing hardware characteristics, among other factors, make them more or less suitable for the task, and no single 16-bit floating-point representation for neural network training is commonly accepted. In this work, we evaluate all 16-bit float variants and the upcoming posit™ number representations proposed for neural network training on a set of Convolutional Neural Networks (CNNs) and other benchmarks to compare their suitability. Posits generally achieve better results, indicating that their non-uniform accuracy distribution is more conducive to the training task. Our analysis suggests that, rather than giving every weight value the same accuracy as floats do, giving greater accuracy to the more commonly occurring weights of larger magnitude improves training results, challenging previously held assumptions and bringing new insight into dynamic range and precision requirements. We also evaluate the hardware efficiency of mixed-precision training based on FPGA implementations. Finally, we propose statistics based on the distribution of network weight values as a heuristic for selecting the number representation to be used.
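
To make the accuracy-distribution argument concrete, the sketch below (ours, not taken from the paper's artifact) estimates how many significand fraction bits a 16-bit posit and an IEEE binary16 float leave for values at a given binary scale, then weights the comparison by where a set of stand-in network weights actually fall. The es = 2 exponent-field width follows the 2022 posit standard; the Gaussian dummy weights, the helper names, and the fraction-bit bookkeeping are illustrative assumptions rather than the paper's methodology.

```python
# A minimal sketch, assuming posit es = 2 (per the 2022 posit standard) and
# Gaussian stand-in weights; it is NOT the paper's experimental setup.
import numpy as np

def posit16_fraction_bits(scale: int, es: int = 2, nbits: int = 16) -> int:
    """Fraction bits a 16-bit posit leaves for a value with binary exponent `scale`."""
    k = scale // (1 << es)                    # regime value
    regime_bits = (k + 2) if k >= 0 else (-k + 1)
    remaining = nbits - 1 - regime_bits       # bits left after sign and regime
    return max(0, remaining - es)             # exponent field may be truncated

def float16_fraction_bits(scale: int) -> int:
    """Fraction bits of IEEE binary16: 10 for normals, fewer for subnormals."""
    if scale < -24:
        return 0                              # underflows to zero
    if scale < -14:
        return 10 - (-14 - scale)             # subnormals lose leading fraction bits
    return 10 if scale <= 15 else 0           # normal range; overflow above 2**15

# Stand-in for trained weights: zero-mean Gaussian (an assumption, not real data).
weights = np.random.default_rng(0).normal(0.0, 0.05, size=100_000)
scales = np.floor(np.log2(np.abs(weights))).astype(int)

# Average fraction bits each format offers, weighted by how often each scale occurs.
vals, counts = np.unique(scales, return_counts=True)
posit_avg = np.average([posit16_fraction_bits(int(s)) for s in vals], weights=counts)
float_avg = np.average([float16_fraction_bits(int(s)) for s in vals], weights=counts)
print(f"median weight scale: 2^{int(np.median(scales))}")
print(f"weighted fraction bits  posit16: {posit_avg:.2f}   float16: {float_avg:.2f}")
```

Under these assumptions, posit16 matches or exceeds float16's fixed 10 fraction bits for magnitudes between roughly 2^-8 and 2^8, the region where the stand-in weights concentrate, which mirrors the heuristic the abstract proposes: pick the representation whose accuracy peak overlaps the observed weight distribution.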

Author information

Corresponding author

Correspondence to Himeshi De Silva.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

De Silva, H., Tan, H., Ho, N.M., Gustafson, J.L., Wong, W.F. (2023). Towards a Better 16-Bit Number Representation for Training Neural Networks. In: Gustafson, J., Leong, S.H., Michalewicz, M. (eds) Next Generation Arithmetic. CoNGA 2023. Lecture Notes in Computer Science, vol 13851. Springer, Cham. https://doi.org/10.1007/978-3-031-32180-1_8

  • DOI: https://doi.org/10.1007/978-3-031-32180-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-32179-5

  • Online ISBN: 978-3-031-32180-1

  • eBook Packages: Computer Science, Computer Science (R0)
