
Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures

  • Conference paper in Parallel Processing and Applied Mathematics (PPAM 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Abstract

Floating-point computations generally incur rounding errors; as a result, computed values may be inaccurate and may differ between runs or platforms (non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can be a crucial issue both when debugging complex codes and for the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines, inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM), on NVIDIA’s Volta GPU by comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate reproducibility between CPU and GPU, as well as the achieved accuracy.
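To make the approach concrete, the following is a minimal, illustrative sketch in C of an Ozaki-scheme-style reproducible dot product. Everything here is an assumption for illustration: the function names (split_step, ozaki_dot), the choice of the splitting constant, and the truncation policy (keep d slices per vector, drop the remainder) follow the general construction of Ozaki et al. [10], not the paper's actual Algorithm 1 or its GPU implementation.

```c
/* Illustrative sketch only -- not the paper's implementation.
 * Compile with strict IEEE-754 semantics, e.g. gcc -O2 -ffp-contract=off ... -lm */
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Error-free extraction step: moves the high-order bits of x into hi and
 * leaves the remainder in x. The constant sigma is chosen so that each
 * slice keeps few enough significand bits that length-n inner products
 * of slices incur no rounding error (cf. Ozaki et al. [10]). */
static void split_step(int n, double *x, double *hi)
{
    double amax = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > amax) amax = fabs(x[i]);
    if (amax == 0.0) { memset(hi, 0, (size_t)n * sizeof *hi); return; }
    int rho = (int)ceil((53.0 + ceil(log2((double)n))) / 2.0);
    double sigma = ldexp(1.0, rho + (int)ceil(log2(amax)));
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + sigma; /* volatile blocks re-association */
        hi[i] = t - sigma;                /* high-order slice               */
        x[i] -= hi[i];                    /* exact remainder stays in x     */
    }
}

/* Reproducible dot product with tunable accuracy: split x and y into d
 * slices each, drop the final remainders, and accumulate the d*d slice
 * products in a fixed order. Each slice product is exact by construction,
 * so the result depends only on d and the inputs -- not on the platform,
 * thread count, or summation order inside the slices. */
double ozaki_dot(int n, const double *x, const double *y, int d)
{
    double *xr = malloc((size_t)n * sizeof *xr);
    double *yr = malloc((size_t)n * sizeof *yr);
    double *xs = malloc((size_t)d * n * sizeof *xs);
    double *ys = malloc((size_t)d * n * sizeof *ys);
    memcpy(xr, x, (size_t)n * sizeof *xr);
    memcpy(yr, y, (size_t)n * sizeof *yr);
    for (int k = 0; k < d; k++) {            /* d extraction steps */
        split_step(n, xr, xs + (size_t)k * n);
        split_step(n, yr, ys + (size_t)k * n);
    }
    double sum = 0.0;                        /* fixed summation order */
    for (int p = 0; p < d; p++)
        for (int q = 0; q < d; q++) {
            double s = 0.0;                  /* exact: no rounding here */
            for (int i = 0; i < n; i++)
                s += xs[(size_t)p * n + i] * ys[(size_t)q * n + i];
            sum += s;
        }
    free(xr); free(yr); free(xs); free(ys);
    return sum;
}
```

Increasing d trades performance for accuracy, which is the "tunable accuracy" of the title. In the paper's routines, the slice products are delegated to vendor level-3 BLAS (e.g., cuBLAS GEMM) rather than computed with the naive loops above; this is what makes the method fast on GPUs.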


Notes

  1. http://www.netlib.org/xblas.

  2. http://mplapack.sourceforge.net.

  3. https://software.intel.com/en-us/mkl.

  4. https://developer.nvidia.com/cublas.

  5. https://bebop.cs.berkeley.edu/reproblas.

  6. https://exblas.lip6.fr.

  7. It must be implemented based on the standard floating-point inner product, without a divide-and-conquer approach such as Strassen’s algorithm.

  8. Available at http://www.math.twcu.ac.jp/ogita/post-k/results.html.

  9. The “+1” corresponds to the working space for storing x at line 10 in Algorithm 1, which can be shared between the two input matrices.

  10. http://www.mpfr.org.

  11. In Algorithm 1, line 9 loads x and stores \(x_{\mathrm {split}}\), and line 10 stores x. Line 11 can be folded into the store at line 10 and need not be performed at the end; instead, line 3 needs to be performed only on the first iteration. Thus, three vector accesses occur per d on each vector, as the sketch after these notes illustrates.
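The access count in note 11 can be seen concretely in one splitting step, rendered below in C. This is a hypothetical rendering: sigma and x_split are illustrative names, and the line numbers in the comments refer to Algorithm 1 in the paper, which is not reproduced on this page.

```c
/* Hypothetical rendering of one splitting step of Algorithm 1;
 * sigma is the extraction constant for the current split level. */
static void split_once(int n, double *x, double *x_split, double sigma)
{
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + sigma; /* line 9: one load of x        */
        x_split[i] = t - sigma;           /* line 9: one store of x_split */
        x[i] = x[i] - x_split[i];         /* line 10: one store of x      */
    }
}
```

Per split level d, each input vector thus incurs one load and two stores: the three vector accesses the note counts.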

References

  1. Chohra, C., Langlois, P., Parello, D.: Reproducible, accurately rounded and efficient BLAS. In: Desprez, F., et al. (eds.) Euro-Par 2016. LNCS, vol. 10104, pp. 609–620. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58943-5_49

  2. Demmel, J., Ahrens, P., Nguyen, H.D.: Efficient reproducible floating point summation and BLAS. Technical report, UCB/EECS-2016-121, EECS Department, University of California, Berkeley (2016)

  3. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990). https://doi.org/10.1145/77626.79170

  4. Dongarra, J., Hammarling, S., Higham, N.J., Relton, S.D., Valero-Lara, P., Zounon, M.: The design and performance of batched BLAS on modern high-performance computing systems. In: International Conference on Computational Science (ICCS 2017), vol. 108, pp. 495–504 (2017). https://doi.org/10.1016/j.procs.2017.05.138

5. Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: a multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33(2), 13:1–13:15 (2007). https://doi.org/10.1145/1236463.1236468

  6. Iakymchuk, R., Collange, S., Defour, D., Graillat, S.: ExBLAS: reproducible and accurate BLAS library. In: Proceedings of the Numerical Reproducibility at Exascale (NRE2015) at SC 2015 (2015)

  7. Ichimura, S., Katagiri, T., Ozaki, K., Ogita, T., Nagai, T.: Threaded accurate matrix-matrix multiplications with sparse matrix-vector multiplications. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1093–1102 (2018). https://doi.org/10.1109/IPDPSW.2018.00168

8. Li, X.S., et al.: Design, implementation and testing of extended and mixed precision BLAS. ACM Trans. Math. Softw. 28(2), 152–205 (2002). https://doi.org/10.1145/567806.567808

  9. Nakata, M.: The MPACK; multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK). http://mplapack.sourceforge.net

  10. Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numer. Algorithms 59(1), 95–118 (2012). https://doi.org/10.1007/s11075-011-9478-1

11. Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation part II: sign, K-fold faithful and rounding to nearest. SIAM J. Sci. Comput. 31(2), 1269–1302 (2009). https://doi.org/10.1137/07068816X

  12. Todd, R.: Introduction to Conditional Numerical Reproducibility (CNR) (2012). https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr


Acknowledgment

This research was partially supported by MEXT as “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.

Author information


Corresponding author

Correspondence to Daichi Mukunoki.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Mukunoki, D., Ogita, T., Ozaki, K. (2020). Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_44


  • DOI: https://doi.org/10.1007/978-3-030-43229-4_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43228-7

  • Online ISBN: 978-3-030-43229-4

  • eBook Packages: Computer Science, Computer Science (R0)
