Abstract
Floating-point computations generally involve rounding errors; results may therefore be inaccurate and may differ between computing environments (non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can be a crucial issue both when debugging complex codes and for the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on top of level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines: inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM) on NVIDIA's Volta GPU, comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate reproducibility between CPU and GPU, as well as the achieved accuracy.
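To illustrate the idea behind the Ozaki scheme mentioned above, the following is a greatly simplified, single-level Python sketch of an error-free splitting applied to a dot product. The real scheme splits recursively until the remainders vanish and is built on optimized BLAS; the function names and the exact choice of splitting constant here are our own illustration under common assumptions (FP64, 53-bit significand), not the paper's code.

```python
import math

def split(vec, n):
    # Keep at most rho leading significand bits per element so that a
    # length-n dot product of the "high" parts accumulates without rounding.
    rho = math.ceil((53 - math.log2(n)) / 2)
    mu = max(abs(v) for v in vec)
    if mu == 0.0:
        return list(vec), [0.0] * len(vec)
    sigma = 2.0 ** (rho + math.ceil(math.log2(mu)))
    hi = [(v + sigma) - sigma for v in vec]   # error-free bit extraction
    lo = [v - h for v, h in zip(vec, hi)]     # exact remainder
    return hi, lo

def ozaki_dot(x, y):
    # One splitting level gives four partial dot products computed with
    # ordinary floating-point arithmetic; the hi*hi part is exact.
    n = len(x)
    xh, xl = split(x, n)
    yh, yl = split(y, n)
    parts = [sum(a * b for a, b in zip(u, v))
             for u, v in ((xh, yh), (xh, yl), (xl, yh), (xl, yl))]
    return math.fsum(parts)  # accurate summation of the partial results
```

Because each partial product is computed with standard floating-point operations in a fixed order, the result is the same on any IEEE 754 platform, which is the source of reproducibility in this approach.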
Notes
- 7. It must be implemented based on the standard floating-point inner product, without using divide-and-conquer approaches such as Strassen's algorithm.
- 8. Available at http://www.math.twcu.ac.jp/ogita/post-k/results.html.
- 9. “+1” corresponds to the working space for storing x at line 10 in Algorithm 1, which can be shared between the two input matrices.
- 11. In Algorithm 1, line 9 loads x and stores \(x_{\mathrm {split}}\), and line 10 stores x. Line 11 can be performed as part of line 10 and does not need to be performed at the end; instead, line 3 needs to be performed only the first time. Thus, three vector accesses occur per d on each vector.
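The notes above require that the splitting be combined with standard floating-point GEMM calls (note 7 excludes fast algorithms such as Strassen's). As a rough, self-contained Python sketch of how an Ozaki-style matrix product can be assembled from standard GEMMs, here is a single-level version; the triple-loop `gemm` stands in for a vendor routine such as cuBLAS DGEMM, and all names and constants are our own illustration rather than the paper's implementation.

```python
import math

def split_matrix(M, k):
    # One splitting step per row: extract leading bits so that products of
    # extracted parts accumulate exactly over an inner dimension of k.
    rho = math.ceil((53 - math.log2(k)) / 2)
    hi, lo = [], []
    for row in M:
        mu = max(abs(v) for v in row)
        if mu == 0.0:
            hi.append(list(row))
            lo.append([0.0] * len(row))
            continue
        sigma = 2.0 ** (rho + math.ceil(math.log2(mu)))
        h = [(v + sigma) - sigma for v in row]
        hi.append(h)
        lo.append([v - a for v, a in zip(row, h)])
    return hi, lo

def gemm(A, B):
    # Stand-in for a vendor GEMM: plain FP64 triple loop, standard inner product
    # (no Strassen-style divide and conquer, as note 7 requires).
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def ozaki_gemm(A, B):
    k = len(B)  # inner dimension
    A1, A2 = split_matrix(A, k)
    # B must be split column-wise: transpose, split rows, transpose back.
    Bt = [list(c) for c in zip(*B)]
    B1t, B2t = split_matrix(Bt, k)
    B1 = [list(r) for r in zip(*B1t)]
    B2 = [list(r) for r in zip(*B2t)]
    # Four partial products, each a standard GEMM; A1*B1 is exact.
    parts = [gemm(A1, B1), gemm(A1, B2), gemm(A2, B1), gemm(A2, B2)]
    n, m = len(A), len(B[0])
    return [[math.fsum(P[i][j] for P in parts) for j in range(m)]
            for i in range(n)]
```

Increasing the number of splitting levels (here fixed at one) is what makes the accuracy tunable: more slices mean more standard GEMM calls but a more accurate final sum.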
Acknowledgment
This research was partially supported by MEXT as “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Mukunoki, D., Ogita, T., Ozaki, K. (2020). Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science(), vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4