
Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures

  • Conference paper in Parallel Processing and Applied Mathematics (PPAM 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Abstract

Floating-point computations generally incur rounding errors; as a result, computed values may be inaccurate and may differ between runs or platforms (non-reproducible). Heterogeneous computing, in particular, introduces many factors that affect reproducibility. The loss of accuracy and reproducibility can be a crucial issue both when debugging complex codes and for the reliability of computations. In this paper, we propose high-performance implementations of reproducible basic linear algebra subprograms (BLAS) routines with tunable accuracy for many-core architectures. Our approach is based on an accurate matrix-multiplication method, the Ozaki scheme, which can be built on level-3 BLAS routines that perform standard floating-point operations. We demonstrate the performance of three routines, inner product (DOT), matrix-vector multiplication (GEMV), and matrix multiplication (GEMM), on NVIDIA’s Volta GPU by comparing them with the standard routines provided by the vendor. Furthermore, we demonstrate reproducibility between CPU and GPU, as well as the achieved accuracy.
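To make the approach concrete, the following is a minimal, illustrative sketch in C of an Ozaki-scheme-style reproducible dot product. Everything here is an assumption for illustration: the function names (split_step, ozaki_dot), the choice of the splitting constant, and the truncation policy (keep d slices per vector, drop the remainder) follow the general construction of Ozaki et al. [10], not the paper's actual Algorithm 1 or its GPU implementation.

```c
/* Illustrative sketch only -- not the paper's implementation.
 * Compile with strict IEEE-754 semantics, e.g. gcc -O2 -ffp-contract=off ... -lm */
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Error-free extraction step: moves the high-order bits of x into hi and
 * leaves the remainder in x. The constant sigma is chosen so that each
 * slice keeps few enough significand bits that length-n inner products
 * of slices incur no rounding error (cf. Ozaki et al. [10]). */
static void split_step(int n, double *x, double *hi)
{
    double amax = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > amax) amax = fabs(x[i]);
    if (amax == 0.0) { memset(hi, 0, (size_t)n * sizeof *hi); return; }
    int rho = (int)ceil((53.0 + ceil(log2((double)n))) / 2.0);
    double sigma = ldexp(1.0, rho + (int)ceil(log2(amax)));
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + sigma; /* volatile blocks re-association */
        hi[i] = t - sigma;                /* high-order slice               */
        x[i] -= hi[i];                    /* exact remainder stays in x     */
    }
}

/* Reproducible dot product with tunable accuracy: split x and y into d
 * slices each, drop the final remainders, and accumulate the d*d slice
 * products in a fixed order. Each slice product is exact by construction,
 * so the result depends only on d and the inputs -- not on the platform,
 * thread count, or summation order inside the slices. */
double ozaki_dot(int n, const double *x, const double *y, int d)
{
    double *xr = malloc((size_t)n * sizeof *xr);
    double *yr = malloc((size_t)n * sizeof *yr);
    double *xs = malloc((size_t)d * n * sizeof *xs);
    double *ys = malloc((size_t)d * n * sizeof *ys);
    memcpy(xr, x, (size_t)n * sizeof *xr);
    memcpy(yr, y, (size_t)n * sizeof *yr);
    for (int k = 0; k < d; k++) {            /* d extraction steps */
        split_step(n, xr, xs + (size_t)k * n);
        split_step(n, yr, ys + (size_t)k * n);
    }
    double sum = 0.0;                        /* fixed summation order */
    for (int p = 0; p < d; p++)
        for (int q = 0; q < d; q++) {
            double s = 0.0;                  /* exact: no rounding here */
            for (int i = 0; i < n; i++)
                s += xs[(size_t)p * n + i] * ys[(size_t)q * n + i];
            sum += s;
        }
    free(xr); free(yr); free(xs); free(ys);
    return sum;
}
```

Increasing d trades performance for accuracy, which is the "tunable accuracy" of the title. In the paper's routines, the slice products are delegated to vendor level-3 BLAS (e.g., cuBLAS GEMM) rather than computed with the naive loops above; this is what makes the method fast on GPUs.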


Notes

  1. http://www.netlib.org/xblas.

  2. http://mplapack.sourceforge.net.

  3. https://software.intel.com/en-us/mkl.

  4. https://developer.nvidia.com/cublas.

  5. https://bebop.cs.berkeley.edu/reproblas.

  6. https://exblas.lip6.fr.

  7. It must be implemented based on the standard floating-point inner product, without a divide-and-conquer approach such as Strassen’s algorithm.

  8. Available at http://www.math.twcu.ac.jp/ogita/post-k/results.html.

  9. The “+1” corresponds to the working space for storing x at line 10 in Algorithm 1, which can be shared between the two input matrices.

  10. http://www.mpfr.org.

  11. In Algorithm 1, line 9 loads x and stores \(x_{\mathrm {split}}\), and line 10 stores x. Line 11 can be folded into the store at line 10 and need not be performed at the end; instead, line 3 needs to be performed only on the first iteration. Thus, three vector accesses occur per d on each vector, as the sketch after these notes illustrates.
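The access count in note 11 can be seen concretely in one splitting step, rendered below in C. This is a hypothetical rendering: sigma and x_split are illustrative names, and the line numbers in the comments refer to Algorithm 1 in the paper, which is not reproduced on this page.

```c
/* Hypothetical rendering of one splitting step of Algorithm 1;
 * sigma is the extraction constant for the current split level. */
static void split_once(int n, double *x, double *x_split, double sigma)
{
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + sigma; /* line 9: one load of x        */
        x_split[i] = t - sigma;           /* line 9: one store of x_split */
        x[i] = x[i] - x_split[i];         /* line 10: one store of x      */
    }
}
```

Per split level d, each input vector thus incurs one load and two stores: the three vector accesses the note counts.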

References

  1. Chohra, C., Langlois, P., Parello, D.: Reproducible, accurately rounded and efficient BLAS. In: Desprez, F., et al. (eds.) Euro-Par 2016. LNCS, vol. 10104, pp. 609–620. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58943-5_49

  2. Demmel, J., Ahrens, P., Nguyen, H.D.: Efficient reproducible floating point summation and BLAS. Technical report, UCB/EECS-2016-121, EECS Department, University of California, Berkeley (2016)

  3. Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990). https://doi.org/10.1145/77626.79170

  4. Dongarra, J., Hammarling, S., Higham, N.J., Relton, S.D., Valero-Lara, P., Zounon, M.: The design and performance of batched BLAS on modern high-performance computing systems. In: International Conference on Computational Science (ICCS 2017), vol. 108, pp. 495–504 (2017). https://doi.org/10.1016/j.procs.2017.05.138

5. Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: a multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33(2), 13:1–13:15 (2007). https://doi.org/10.1145/1236463.1236468

  6. Iakymchuk, R., Collange, S., Defour, D., Graillat, S.: ExBLAS: reproducible and accurate BLAS library. In: Proceedings of the Numerical Reproducibility at Exascale (NRE2015) at SC 2015 (2015)

  7. Ichimura, S., Katagiri, T., Ozaki, K., Ogita, T., Nagai, T.: Threaded accurate matrix-matrix multiplications with sparse matrix-vector multiplications. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1093–1102 (2018). https://doi.org/10.1109/IPDPSW.2018.00168

8. Li, X.S., et al.: Design, implementation and testing of extended and mixed precision BLAS. ACM Trans. Math. Softw. 28(2), 152–205 (2002). https://doi.org/10.1145/567806.567808

  9. Nakata, M.: The MPACK; multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK). http://mplapack.sourceforge.net

  10. Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numer. Algorithms 59(1), 95–118 (2012). https://doi.org/10.1007/s11075-011-9478-1

11. Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation part II: sign, K-fold faithful and rounding to nearest. SIAM J. Sci. Comput. 31(2), 1269–1302 (2009). https://doi.org/10.1137/07068816X

  12. Todd, R.: Introduction to Conditional Numerical Reproducibility (CNR) (2012). https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr


Acknowledgment

This research was partially supported by MEXT as “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.

Author information


Corresponding author

Correspondence to Daichi Mukunoki.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Mukunoki, D., Ogita, T., Ozaki, K. (2020). Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_44


  • DOI: https://doi.org/10.1007/978-3-030-43229-4_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43228-7

  • Online ISBN: 978-3-030-43229-4

  • eBook Packages: Computer Science, Computer Science (R0)
