Abstract
There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can run into the trillions of operations. LQCD computations solve a sparse linear system using a Wilson Dslash kernel, which has an arithmetic intensity of 0.88–2.29. This makes Dslash memory bandwidth-bound on most architectures, including Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. One class of LQCD computations, however, solves systems with multiple right-hand sides (MRHS), which presents additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
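The bandwidth-bound claim can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper: the function name `roofline_gflops` and both machine parameters are hypothetical placeholders chosen only for illustration, and the arithmetic intensity is assumed to be expressed in floating-point operations per byte.

```python
def roofline_gflops(arithmetic_intensity, bandwidth_gb_s, peak_gflops):
    """Upper bound on sustained GFLOPS under the roofline model.

    A kernel can run no faster than the machine's compute peak, and no
    faster than (bytes delivered per second) * (flops done per byte).
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


# Hypothetical machine parameters (assumptions, NOT figures from the paper).
ASSUMED_BANDWIDTH_GB_S = 150.0   # sustained memory bandwidth
ASSUMED_PEAK_GFLOPS = 2000.0     # single-precision compute peak

if __name__ == "__main__":
    # The two intensities are the range quoted in the abstract.
    for intensity in (0.88, 2.29):
        bound = roofline_gflops(
            intensity, ASSUMED_BANDWIDTH_GB_S, ASSUMED_PEAK_GFLOPS
        )
        if bound < ASSUMED_PEAK_GFLOPS:
            regime = "bandwidth-bound"
        else:
            regime = "compute-bound"
        print(f"intensity {intensity:.2f}: at most {bound:.1f} GFLOPS ({regime})")
```

Under these assumed parameters, both ends of the quoted intensity range fall well below the compute peak, which is the sense in which the kernel is bandwidth-bound. Raising the effective intensity, for example by reusing loaded data across several right-hand sides, raises the bound.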
Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.
Acknowledgments
This work was partially supported by a grant from Jefferson Lab. Aaron Walden and Sabbir Khan were also partially supported by the Old Dominion University Modeling and Simulation Fellowship Program and gratefully acknowledge this support. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Walden, A., Khan, S., Joó, B., Ranjan, D., Zubair, M. (2016). Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_28
DOI: https://doi.org/10.1007/978-3-319-46079-6_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)