research-article

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Published: 12 November 2011

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2-3x higher than that of a well-known implementation from the Chroma software suite running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce the external memory bandwidth requirements of LQCD, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32³ × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver using the Wilson-Dslash operator achieves 2.95 Tflops.
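
The cluster-level figures quoted in the abstract can be broken down into per-node quantities with simple arithmetic. The sketch below is illustrative only. It uses just the numbers stated above, reads the lattice size as 32^3 x 256 sites, and treats "over 4 Tflops" as exactly 4.0 Tflops, so the Dslash rates are lower bounds. The names `LATTICE`, `summarize`, and the dictionary keys are hypothetical and do not come from the paper.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the abstract.
# All names here are illustrative assumptions, not taken from the paper.

LATTICE = (32, 32, 32, 256)   # lattice extents, read as 32^3 x 256 sites
NODES = 128                   # nodes in the strong-scaling run
CORES = 1536                  # total cores across those nodes
DSLASH_TFLOPS = 4.0           # "over 4 Tflops": used here as a lower bound
CG_TFLOPS = 2.95              # full Conjugate Gradients figure


def summarize():
    """Return per-node and per-core quantities derived from the abstract."""
    sites = 1
    for extent in LATTICE:
        sites *= extent
    return {
        "total_sites": sites,
        "cores_per_node": CORES // NODES,
        "sites_per_node": sites // NODES,
        "dslash_gflops_per_node": DSLASH_TFLOPS * 1000 / NODES,
        "dslash_gflops_per_core": DSLASH_TFLOPS * 1000 / CORES,
        "cg_gflops_per_node": CG_TFLOPS * 1000 / NODES,
        "cg_to_dslash_ratio": CG_TFLOPS / DSLASH_TFLOPS,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Under these assumptions, each of the 128 nodes owns 65,536 lattice sites and runs 12 cores, and the Dslash rate works out to at least 31.25 Gflops per node. The Conjugate Gradients figure is at most about 74% of the Dslash-only rate.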

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '11 Paper Acceptance Rate: 74 of 352 submissions, 21%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Cited By

  • (2019) Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL. 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 14-25. DOI: 10.1109/P3HPC49587.2019.00007. Online publication date: Nov-2019
  • (2017) Scalable NUMA-Aware Wilson-Dirac on Supercomputers. 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 315-324. DOI: 10.1109/HPCS.2017.56. Online publication date: Jul-2017
  • (2016) Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner. High Performance Computing, pages 390-401. DOI: 10.1007/978-3-319-46079-6_28. Online publication date: 6-Oct-2016
  • (2015) Lattice-CSC: Optimizing and Building an Efficient Supercomputer for Lattice-QCD and to Achieve First Place in Green500. High Performance Computing, pages 179-196. DOI: 10.1007/978-3-319-20119-1_14. Online publication date: 20-Jun-2015
  • (2014) Enabling efficient multithreaded MPI communication through a library-based implementation of MPI endpoints. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 487-498. DOI: 10.1109/SC.2014.45. Online publication date: 16-Nov-2014
  • (2013) Lattice QCD based on OpenCL. Computer Physics Communications, 184:9, pages 2042-2052. DOI: 10.1016/j.cpc.2013.03.020. Online publication date: Sep-2013
  • (2013) Lattice QCD on Intel® Xeon Phi™ Coprocessors. Supercomputing, pages 40-54. DOI: 10.1007/978-3-642-38750-0_4. Online publication date: 2013
  • (2012) Peta-scale lattice quantum chromodynamics on a Blue Gene/Q supercomputer. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-10. DOI: 10.5555/2388996.2389058. Online publication date: 10-Nov-2012
  • (2012) High-efficiency Lattice QCD computations on the Fermi architecture. 2012 Innovative Parallel Computing (InPar), pages 1-9. DOI: 10.1109/InPar.2012.6339591. Online publication date: May-2012
