research-article

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Authors: Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Jatin Chhugani, Michael A. Clark, Pradeep Dubey

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 69, Pages 1 - 11

https://doi.org/10.1145/2063384.2063477

Published: 12 November 2011

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2-3X higher than that of a well-known implementation from the Chroma software suite when running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce LQCD external memory bandwidth requirements, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32³ × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver built on the Wilson-Dslash operator achieves 2.95 Tflops.
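To make the "matrix-vector product" in the abstract concrete, the following is a minimal sketch of a nearest-neighbour hopping stencil on a periodic 4D lattice. It is an illustration under stated assumptions, not the paper's implementation: the spin structure of the real Wilson-Dslash operator is omitted (only the SU(3) colour part is kept), the lattice size is a tiny made-up one, and the names `hop`, `shift`, `mat_vec`, `dagger_vec`, and `identity_link` are hypothetical. It shows that each output site reads eight neighbouring vectors and eight link matrices, which is the memory traffic that blocking schemes aim to reduce.

```python
from itertools import product

DIMS = (4, 2, 2, 2)   # tiny illustrative lattice (x, y, z, t); not a size from the paper
NC = 3                # number of colour components for SU(3)


def identity_link():
    """Return the 3x3 identity matrix, i.e. a trivial gauge link."""
    return [[1.0 + 0j if r == c else 0j for c in range(NC)] for r in range(NC)]


def mat_vec(m, v):
    """Multiply a 3x3 complex matrix by a 3-component colour vector."""
    return [sum(m[r][c] * v[c] for c in range(NC)) for r in range(NC)]


def dagger_vec(m, v):
    """Multiply the conjugate transpose of m by the colour vector v."""
    return [sum(m[c][r].conjugate() * v[c] for c in range(NC)) for r in range(NC)]


def shift(site, mu, step):
    """Neighbour of `site` one step along direction mu, with periodic wrap-around."""
    s = list(site)
    s[mu] = (s[mu] + step) % DIMS[mu]
    return tuple(s)


def hop(links, psi):
    """Apply the colour-only hopping stencil to the field psi.

    links[(site, mu)] is the gauge link leaving `site` in direction mu.
    Assumption: spin projectors of the full Wilson-Dslash are left out.
    """
    out = {}
    for site in product(*(range(n) for n in DIMS)):
        acc = [0j] * NC
        for mu in range(4):
            fwd = shift(site, mu, +1)
            bwd = shift(site, mu, -1)
            # Forward hop uses the link at this site; backward hop uses the
            # conjugate transpose of the link stored at the backward neighbour.
            f = mat_vec(links[(site, mu)], psi[fwd])
            b = dagger_vec(links[(bwd, mu)], psi[bwd])
            acc = [a + x + y for a, x, y in zip(acc, f, b)]
        out[site] = acc
    return out


sites = list(product(*(range(n) for n in DIMS)))
links = {(s, mu): identity_link() for s in sites for mu in range(4)}
psi = {s: [1.0 + 0j, 2.0 + 0j, 3.0 + 0j] for s in sites}
result = hop(links, psi)
```

With identity links and a constant field, every site simply sums its eight neighbours, so the output is eight times the input vector. That makes a convenient sanity check before trying non-trivial links.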


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)
Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

Joo B, Kurth T, Clark M, Kim J, Trott C, Ibanez D, Sunderland D, Deslippe J (2019). Performance Portability of a Wilson Dslash Stencil Operator Mini-App Using Kokkos and SYCL. 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 14-25. Online publication date: Nov-2019.
https://doi.org/10.1109/P3HPC49587.2019.00007
Tadonki C (2017). Scalable NUMA-Aware Wilson-Dirac on Supercomputers. 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 315-324. Online publication date: Jul-2017.
https://doi.org/10.1109/HPCS.2017.56
Walden A, Khan S, Joó B, Ranjan D, Zubair M (2016). Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner. High Performance Computing, pages 390-401. Online publication date: 6-Oct-2016.
https://doi.org/10.1007/978-3-319-46079-6_28
Rohr D, Bach M, Nešković G, Lindenstruth V, Pinke C, Philipsen O (2015). Lattice-CSC: Optimizing and Building an Efficient Supercomputer for Lattice-QCD and to Achieve First Place in Green500. High Performance Computing, pages 179-196. Online publication date: 20-Jun-2015.
https://doi.org/10.1007/978-3-319-20119-1_14
Sridharan S, Dinan J, Kalamkar D, Damkroger T, Dongarra J (2014). Enabling efficient multithreaded MPI communication through a library-based implementation of MPI endpoints. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 487-498. Online publication date: 16-Nov-2014.
https://dl.acm.org/doi/10.1109/SC.2014.45
Bach M, Lindenstruth V, Philipsen O, Pinke C (2013). Lattice QCD based on OpenCL. Computer Physics Communications, 184:9, pages 2042-2052. Online publication date: Sep-2013.
https://doi.org/10.1016/j.cpc.2013.03.020
Joó B, Kalamkar D, Vaidyanathan K, Smelyanskiy M, Pamnany K, Lee V, Dubey P, Watson W (2013). Lattice QCD on Intel® Xeon Phi™ Coprocessors. Supercomputing, pages 40-54. Online publication date: 2013.
https://doi.org/10.1007/978-3-642-38750-0_4
Doi J, Hollingsworth J (2012). Peta-scale lattice quantum chromodynamics on a blue gene/Q supercomputer. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-10. Online publication date: 10-Nov-2012.
https://dl.acm.org/doi/10.5555/2388996.2389058
Clark M, Babich R (2012). High-efficiency Lattice QCD computations on the Fermi architecture. 2012 Innovative Parallel Computing (InPar), pages 1-9. Online publication date: May-2012.
https://doi.org/10.1109/InPar.2012.6339591
