Optimizing and Scaling HPCG on Tianhe-2: Early Experience

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8630)

Abstract

In this paper, we make a first attempt at optimizing and scaling HPCG on the world’s largest supercomputer, Tianhe-2. This early work focuses on optimizing the CPU code, without using the Intel Xeon Phi coprocessors. We reformulate the basic CG algorithm to minimize the cost of collective communication, and we employ several optimization techniques, including SIMDization, loop unrolling, fusion of the forward and backward sweeps, and OpenMP parallelization, to further enhance the performance of kernels such as the sparse matrix–vector multiplication, the symmetric Gauss–Seidel relaxation, and the geometric multigrid V-cycle. We successfully scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2, with nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, 6.38x higher than the reference implementation.
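As a rough illustration of the kernels named in the abstract (not the authors' implementation and not HPCG's actual data structures), the C++ sketch below shows a plain compressed-sparse-row matrix–vector product and an unoptimized symmetric Gauss–Seidel sweep. The CsrMatrix layout, the function names, and the cached diagonal array are assumptions made for this sketch; the OpenMP pragma marks only the trivially parallel SpMV loop.

// Minimal sketch of the two baseline kernels, assuming a CSR storage format.
#include <vector>

struct CsrMatrix {
    std::vector<int>    row_ptr;   // size nrows + 1; start of each row's nonzeros
    std::vector<int>    col_idx;   // column index of each nonzero
    std::vector<double> val;       // nonzero values
    std::vector<double> diag;      // cached diagonal entry of each row (assumption)
};

// Sparse matrix-vector product y = A*x, parallelized over rows with OpenMP.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}

// One symmetric Gauss-Seidel sweep on A*x = r: a forward pass over the rows
// followed by a backward pass. Each row update reuses values already updated
// in the same sweep, so the loop carries a dependence and is left sequential here.
void symgs(const CsrMatrix& A, const std::vector<double>& r, std::vector<double>& x) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    for (int i = 0; i < nrows; ++i) {            // forward sweep
        double sum = r[i];
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum -= A.val[k] * x[A.col_idx[k]];
        x[i] += sum / A.diag[i];
    }
    for (int i = nrows - 1; i >= 0; --i) {       // backward sweep
        double sum = r[i];
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum -= A.val[k] * x[A.col_idx[k]];
        x[i] += sum / A.diag[i];
    }
}

The row recurrence in symgs is what makes the smoother the hard kernel to accelerate; the forward/backward sweep fusion and the other loop-level techniques mentioned in the abstract target exactly these two passes.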

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, X., Yang, C., Liu, F., Liu, Y., Lu, Y. (2014). Optimizing and Scaling HPCG on Tianhe-2: Early Experience. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_3

  • DOI: https://doi.org/10.1007/978-3-319-11197-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11196-4

  • Online ISBN: 978-3-319-11197-1

  • eBook Packages: Computer Science, Computer Science (R0)
