Optimizing and Scaling HPCG on Tianhe-2: Early Experience

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8630)

Abstract

In this paper, we make a first attempt at optimizing and scaling HPCG on the world’s largest supercomputer, Tianhe-2. This early work focuses on optimizing the CPU code, without using the Intel Xeon Phi coprocessors. We reformulate the basic CG algorithm to minimize the cost of collective communication, and we employ several optimization techniques, including SIMDization, loop unrolling, fusion of the forward and backward sweeps, and OpenMP parallelization, to further enhance the performance of kernels such as the sparse matrix–vector multiplication, the symmetric Gauss–Seidel relaxation, and the geometric multigrid V-cycle. We successfully scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2, with nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, 6.38x higher than the reference implementation.
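As a rough illustration of the kernels named in the abstract (not the authors' implementation and not HPCG's actual data structures), the C++ sketch below shows a plain compressed-sparse-row matrix–vector product and an unoptimized symmetric Gauss–Seidel sweep. The CsrMatrix layout, the function names, and the cached diagonal array are assumptions made for this sketch; the OpenMP pragma marks only the trivially parallel SpMV loop.

// Minimal sketch of the two baseline kernels, assuming a CSR storage format.
#include <vector>

struct CsrMatrix {
    std::vector<int>    row_ptr;   // size nrows + 1; start of each row's nonzeros
    std::vector<int>    col_idx;   // column index of each nonzero
    std::vector<double> val;       // nonzero values
    std::vector<double> diag;      // cached diagonal entry of each row (assumption)
};

// Sparse matrix-vector product y = A*x, parallelized over rows with OpenMP.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}

// One symmetric Gauss-Seidel sweep on A*x = r: a forward pass over the rows
// followed by a backward pass. Each row update reuses values already updated
// in the same sweep, so the loop carries a dependence and is left sequential here.
void symgs(const CsrMatrix& A, const std::vector<double>& r, std::vector<double>& x) {
    const int nrows = static_cast<int>(A.row_ptr.size()) - 1;
    for (int i = 0; i < nrows; ++i) {            // forward sweep
        double sum = r[i];
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum -= A.val[k] * x[A.col_idx[k]];
        x[i] += sum / A.diag[i];
    }
    for (int i = nrows - 1; i >= 0; --i) {       // backward sweep
        double sum = r[i];
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum -= A.val[k] * x[A.col_idx[k]];
        x[i] += sum / A.diag[i];
    }
}

The row recurrence in symgs is what makes the smoother the hard kernel to accelerate; the forward/backward sweep fusion and the other loop-level techniques mentioned in the abstract target exactly these two passes.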

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, X., Yang, C., Liu, F., Liu, Y., Lu, Y. (2014). Optimizing and Scaling HPCG on Tianhe-2: Early Experience. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_3

  • DOI: https://doi.org/10.1007/978-3-319-11197-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11196-4

  • Online ISBN: 978-3-319-11197-1

  • eBook Packages: Computer Science, Computer Science (R0)
