Tuning a general purpose software cache library for TaihuLight’s SW26010 processor

Duan, Xiaohui; Zhang, Meng; Liu, Weiguo; Fu, Haohuan; Gan, Lin; Xue, Wei; Yang, Guangwen

doi:10.1007/s42514-020-00031-y

Regular Paper
Published: 13 May 2020

Volume 2, pages 164–182, (2020)
CCF Transactions on High Performance Computing

Xiaohui Duan^1,5,
Meng Zhang^1,5,
Weiguo Liu^1,5,
Haohuan Fu^3,5,
Lin Gan^2,4,5,
Wei Xue^2,5 &
…
Guangwen Yang^2,4,5

Abstract

The Sunway TaihuLight supercomputer has been in service for several years, and many applications have been ported to or built for it. Initially, most applications running on TaihuLight had regular memory access patterns, such as dense linear algebra, structured grids and dynamic programming. In 2018, developers published a general purpose graph processing framework, a port of LAMMPS and a sparse triangular solver. These applications have irregular memory access patterns, which require a lot of special processing to make use of the computing processing elements (CPEs) of TaihuLight. While those strategies are efficient, such processing may be difficult to apply to a wider range of applications, especially constantly changing molecular dynamics applications or dynamic unstructured grids. In this paper, we present the design of a general purpose software cache library, SWCache, which simplifies applying a software cache in kernels, together with a series of tools for tuning and modelling the performance of our software cache. After a series of optimizations, including reordering branches for better branch prediction and hand-tuning register allocation, we evaluate our implementation in two mini-apps: miniFE and miniMD. Experiments show that our tuned software cache library can be applied in these applications and provides a 20% speedup in miniMD compared to the strategies in a previous port of LAMMPS. Using our library also reduces the amount of code that has to be written. In addition, the experience of efficient macro-based programming should be valuable for further application development on CPEs, which lack C++ support.

Notes

Available at https://gitee.com/swmore/swcache-assets/blob/master/dma_macros.h.
Available at https://gitee.com/swmore/swcache-assets/blob/master/cal.h.

Author information

Authors and Affiliations

1. School of Software, Shandong University, Jinan, China: Xiaohui Duan, Meng Zhang & Weiguo Liu
2. Department of Computer Science and Technology, Tsinghua University, Beijing, China: Lin Gan, Wei Xue & Guangwen Yang
3. Ministry of Education Key Lab for Earth System Modeling, and Department of Earth System Science, Tsinghua University, Beijing, China: Haohuan Fu
4. Beijing National Research Center for Information Science and Technology, Beijing, China: Lin Gan & Guangwen Yang
5. National Supercomputing Center in Wuxi, Wuxi, China: Xiaohui Duan, Meng Zhang, Weiguo Liu, Haohuan Fu, Lin Gan, Wei Xue & Guangwen Yang

Corresponding author

Correspondence to Xiaohui Duan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendices

A Reading assemblies in this paper

Several classes of instructions are mentioned in Figs. 7 and 10. We describe them here to help readers follow the paper:

1. Binary operations: take 3 or 4 operands; the former registers are source registers and the last register is the destination. These instructions include add, sub, and, bic, s8addl, sll and srl, with optional w/l suffixes indicating whether the operation is on an integer or a long. s8addl a, b, c is equivalent to $c = 8 \cdot a + b$, and bic a, b, c is equivalent to $c = a \wedge \lnot b$. sll/srl are logical shifts to the left/right.
2. Memory operations: take two register operands and a constant offset, in the form op dest, offset(base); base + offset forms the memory address. ld/st with a w/l suffix load and store an integer/long. vldd is a vector load, and faaw fetches a 32-bit integer from shared memory and atomically adds 1 to the data in shared memory. ldi simply computes dest = base + offset, which is helpful for adding a larger constant to a register.
3. Vector manipulation: vextf a, b, c extracts the bth 64-bit element of vector register a into the lowest bits of register c; vinsf a, b, c, d inserts a into the cth 64-bit element of vector b and writes the result to vector d.
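
To make the semantics listed above concrete, the following is a minimal C sketch that models a few of these instructions as ordinary functions. It is an illustration only, not code from the paper: the function signatures, the vec256 type, the assumption that a vector register holds four 64-bit elements, and the assumption that element 0 is the lowest one are all ours.

```c
#include <assert.h>
#include <stdint.h>

/* s8addl a, b, c  is described as  c = 8*a + b  (e.g. base + 8-byte index). */
uint64_t s8addl(uint64_t a, uint64_t b) { return 8 * a + b; }

/* bic a, b, c  is described as  c = a AND (NOT b), i.e. clear the bits set in b. */
uint64_t bic(uint64_t a, uint64_t b) { return a & ~b; }

/* sll / srl: logical shifts to the left / right (n must be below 64 here). */
uint64_t sll(uint64_t a, unsigned n) { assert(n < 64); return a << n; }
uint64_t srl(uint64_t a, unsigned n) { assert(n < 64); return a >> n; }

/* ldi dest, offset(base)  is described as  dest = base + offset. */
uint64_t ldi(int64_t offset, uint64_t base) { return base + (uint64_t)offset; }

/* Assumption: a vector register is modelled as four 64-bit elements,
   with lane[0] being the lowest element. */
typedef struct { uint64_t lane[4]; } vec256;

/* vextf a, b, c: extract the b-th 64-bit element of vector a. */
uint64_t vextf(vec256 a, unsigned b) { assert(b < 4); return a.lane[b]; }

/* vinsf a, b, c, d: insert a into the c-th element of vector b,
   returning the result (the role of d) and leaving b unchanged. */
vec256 vinsf(uint64_t a, vec256 b, unsigned c)
{
    assert(c < 4);
    b.lane[c] = a;   /* b is a by-value copy, so the caller's vector is intact */
    return b;
}
```

For instance, under this model s8addl(3, 100) yields 124, the address of the fourth 8-byte element of an array starting at 100, and bic(0x1234, 0xff) yields 0x1200, which clears the low 8 bits.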

B Usage of dma_macros.h

dma_macros.h has been widely used to avoid DMA problems mentioned in §4.1.3.

B.1 Initialization

To initialize dma_macros.h, we provide two preprocessor switches to control the underlying implementation of those macros:

DMA_FAST: Use DMA intrinsics directly instead of calling athread library.
DMA_COUNT_BYTES: Add DMA interception to record bytes transferred via DMA.

These switches are turned off by default. They can be turned on either by writing define directives in the code or by defining them on the compiler command line. An example of the directive-based approach is shown in Fig. 18. For the command line approach, add -D<switch_name> to the compiler arguments, for example sw5cc -slave -DDMA_FAST <your-file>.c.

Beyond including the header file and setting these switches, the corresponding variables must be defined in each user function. This is done by expanding the macro dma_init() at the beginning of the user function.
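
The mechanism described above can be sketched on an ordinary host compiler as follows. This is not the content of dma_macros.h, whose internals are not reproduced here: the macro bodies, the helper DMA_RECORD, the variable dma_bytes and the function load_two_blocks are hypothetical stand-ins, and a plain memcpy takes the place of a real DMA transfer. The sketch only shows how a define directive placed before the macro definitions can change what they expand to, and why dma_init() has to be expanded first in the user function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Switch turned on by a define directive; deleting this line and passing
   -DDMA_COUNT_BYTES on the compiler command line would have the same effect. */
#define DMA_COUNT_BYTES

/* Hypothetical stand-ins, NOT the real macros from dma_macros.h. */
#ifdef DMA_COUNT_BYTES
#define DMA_RECORD(n) (dma_bytes += (n))   /* record transferred bytes */
#else
#define DMA_RECORD(n) ((void)0)            /* switch off: no bookkeeping */
#endif

/* dma_init() defines the local variable the other macros rely on. */
#define dma_init() size_t dma_bytes = 0

/* Stand-in for pe_get(mem, ldm, size): memcpy replaces the DMA request. */
#define pe_get(mem, ldm, size) \
    (DMA_RECORD(size), (void)memcpy((ldm), (mem), (size)))

/* Copies two blocks of n doubles and returns the recorded byte count. */
size_t load_two_blocks(const double *mem, double *ldm, size_t n)
{
    dma_init();   /* must come first so that dma_bytes exists */
    assert(mem != NULL && ldm != NULL);
    pe_get(mem, ldm, n * sizeof(double));
    pe_get(mem + n, ldm + n, n * sizeof(double));
    return dma_bytes;
}
```

With the switch defined, the function reports the total number of bytes passed to the two stand-in requests; with the switch removed, the same source compiles unchanged and the count stays at 0.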

B.2 Available macros

<mode>_<op>(mem, ldm, size): send a DMA request that transfers size bytes between mem and ldm. op can be get or put; mode can be pe, row, rank, bcast or brow. Valid combinations are shown in Table 3.
dma_syn(): wait for pending DMA requests to finish.
dma_set_stride(mode, stride, bsize): set the stride and block size of a specific DMA mode. Because users may access several arrays with the same structure, stride setting is separated from request sending.
bcast_set_mask(mode, mask): set the broadcast mask of the corresponding DMA mode; valid modes are bcast and brow. The default mask value is 0xff.
dma_reset_stride(mode): set the stride and block size to 0 for the corresponding DMA mode.
bcast_reset_mask(mode): set the broadcast mask to 0xff for the corresponding DMA mode.

Table 3 Valid combinations of DMA mode and operations in dma_macros.h
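
One plausible way to combine the macros listed above is a get, compute, put loop over tiles that fit in local memory. The sketch below runs on a host compiler and is an assumption-laden illustration: the macro names and the (mem, ldm, size) argument order follow the list above, but their bodies are hypothetical memcpy-based stand-ins rather than the real definitions, and TILE, dma_pending and scale_array are names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-ins; on a CPE these names would come
   from dma_macros.h and issue real DMA requests. */
#define dma_init()             int dma_pending = 0
#define pe_get(mem, ldm, size) ((void)memcpy((ldm), (mem), (size)), ++dma_pending)
#define pe_put(mem, ldm, size) ((void)memcpy((mem), (ldm), (size)), ++dma_pending)
#define dma_syn()              (dma_pending = 0)   /* "wait" for pending requests */

#define TILE 4   /* elements per tile; tiny on purpose for illustration */

/* Multiplies n doubles in main memory by factor, one tile at a time. */
void scale_array(double *mem, size_t n, double factor)
{
    dma_init();                 /* define the variables the macros use */
    double ldm_buf[TILE];       /* stands in for a buffer in local memory */
    assert(mem != NULL);

    for (size_t start = 0; start < n; start += TILE) {
        size_t count = (n - start < TILE) ? (n - start) : TILE;

        pe_get(mem + start, ldm_buf, count * sizeof(double));
        dma_syn();              /* data must have arrived before it is used */

        for (size_t i = 0; i < count; i++)
            ldm_buf[i] *= factor;

        pe_put(mem + start, ldm_buf, count * sizeof(double));
        dma_syn();              /* buffer must be written back before reuse */
    }
}
```

Calling dma_syn() after every request keeps the sketch simple. Whether requests can be left pending to overlap transfer and computation depends on the real macros and is not modelled here.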

About this article

Cite this article

Duan, X., Zhang, M., Liu, W. et al. Tuning a general purpose software cache library for TaihuLight’s SW26010 processor. CCF Trans. HPC 2, 164–182 (2020). https://doi.org/10.1007/s42514-020-00031-y

Received: 15 November 2019
Accepted: 15 April 2020
Published: 13 May 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s42514-020-00031-y
