Tuning a general purpose software cache library for TaihuLight’s SW26010 processor

  • Regular Paper
  • CCF Transactions on High Performance Computing

Abstract

The Sunway TaihuLight supercomputer has been in service for several years, and many applications have been ported to or built for it. Initially, most applications running on TaihuLight had regular memory access patterns, such as dense linear algebra, structured grids and dynamic programming. In 2018, developers published a general purpose graph processing framework, a ported version of LAMMPS and a sparse triangular solver. These applications have irregular memory access patterns, which require a lot of special processing to make use of the computing processing elements (CPEs) of TaihuLight. While those strategies are efficient, such processing may be difficult to apply to a wider range of applications, especially constantly changing molecular dynamics applications or dynamic unstructured grids. In this paper, we present the design of a general purpose software cache library, SWCache, which simplifies applying a software cache in kernels, together with a series of tools for tuning and modelling the performance of our software cache. After a series of optimizations, including reordering branches for better branch prediction and hand-tuning register allocation, we evaluate our implementation in two mini-apps: miniFE and miniMD. Experiments show that our tuned software cache library can be applied in these applications and provides a 20% speedup in miniMD compared to the strategies in a previous port of LAMMPS. Using our library also reduces the amount of code that has to be written. In addition, our experience with efficient macro-based programming should be valuable for further application development on CPEs, which lack C++ support.

Notes

  1. Available at https://gitee.com/swmore/swcache-assets/blob/master/dma_macros.h.

  2. Available at https://gitee.com/swmore/swcache-assets/blob/master/cal.h.

References

  • Duan, X., Gao, P., Zhang, T., Zhang, M., Liu, W., Zhang, W., Xue, W., Fu, H., Gan, L., Chen, D., et al.: Redesigning lammps for peta-scale and hundred-billion-atom simulation on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 12. IEEE Press (2018)

  • Duan, X., Xu, K., Chan, Y., Hundt, C., Schmidt, B., Balaji, P., Liu, W.: S-aligner: Ultrascalable read mapping on sunway taihu light. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 36–46. IEEE (2017)

  • Fang, J., Fu, H., Zhao, W., Chen, B., Zheng, W., Yang, G.: swdnn: A library for accelerating deep learning applications on sunway taihulight. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 615–624. IEEE (2017)

  • Fu, H., He, C., Chen, B., Yin, Z., Zhang, Z., Zhang, W., Zhang, T., Xue, W., Liu, W., Yin, W., et al.: 18.9-pflops nonlinear earthquake simulation on sunway taihulight: enabling depiction of 18-hz and 8-meter scenarios. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 2. ACM (2017)

  • Fu, H., Liao, J., Ding, N., Duan, X., Gan, L., Liang, Y., Wang, X., Yang, J., Zheng, Y., Liu, W., et al.: Redesigning cam-se for peta-scale climate modeling performance and ultra-high resolution on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 1. ACM (2017)

  • Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F., Qiao, F., et al.: The sunway taihulight supercomputer: system and applications. Sci. China Inform. Sci. 59(7), 072001 (2016)

  • Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K., Numrich, R.W.: Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574 3 (2009)

  • Lin, H., Zhu, X., Yu, B., Tang, X., Xue, W., Chen, W., Zhang, L., Hoefler, T., Ma, X., Liu, X., et al.: Shentu: processing multi-trillion edge graphs on millions of cores in seconds. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 56. IEEE Press (2018)

  • Rosinski, J.: Gptl - general purpose timing library (2014 (Accessed Oct 18, 2019)). https://jmrosinski.github.io/GPTL/

  • Wang, X., Liu, W., Xue, W., Wu, L.: swsptrsv: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 338–353 (2018)

  • Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. ACM SIGARCH Comput Arch News 23(1), 20–24 (1995)

  • Xu, K., Song, Z., Chan, Y., Wang, S., Meng, X., Liu, W., Xue, W.: Refactoring and optimizing wrf model on sunway taihulight. In: Proceedings of the 48th International Conference on Parallel Processing, p. 72. ACM (2019)

  • Yang, C., Xue, W., Fu, H., You, H., Wang, X., Ao, Y., Liu, F., Gan, L., Xu, P., Wang, L., et al.: 10m-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 6. IEEE Press (2016)

  • Yu, Y., An, H., Chen, J., Liang, W., Xu, Q., Chen, Y.: Pipelining computation and optimization strategies for scaling gromacs on the sunway many-core processor. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 18–32. Springer (2017)

  • Zhang, T., Li, Y., Gao, P., Shao, Q., Shao, M., Zhang, M., Zhang, J., Duan, X., Liu, Z., Gan, L., Fu, H., Xue, W., Liu, W., Yang, G.: Sw_gromacs: Accelerate gromacs on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 2. ACM (2019)

Author information

Corresponding author

Correspondence to Xiaohui Duan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendices

A Reading the assembly code in this paper

Several classes of instructions appear in Figs. 7 and 10. We describe them here to help readers follow the paper:

  1. Binary operations: take 3 or 4 operands; the leading registers are source registers and the last register is the destination. These instructions include add, sub, and, bic, s8addl, sll and srl, with an optional w/l suffix indicating whether the operation works on an integer or a long. s8addl a, b, c is equivalent to \(c = 8 \cdot a + b\), and bic a, b, c is equivalent to \(c = a \wedge \lnot b\). sll/srl are logical shifts to the left/right.

  2. Memory operations: take two register operands and a constant offset, written as op dest, offset(base); base + offset forms the memory address. ld/st with a w/l suffix load and store an integer/long. vldd is a vector load, and faaw fetches a 32-bit integer from shared memory and atomically increments the value in shared memory by 1. ldi simply computes dest = base + offset, which is helpful for adding a larger constant to a register.

  3. Vector manipulation: vextf a, b, c extracts the bth 64-bit element of vector register a into the lowest bits of register c; vinsf a, b, c, d inserts a into the cth 64-bit element of vector b and writes the result to vector d.
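
The semantics listed above can be restated as a small C model. This is only a reading aid, not code from SWCache or from the SW26010 toolchain: the helper names simply mirror the mnemonics, the four-lane vector type is an assumption (the text only states that elements are 64 bits wide), and faaw is assumed to return the value it fetched before the increment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: one vector register holds four 64-bit elements. */
#define LANES 4
typedef struct { uint64_t e[LANES]; } vreg;

/* s8addl a, b, c  ->  c = 8*a + b (typical use: index an array of 8-byte items) */
static inline uint64_t s8addl(uint64_t a, uint64_t b) { return 8 * a + b; }

/* bic a, b, c  ->  c = a AND (NOT b) */
static inline uint64_t bic(uint64_t a, uint64_t b) { return a & ~b; }

/* sll / srl: logical shifts to the left / right */
static inline uint64_t sll(uint64_t a, unsigned n) { assert(n < 64); return a << n; }
static inline uint64_t srl(uint64_t a, unsigned n) { assert(n < 64); return a >> n; }

/* ldi dest, offset(base)  ->  dest = base + offset */
static inline uint64_t ldi(uint64_t base, int64_t offset) {
    return base + (uint64_t)offset;
}

/* vextf a, b, c  ->  c = b-th 64-bit element of vector a */
static inline uint64_t vextf(vreg a, unsigned b) { assert(b < LANES); return a.e[b]; }

/* vinsf a, b, c, d  ->  d = vector b with its c-th element replaced by a */
static inline vreg vinsf(uint64_t a, vreg b, unsigned c) {
    assert(c < LANES);
    b.e[c] = a;          /* b is passed by value, so the caller's vector is untouched */
    return b;
}

/* faaw: fetch a 32-bit integer and atomically add 1 to the stored value.
   Assumption: the fetched (pre-increment) value is what the instruction yields. */
static inline int32_t faaw(_Atomic int32_t *addr) { return atomic_fetch_add(addr, 1); }
```

For example, under this model s8addl(3, 5) gives 29 and bic(0xFF, 0x0F) gives 0xF0, which matches the two equivalences stated in item 1.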

B Usage of dma_macros.h

dma_macros.h has been widely used to avoid the DMA problems mentioned in §4.1.3.

B.1 Initialization

To initialize dma_macros.h, we provide two preprocessor switches to control the underlying implementation of those macros:

  • DMA_FAST: Use DMA intrinsics directly instead of calling the athread library.

  • DMA_COUNT_BYTES: Add DMA interception to record bytes transferred via DMA.

Fig. 18 Initializing dma_macros.h with switches

These switches are turned off by default. They can be turned on either by writing define directives in the code or by defining them on the compiler command line. An example of the directive-based approach is shown in Fig. 18. For the command-line approach, add -D<switch_name> to the compiler arguments, for example sw5cc -slave -DDMA_FAST <your-file>.c.

Besides including the header file and setting these switches, the corresponding variables must be defined in each user function. This is done by expanding the macro dma_init() at the beginning of the user function.

B.2 Available macros

  • <mode>_<op>(mem, ldm, size): send a DMA request that transfers size bytes between mem and ldm. op can be get or put; mode can be pe, row, rank, bcast or brow. Valid combinations are shown in Table 3.

  • dma_syn(): wait for pending DMA requests to finish.

  • dma_set_stride(mode, stride, bsize): set the stride and block size of a specific DMA mode. Because users may access several arrays with the same structure, stride setting is separated from request sending.

  • bcast_set_mask(mode, mask): set the broadcast mask of the corresponding DMA mode. Valid modes are bcast and brow. The default mask value is 0xff.

  • dma_reset_stride(mode): set the stride and block size to 0 for the corresponding DMA mode.

  • bcast_reset_mask(mode): set the broadcast mask to 0xff for the corresponding DMA mode.

Table 3 Valid combinations of DMA mode and operations in dma_macros.h
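
The calling pattern described in this appendix (turn on a switch, expand dma_init(), issue get/put requests, then wait with dma_syn()) can be sketched on an ordinary host. The macro bodies below are memcpy-based stand-ins written for illustration and are not the real dma_macros.h, which issues DMA requests on a CPE. The dma_ctx variable, its fields, the TILE size and the scale_array kernel are all hypothetical names, and how the real DMA_COUNT_BYTES switch exposes its byte count is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Directive-based switch, placed before the macro definitions it controls. */
#define DMA_COUNT_BYTES

/* ---- Host-side stand-ins, NOT the real dma_macros.h ---- */
typedef struct { size_t bytes; int pending; } mock_dma_state;

#ifdef DMA_COUNT_BYTES
#define MOCK_COUNT(n) (dma_ctx.bytes += (n))   /* record bytes transferred */
#else
#define MOCK_COUNT(n) ((void)0)                /* counting compiled out */
#endif

/* dma_init() declares the per-function state the other macros rely on. */
#define dma_init() mock_dma_state dma_ctx = {0, 0}

/* <mode>_<op>(mem, ldm, size): here only the pe mode, modeled with memcpy. */
#define pe_get(mem, ldm, size) \
    do { memcpy((ldm), (mem), (size)); MOCK_COUNT(size); dma_ctx.pending++; } while (0)
#define pe_put(mem, ldm, size) \
    do { memcpy((mem), (ldm), (size)); MOCK_COUNT(size); dma_ctx.pending++; } while (0)

/* dma_syn(): the mock copies are already complete, so just clear the counter. */
#define dma_syn() (dma_ctx.pending = 0)

#define TILE 64  /* hypothetical number of doubles that fit in the local buffer */

/* Hypothetical kernel: scale n doubles in "main memory" one tile at a time.
   Returns the number of bytes moved (0 if DMA_COUNT_BYTES is not defined). */
static inline size_t scale_array(double *mem, size_t n, double factor) {
    dma_init();                       /* expanded at the beginning of the function */
    double ldm[TILE];                 /* stands in for a buffer in local memory */
    for (size_t i = 0; i < n; i += TILE) {
        size_t len = (n - i < TILE) ? n - i : TILE;
        pe_get(mem + i, ldm, len * sizeof(double));
        dma_syn();                    /* data must have arrived before use */
        for (size_t j = 0; j < len; j++)
            ldm[j] *= factor;
        pe_put(mem + i, ldm, len * sizeof(double));
        dma_syn();                    /* buffer must be written back before reuse */
    }
    return dma_ctx.bytes;
}
```

In this mock, scaling 100 doubles moves each element once in each direction, so the counter ends at 2 × 100 × sizeof(double) bytes. Separating the request macros from dma_syn() mirrors the interface listed above.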

About this article

Cite this article

Duan, X., Zhang, M., Liu, W. et al. Tuning a general purpose software cache library for TaihuLight’s SW26010 processor. CCF Trans. HPC 2, 164–182 (2020). https://doi.org/10.1007/s42514-020-00031-y

  • DOI: https://doi.org/10.1007/s42514-020-00031-y
