Adapting combined tiling to stencil optimizations on sunway processor

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

Stencil computation is an indispensable pattern in scientific applications and a long-standing optimization target in high performance computing (HPC). The Sunway processor used in the Sunway TaihuLight supercomputer has demonstrated its performance potential with a unique heterogeneous many-core architecture. Although a large number of optimization methods have been proposed, the memory-bound nature of stencil computation and the limited memory bandwidth of the Sunway processor make it challenging to run stencil computation efficiently on this platform. To make better use of the processor's computation capability, we propose a combined tiling optimization for stencil computation tailored to its architectural features. In addition, we implement double buffering, vectorization, and register communication to further accelerate stencil computation on the Sunway processor. We evaluate our method on six stencil benchmarks with different orders and shapes (and thus different memory access patterns and computation intensities). The experimental results show that our implementation achieves a 1.97\(\times\) speedup on average over the state-of-the-art stencil implementation on Sunway.
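As a rough illustration of what tiling means for a stencil, the following minimal Python sketch applies one sweep of a 2D 5-point averaging stencil in two ways: directly over the whole grid, and tile by tile through a small local buffer. It is not the combined tiling scheme of the paper, whose details are not given in the abstract. The function names, the 0.2 weights, the tile sizes, and the use of a local buffer as a stand-in for fast on-chip memory are all assumptions made for illustration.

```python
def stencil_reference(grid):
    """One sweep of a 2D 5-point averaging stencil over interior points."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out


def stencil_tiled(grid, tile_rows=4, tile_cols=4):
    """Same sweep, visiting the interior tile by tile (hypothetical helper).

    Each tile is first copied, together with a one-cell halo, into a small
    local buffer (assumed here to model fast on-chip memory), and the
    stencil is then evaluated from that buffer only.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, n - 1, tile_rows):
        i1 = min(i0 + tile_rows, n - 1)  # exclusive end of the tile rows
        for j0 in range(1, m - 1, tile_cols):
            j1 = min(j0 + tile_cols, m - 1)  # exclusive end of the tile columns
            # Load tile plus halo: rows i0-1 .. i1, columns j0-1 .. j1.
            local = [grid[i][j0 - 1:j1 + 1] for i in range(i0 - 1, i1 + 1)]
            for li in range(1, i1 - i0 + 1):
                for lj in range(1, j1 - j0 + 1):
                    out[i0 + li - 1][j0 + lj - 1] = 0.2 * (
                        local[li][lj] + local[li - 1][lj] + local[li + 1][lj]
                        + local[li][lj - 1] + local[li][lj + 1])
    return out


if __name__ == "__main__":
    grid = [[float(i * 7 + j * 3) for j in range(10)] for i in range(9)]
    print(stencil_tiled(grid, 3, 4) == stencil_reference(grid))
```

Both functions perform the same arithmetic in the same order for every point, so their results agree exactly. The tiled version only changes which part of the grid is touched at a time, which is the property that tiling schemes exploit to reduce traffic to slow memory.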

Data availability

The authors confirm that the data supporting the findings of this study are available within the article.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2022ZD0117805), the National Natural Science Foundation of China (No. 62072018 and U22A2028), the Fundamental Research Funds for the Central Universities, and Iluvatar CoreX Semiconductor Co., Ltd. Hailong Yang is the corresponding author.

Author information

Corresponding author

Correspondence to Hailong Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest related to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, B., Li, M., Yang, H. et al. Adapting combined tiling to stencil optimizations on sunway processor. CCF Trans. HPC 5, 322–333 (2023). https://doi.org/10.1007/s42514-023-00147-x

  • DOI: https://doi.org/10.1007/s42514-023-00147-x
