Adapting combined tiling to stencil optimizations on sunway processor

Sun, Biao; Li, Mingzhen; Yang, Hailong; Xu, Jun; Luan, Zhongzhi; Qian, Depei

doi:10.1007/s42514-023-00147-x

Regular Paper
Published: 17 May 2023

Volume 5, pages 322–333 (2023)

CCF Transactions on High Performance Computing

Biao Sun¹,
Mingzhen Li¹,
Hailong Yang¹ (ORCID: orcid.org/0000-0003-1101-7927),
Jun Xu²,
Zhongzhi Luan¹ &
Depei Qian¹

Abstract

Stencil computation is an indispensable pattern in scientific applications and a long-standing optimization target in high performance computing (HPC). The Sunway processor adopted in the Sunway TaihuLight supercomputer has demonstrated its performance potential with a unique heterogeneous many-core architecture. Although a large number of optimization methods have been proposed, the memory-bound nature of stencil computation and the limited bandwidth of the Sunway processor make it challenging to run stencil computation efficiently on this platform. To better use the computation capability of the Sunway processor, we propose a combined tiling optimization for stencil computation tailored to its architectural features. In addition, we implement double buffering, vectorization, and register communication to further accelerate stencil computation on the Sunway processor. We evaluate our method on six stencil benchmarks with different orders and shapes (and thus different memory access patterns and computation intensities). The experimental results show that our implementation achieves a 1.97× speedup on average compared to the state-of-the-art stencil implementation on Sunway.

Data availability

The authors confirm that the data supporting the findings of this study are available within the article.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2022ZD0117805), the National Natural Science Foundation of China (No. 62072018 and U22A2028), the Fundamental Research Funds for the Central Universities, and Iluvatar CoreX semiconductor Co., Ltd. Hailong Yang is the corresponding author.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Biao Sun, Mingzhen Li, Hailong Yang, Zhongzhi Luan & Depei Qian
Science and Technology on Special System Simulation Laboratory Beijing Simulation Center, Beijing, 100854, China
Jun Xu

Corresponding author

Correspondence to Hailong Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest related to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, B., Li, M., Yang, H. et al. Adapting combined tiling to stencil optimizations on sunway processor. CCF Trans. HPC 5, 322–333 (2023). https://doi.org/10.1007/s42514-023-00147-x

Received: 05 March 2023
Accepted: 18 April 2023
Published: 17 May 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s42514-023-00147-x

Part of a collection:

Parallel System and Algorithm Optimization
