Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor

Cao, Hao; Guo, Shaozhong; Hao, Jiangwei; Xia, Yuanyuan; Xu, Jinchen

doi:10.1007/s11227-021-03997-w

Published: 10 September 2021

Volume 78, pages 4827–4849, (2022)

The Journal of Supercomputing

Hao Cao¹, Shaozhong Guo¹, Jiangwei Hao¹, Yuanyuan Xia¹ & Jinchen Xu¹ (ORCID: orcid.org/0000-0002-6275-2617)

Abstract

The SW26010 many-core processor is based on the Sunway architecture, which is composed of management processing elements (MPEs) and computing processing elements (CPEs), each of which is equipped with a stand-alone math library. Each version of the Sunway Math Library (SML) is written in assembly, which puts it beyond the reach of compilers that take high-level languages as input; existing optimization approaches therefore rely mainly on manual strategies, which are considered inefficient. In this paper, we leverage superblock scheduling, a well-known compilation technique, and present a tool named SMPOT to optimize the SML. SMPOT first builds a superblock using a novel tail duplication algorithm, then applies code motion restrictions to avoid compensation code, and then matches the machine model. Finally, it reorders instructions on the main path using an activation algorithm based on the available computing resources. Experimental results show that SMPOT effectively improves the performance of the SML. For MPE functions, main path performance improves by 10.61% on average and overall performance by 5.40%. For CPE functions, main path performance improves by 5.72% on average and overall performance by 2.98%.
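
For readers unfamiliar with superblock formation, the following is a minimal sketch of the classic tail duplication step, not the novel variant used by SMPOT, whose details the abstract does not give. It assumes a control-flow graph stored as a dictionary from block name to successor list and a main path (trace) given as a list of block names. The names `tail_duplicate`, `is_superblock`, and the `_dup` suffix are hypothetical and chosen only for illustration.

```python
def side_preds(cfg, trace, j):
    """Predecessors of trace[j] other than its trace predecessor trace[j-1]."""
    return [p for p, succs in cfg.items()
            if trace[j] in succs and p != trace[j - 1]]


def tail_duplicate(cfg, trace):
    """Return a new CFG in which `trace` has a single entry at trace[0].

    Classic scheme (an assumption, not the paper's algorithm): find the first
    trace block with a side entrance, copy the rest of the trace, and redirect
    every side entrance into the copies.
    """
    new_cfg = {b: list(s) for b, s in cfg.items()}
    first = next((j for j in range(1, len(trace))
                  if side_preds(cfg, trace, j)), None)
    if first is None:
        return new_cfg  # already a superblock

    tail = trace[first:]
    copy = {b: b + "_dup" for b in tail}

    # Each copy keeps the original successors, except that edges into the
    # duplicated tail now target the corresponding copies.
    for b in tail:
        new_cfg[copy[b]] = [copy.get(s, s) for s in cfg[b]]

    # Redirect side entrances so the original trace is entered only at its head.
    for j in range(first, len(trace)):
        for p in side_preds(cfg, trace, j):
            new_cfg[p] = [copy[trace[j]] if s == trace[j] else s
                          for s in new_cfg[p]]
    return new_cfg


def is_superblock(cfg, trace):
    """True if no trace block after the head has a side entrance."""
    return all(not side_preds(cfg, trace, j) for j in range(1, len(trace)))


# Diamond-shaped CFG: A branches to B (hot) or C (cold); both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
trace = ["A", "B", "D", "E"]  # hypothetical main path
result = tail_duplicate(cfg, trace)
```

In this example the cold block `C` is redirected to the copies `D_dup` and `E_dup`, so the main path `A, B, D, E` can be scheduled as one unit without compensation code for entries in the middle of the path. The price is larger code size, which is the usual trade-off of tail duplication.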

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments that helped improve the final paper.

Funding

This work was supported by the National Natural Science Foundation of China under the project "Precision analysis and optimization of basic mathematical functions on domestic processors" (No. 61802434).

Author information

Authors and Affiliations

State Key Laboratory of Mathematical Engineering and Advanced Computing, No. 62 Science Avenue, High‑Tech Zone, Zhengzhou, 450001, Henan, China
Hao Cao, Shaozhong Guo, Jiangwei Hao, Yuanyuan Xia & Jinchen Xu

Corresponding author

Correspondence to Jinchen Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cao, H., Guo, S., Hao, J. et al. Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor. J Supercomput 78, 4827–4849 (2022). https://doi.org/10.1007/s11227-021-03997-w

Accepted: 12 July 2021
Published: 10 September 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11227-021-03997-w
