Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor

The Journal of Supercomputing

Abstract

The SW26010 many-core processor is based on the Sunway architecture, which comprises management processing elements (MPEs) and computing processing elements (CPEs), each equipped with a stand-alone math library. Each version of the Sunway Math Library (SML) is written in assembly, which puts it beyond the reach of compilers that take high-level languages as input; existing optimization approaches therefore rely mainly on manual strategies, which are inefficient. In this paper, we apply superblock scheduling, a well-known compilation technique, and present a tool named SMPOT to optimize the SML. SMPOT first builds a superblock using a novel tail duplication algorithm, then applies code motion restrictions to avoid compensation code, and then matches the code against the machine model. Finally, it reorders the instructions on the main path using an activation algorithm driven by the available computing resources. Experimental results show that SMPOT improves the performance of the SML. For MPE functions, main-path performance improves by 10.61% on average and overall performance by 5.40%. For CPE functions, main-path performance improves by 5.72% on average and overall performance by 2.98%.
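
To make the superblock-formation step concrete, the following is a minimal Python sketch of textbook tail duplication on a toy control-flow graph. It is not SMPOT's novel algorithm, whose details are not given in this abstract. The function name `form_superblock`, the dictionary-based CFG encoding, and the block names are all assumptions chosen for illustration. The idea is that a main path (trace) becomes a superblock once it has a single entry, so every edge that enters the path below its head is redirected to a duplicated copy of the tail.

```python
def form_superblock(cfg, trace):
    """Remove side entrances from `trace` by duplicating its tail.

    cfg   : dict mapping every block name to a list of successor names
            (hypothetical encoding; every block must appear as a key).
    trace : list of block names forming the main path, head first.
    Returns a new CFG; copies of tail blocks are named with a trailing "'".
    """
    # Collect the predecessors of every block.
    preds = {b: set() for b in cfg}
    for block, succs in cfg.items():
        for s in succs:
            preds[s].add(block)

    # Find the first non-head trace block that is entered from anywhere
    # other than its predecessor on the trace (a side entrance).
    first = None
    for i in range(1, len(trace)):
        if preds[trace[i]] - {trace[i - 1]}:
            first = i
            break
    if first is None:  # already a single-entry path
        return {b: list(s) for b, s in cfg.items()}

    tail = trace[first:]
    copy_of = {b: b + "'" for b in tail}
    trace_edges = {(trace[i - 1], trace[i]) for i in range(1, len(trace))}

    new_cfg = {}
    # Original blocks: keep main-path edges, send every other edge that
    # targets the tail to the duplicated block instead.
    for block, succs in cfg.items():
        new_cfg[block] = [
            copy_of[s] if s in copy_of and (block, s) not in trace_edges else s
            for s in succs
        ]
    # Duplicated blocks: same successors, but stay inside the copy.
    for block in tail:
        new_cfg[copy_of[block]] = [copy_of.get(s, s) for s in cfg[block]]
    return new_cfg


# Toy diamond: A branches to B (hot) or C (cold); both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
trace = ["A", "B", "D", "E"]
result = form_superblock(cfg, trace)
```

In this toy example the cold block `C` is redirected to the copy `D'`, so the path `A, B, D, E` can only be entered at `A`. A scheduler may then move instructions along that path without inserting compensation code for side entrances, at the cost of duplicated code.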

Notes

  1. https://github.com/mathlib-cn/SMPOT.

  2. https://pypi.org/project/pydotplus.

  3. https://github.com/mathlib-cn/Floating-Numbers-Generator.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments that helped improve the final paper.

Funding

This work was supported by the National Natural Science Foundation of China: Precision analysis and optimization of basic mathematical functions on domestic processors (No. 61802434).

Author information

Corresponding author

Correspondence to Jinchen Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cao, H., Guo, S., Hao, J. et al. Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor. J Supercomput 78, 4827–4849 (2022). https://doi.org/10.1007/s11227-021-03997-w

  • DOI: https://doi.org/10.1007/s11227-021-03997-w
