Abstract
The SW26010 many-core processor is based on the Sunway architecture, which is composed of management processing elements (MPEs) and computing processing elements (CPEs), each of which is equipped with a stand-alone math library. Every version of the Sunway Math Library (SML) is written in assembly, which puts it beyond the reach of compilers that take high-level languages as input; existing optimization approaches therefore rely mainly on manual strategies, which are considered inefficient. In this paper, we leverage superblock scheduling, a well-known compilation technique, and present a tool named SMPOT to optimize the SML. SMPOT first builds a superblock using a novel tail duplication algorithm, then applies code motion restrictions to avoid compensation code, and then matches the machine model. Finally, it reorders instructions on the main path with an activation algorithm based on available computing resources. The experimental results show that SMPOT effectively improves the performance of the SML. For MPE functions, main path performance improves by 10.61% on average and overall performance by 5.40%. For CPE functions, main path performance improves by 5.72% on average and overall performance by 2.98%.
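The abstract does not spell out SMPOT's tail duplication algorithm, so the following is only a minimal sketch of the classic idea behind superblock formation, not the authors' method. It assumes a control-flow graph stored as a dictionary from block name to successor list and a main path (trace) given as a list of block names; the function names `predecessors`, `side_entrances`, and `tail_duplicate` are hypothetical and chosen for illustration. Starting at the first trace block that can be entered from off the main path, the tail of the trace is copied and every side entrance is redirected to the copy, so the main path keeps a single entry.

```python
def predecessors(cfg):
    """Map every block to the set of blocks that branch to it."""
    preds = {block: set() for block in cfg}
    for src, succs in cfg.items():
        for dst in succs:
            preds.setdefault(dst, set()).add(src)
    return preds


def side_entrances(cfg, trace):
    """List (pred, block) edges that enter the trace anywhere but its head."""
    preds = predecessors(cfg)
    found = []
    for i in range(1, len(trace)):
        for p in sorted(preds[trace[i]] - {trace[i - 1]}):
            found.append((p, trace[i]))
    return found


def tail_duplicate(cfg, trace):
    """Return a new CFG in which `trace` is only entered at its head.

    Illustrative assumption: block names are strings and a copy of
    block "C" is named "C'". The input CFG is left unmodified.
    """
    preds = predecessors(cfg)
    new_cfg = {block: list(succs) for block, succs in cfg.items()}

    # Locate the first trace block (after the head) with a side entrance.
    first = next(
        (i for i in range(1, len(trace))
         if preds[trace[i]] - {trace[i - 1]}),
        None,
    )
    if first is None:
        return new_cfg  # the trace is already a superblock

    tail = trace[first:]
    copy_of = {block: block + "'" for block in tail}

    # Copied blocks branch to copies of tail blocks; other exits are kept.
    for block in tail:
        new_cfg[copy_of[block]] = [copy_of.get(s, s) for s in cfg[block]]

    # Redirect every side entrance into the tail to the copied block.
    for i in range(first, len(trace)):
        block = trace[i]
        for p in preds[block] - {trace[i - 1]}:
            new_cfg[p] = [copy_of[block] if s == block else s
                          for s in new_cfg[p]]
    return new_cfg


if __name__ == "__main__":
    # A -> B -> C -> D is the main path; X joins it at C (a side entrance).
    example = {"A": ["B", "X"], "B": ["C"], "X": ["C"], "C": ["D"], "D": []}
    result = tail_duplicate(example, ["A", "B", "C", "D"])
    for name, succs in result.items():
        print(name, "->", succs)
```

In this example the off-path block X is rerouted to a duplicated tail (C' followed by D'), which is what lets a scheduler move instructions along the main path without accounting for control flow that joins it midway. The cost is code growth from the duplicated blocks.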















Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments that helped improve the final paper.
Funding
This work was supported by the National Natural Science Foundation of China: Precision analysis and optimization of basic mathematical functions on domestic processors (No. 61802434).
Cite this article
Cao, H., Guo, S., Hao, J. et al. Superblock-based performance optimization for Sunway Math Library on SW26010 many-core processor. J Supercomput 78, 4827–4849 (2022). https://doi.org/10.1007/s11227-021-03997-w
DOI: https://doi.org/10.1007/s11227-021-03997-w