Abstract
As fundamental software on high-performance computers, elementary functions have a significant impact on the performance of high-level applications. Built on a Chinese-designed manycore system consisting of processing cores and auxiliary cores, the Sunway TaihuLight supercomputer is considered one of the fastest supercomputers in the world and has ranked at the top of the TOP500 supercomputer list several times. The processing cores of the Sunway architecture are coupled through a shared memory strategy, which leads to high memory access latency and degrades the performance of elementary functions that perform many memory accesses. To address this issue, we propose a set of memory latency optimizations for the Sunway processing cores. First, we obtain a reduced data table while still guaranteeing accuracy by optimizing the underlying algorithms, grouping and mapping, removing error compensations, etc. Second, we move data from the global memory shared by all processing cores to the scratchpad memory of individual processing cores, significantly reducing the memory latency. Finally, we convert the memory accesses that cannot be localized, owing to the limited space of the scratchpad memory, into equivalent immediate loads and/or shift operators, further improving performance. In addition, we automate the algorithm by carefully selecting the most suitable data conversion approach and table-lookup algorithm, effectively mitigating the code explosion issue. We implement our method and evaluate the effectiveness of the optimizations by conducting experiments on the Sunway architecture. The experimental results show that exponential functions achieve performance improvements of 91% and 86.2% from the data movement and data conversion strategies, respectively.
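To make the table-lookup idea concrete, the following is a minimal Python sketch of a generic table-driven exponential. It is not the implementation evaluated in the article: the table size (32 entries), the degree-4 Taylor polynomial, and the names `EXP2_TABLE` and `table_exp` are all assumptions chosen for illustration. It only shows why a small table matters (it could fit in a per-core scratchpad) and how the table index can be extracted with shift and mask operations.

```python
import math

TABLE_BITS = 5                      # assumption: 2**5 = 32 table entries
N = 1 << TABLE_BITS
# Small lookup table of 2**(j/N); a table this size could live in local memory.
EXP2_TABLE = [2.0 ** (j / N) for j in range(N)]
LN2_OVER_N = math.log(2.0) / N


def table_exp(x: float) -> float:
    """Approximate exp(x) as 2**m * 2**(j/N) * exp(r) (illustrative only)."""
    k = round(x / LN2_OVER_N)       # x = k * ln2/N + r, with |r| <= ln2/(2N)
    r = x - k * LN2_OVER_N
    # Split k = m*N + j with 0 <= j < N using a shift and a mask.
    m, j = k >> TABLE_BITS, k & (N - 1)
    # Degree-4 Taylor polynomial for exp(r) on the reduced interval.
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0))))
    return math.ldexp(EXP2_TABLE[j] * p, m)
```

A larger table would shorten the polynomial but cost more memory; the sketch picks one point on that trade-off and makes no claim about the choice made in the article.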



















Cite this article
Zhou, B., Huang, Y., Xu, J. et al. Memory latency optimizations for the elementary functions on the Sunway architecture. J Supercomput 75, 3917–3944 (2019). https://doi.org/10.1007/s11227-018-02741-1
DOI: https://doi.org/10.1007/s11227-018-02741-1