Memory latency optimizations for the elementary functions on the Sunway architecture

The Journal of Supercomputing

Abstract

As fundamental software on high-performance computers, elementary functions have a significant impact on the performance of high-level applications. Built on a Chinese-designed manycore system consisting of processing cores and auxiliary cores, the Sunway TaihuLight supercomputer is considered one of the fastest supercomputers in the world, having ranked at the top of the TOP500 list several times. The processing cores of the Sunway architecture are coupled through a shared memory strategy, which leads to high memory access latency and degrades the performance of elementary functions that perform many memory accesses. To address this issue, we propose a set of memory latency optimizations for the Sunway processing cores. First, we obtain a reduced data table while preserving guaranteed accuracy by optimizing the underlying algorithms, grouping and mapping table entries, removing error compensations, and so on. Second, we move data from the global memory shared by all processing cores to the scratchpad memory of individual processing cores, significantly reducing memory latency. Finally, we convert the memory accesses that cannot be localized because of the limited scratchpad space into equivalent immediate loads and/or shift operations, further improving performance. In addition, we automate the process by carefully selecting the most suitable data conversion approach and table-lookup algorithm, effectively mitigating code explosion. We implement our method and evaluate the optimizations experimentally on the Sunway architecture. The experimental results show that exponential functions achieve performance improvements of 91% and 86.2% from the data movement and data conversion strategies, respectively.

Author information

Corresponding author

Correspondence to Yongzhong Huang.

Cite this article

Zhou, B., Huang, Y., Xu, J. et al. Memory latency optimizations for the elementary functions on the Sunway architecture. J Supercomput 75, 3917–3944 (2019). https://doi.org/10.1007/s11227-018-02741-1

  • DOI: https://doi.org/10.1007/s11227-018-02741-1
