Memory latency optimizations for the elementary functions on the Sunway architecture

Zhou, Bei; Huang, Yongzhong; Xu, Jinchen; Guo, Shaozhong; Qi, Hongyuan

doi:10.1007/s11227-018-02741-1

Published: 22 January 2019

Volume 75, pages 3917–3944, (2019)

The Journal of Supercomputing

Bei Zhou ORCID: orcid.org/0000-0003-1515-0602¹,
Yongzhong Huang²,
Jinchen Xu¹,
Shaozhong Guo¹ &
Hongyuan Qi¹

Abstract

As fundamental software of high-performance computers, elementary functions have a significant impact on the performance of high-level applications. Benefiting from a Chinese-designed manycore system consisting of processing cores and auxiliary cores, the Sunway TaihuLight supercomputer is considered one of the fastest supercomputers in the world and has ranked at the top of the TOP500 supercomputer list several times. The processing cores of the Sunway architecture are coupled using a shared memory strategy, which leads to high memory access latency and degrades the performance of the elementary functions, which perform a variety of memory accesses. To address this issue, we propose a set of memory latency optimizations for the Sunway processing cores. First, we obtain a reduced data table while preserving guaranteed accuracy by optimizing the underlying algorithms, grouping and mapping, removing error compensations, etc. Second, we move data from the global memory shared by all processing cores to the scratchpad memory of individual processing cores, significantly reducing memory latency. Finally, we convert the memory accesses that cannot be localized, because of the limited space of the scratchpad memory, into equivalent immediate loads and/or shift operations, further improving performance. In addition, we automate the algorithm by carefully selecting the most suitable data conversion approach and table-lookup algorithm, effectively mitigating the code explosion issue. We implement our method and evaluate the effectiveness of the optimizations through experiments on the Sunway architecture. The experimental results show that exponential functions achieve performance improvements of 91% and 86.2% from the data movement and data conversion strategies, respectively.
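The three ideas named in the abstract (a small lookup table, a local copy of that table, and table entries rewritten as constants in the code) can be illustrated with a minimal C sketch. This is not the authors' implementation and uses no Sunway-specific API: the 4-entry table of 2^(j/4), the function names, and the use of a stack array as a stand-in for scratchpad memory are all assumptions made for illustration. Whether a compiler actually emits immediate loads for the constants in the `switch` depends on the target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced table: 2^(j/4) for j = 0..3. It stands in for a
   table that would live in memory shared by all cores. */
static const double exp2_table_global[4] = {
    1.0,
    1.189207115002721,   /* 2^(1/4) */
    1.4142135623730951,  /* 2^(1/2) */
    1.681792830507429    /* 2^(3/4) */
};

/* Split n into n = 4*k + j with j in 0..3 (floor division, also for n < 0). */
static void split_index(int n, int *k, int *j) {
    *j = ((n % 4) + 4) % 4;
    *k = (n - *j) / 4;
}

/* Multiply m by 2^k using exact doublings or halvings. */
static double scale_pow2(double m, int k) {
    for (; k > 0; --k) m *= 2.0;
    for (; k < 0; ++k) m *= 0.5;
    return m;
}

/* Variant 1: 2^(n/4) with the fractional factor read from a table. */
double exp2q_table(const double *table, int n) {
    int k, j;
    split_index(n, &k, &j);
    return scale_pow2(table[j], k);
}

/* Variant 2: copy the table once into a local buffer (a stand-in for
   scratchpad memory), then serve every lookup from the local copy. */
void exp2q_batch_local(const int *n, double *out, size_t count) {
    double local[4];
    memcpy(local, exp2_table_global, sizeof local);
    for (size_t i = 0; i < count; ++i)
        out[i] = exp2q_table(local, n[i]);
}

/* Variant 3: no table at all; each entry is a constant in the code. */
double exp2q_immediate(int n) {
    int k, j;
    double m;
    split_index(n, &k, &j);
    switch (j) {
    case 0:  m = 1.0;                break;
    case 1:  m = 1.189207115002721;  break;
    case 2:  m = 1.4142135623730951; break;
    default: m = 1.681792830507429;  break;
    }
    return scale_pow2(m, k);
}
```

Calling `exp2q_immediate(4)` returns 2.0 and `exp2q_immediate(-4)` returns 0.5; all three variants return identical values for the same input because they share the same index split and scaling. The sketch says nothing about the relative speed of the variants, which on real hardware depends on the memory hierarchy and the code size of the constant-based version.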

Author information

Authors and Affiliations

State Key Laboratory of Mathematical Engineering and Advanced Computing, No. 62, Science Avenue, High-Tech Zone, Zhengzhou, 450001, China
Bei Zhou, Jinchen Xu, Shaozhong Guo & Hongyuan Qi
Guilin University of Electronic Technology, Guilin, 541004, China
Yongzhong Huang

Corresponding author

Correspondence to Yongzhong Huang.

About this article

Cite this article

Zhou, B., Huang, Y., Xu, J. et al. Memory latency optimizations for the elementary functions on the Sunway architecture. J Supercomput 75, 3917–3944 (2019). https://doi.org/10.1007/s11227-018-02741-1

Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s11227-018-02741-1
