Abstract
In existing large-scale clusters, the cost of communication (data movement) is significantly higher than the cost of computation, especially in clusters with long network latency. For high-frequency parallel iterative applications, the performance bottleneck is the network latency incurred by frequent data exchange. This paper presents an asynchronous algorithm that reduces the number of data exchanges among the processes of a parallel iterative application. The proposed algorithm was tested on a stencil-based parallel computation and compared with a bulk synchronous parallel (BSP) implementation of the same application. The asynchronous algorithm effectively reduces the number of data exchanges at the expense of higher computation overhead and larger message size, and improves performance by up to 2.8x.
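The trade-off named in the abstract (fewer exchanges in return for extra computation and larger messages) can be illustrated with a well-known communication-avoiding technique, the deep (wide) halo exchange. The sketch below is not the asynchronous algorithm of the paper, only a simplified stand-in for the same trade-off. The function names `step` and `run`, the periodic 1D grid, the 3-point averaging stencil, and the single-process simulation of ranks (no MPI) are all assumptions made for illustration.

```python
def step(cells):
    """One 3-point averaging sweep; the output is two cells shorter than the input."""
    return [(cells[j - 1] + cells[j] + cells[j + 1]) / 3.0
            for j in range(1, len(cells) - 1)]


def run(u, num_procs, iterations, halo):
    """Simulate `num_procs` ranks on a periodic 1D grid (hypothetical setup).

    Each rank fetches `halo` ghost cells per side, then advances `halo`
    sweeps locally before the next exchange. halo=1 is the BSP-style
    pattern with one exchange per iteration.
    Returns (final grid, exchange rounds, ghost cells received per rank
    per round, total cell updates performed).
    """
    n = len(u)
    assert n % num_procs == 0 and iterations % halo == 0
    size = n // num_procs
    exchanges = 0
    work = 0
    for _round in range(iterations // halo):
        exchanges += 1  # one round of neighbour messages
        new_u = []
        for p in range(num_procs):
            start = p * size
            # Owned cells plus `halo` ghost cells on each side (periodic wrap).
            local = [u[(start - halo + j) % n] for j in range(size + 2 * halo)]
            for _sweep in range(halo):
                local = step(local)  # valid region shrinks by one cell per side
                work += len(local)   # includes redundant ghost-zone updates
            new_u.extend(local)      # exactly `size` owned cells remain
        u = new_u
    return u, exchanges, 2 * halo, work


grid = [float(i % 7) for i in range(64)]
bsp, bsp_exchanges, bsp_msg, bsp_work = run(grid, 4, 12, halo=1)
deep, deep_exchanges, deep_msg, deep_work = run(grid, 4, 12, halo=4)
print(bsp_exchanges, deep_exchanges, bsp_msg, deep_msg, bsp_work, deep_work)
```

In this toy setting both runs produce the same grid, while the deep-halo run uses fewer exchange rounds, a larger message per round, and more cell updates, which mirrors the direction of the trade-off reported in the abstract. The 2.8x figure comes from the paper's own experiments and is not reproduced by this sketch.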
Supported by the National Key R&D Program of China (2017YFB0202001) and the National Natural Science Foundation of China (61432018, 61672208).
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tian, Z., Chen, Y., Zhang, L. (2020). An Asynchronous Algorithm to Reduce the Number of Data Exchanges. In: Wen, S., Zomaya, A., Yang, L.T. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2019. Lecture Notes in Computer Science, vol 11945. Springer, Cham. https://doi.org/10.1007/978-3-030-38961-1_15
DOI: https://doi.org/10.1007/978-3-030-38961-1_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38960-4
Online ISBN: 978-3-030-38961-1
eBook Packages: Computer Science (R0)