Trade-offs between computation, communication, and synchronization in stencil-collective alternate update

Xiao, Junmin; Peng, Jian

doi:10.1007/s42514-019-00011-x

Regular Paper
Published: 26 July 2019

Volume 1, pages 144–160 (2019)

CCF Transactions on High Performance Computing

Abstract

On a computing platform composed of several homogeneous processors, any parallel schedule of an algorithm usually incurs three basic costs: arithmetic throughput on each processor, data movement between processors, and synchronization latency among processors. The trade-offs between these three costs realistically reflect lower bounds on the execution time of an algorithm. Trade-off analysis is therefore important for evaluating the optimality of a proposed schedule, and it often yields new insights into parallel optimization. In this paper, we focus on the trade-offs between computation, communication, and synchronization in the stencil-collective alternate update, which is executed repeatedly in the multi-stage workflows of many numerical methods, such as the conjugate gradient (CG) method and the nonlinear time integration method in the dynamical core of a global atmospheric general circulation model (AGCM). First, to formalize a workflow with multiple different stages, we propose a novel operator representation of parallel algorithms. Based on this operator representation, we find the minimum vertex separator of the dependency graph for a stencil-collective alternate update, which makes it possible to derive lower bounds on the costs. Next, we establish a general trade-off theory for the stencil-collective alternate update, which extends the recent trade-off theory to a more general theoretical context. Finally, by applying the general result to the CG method and the nonlinear time integration method in AGCM, we obtain the corresponding lower bounds on computational cost, communication throughput, and synchronization latency. The general theory can also be used to analyze other complex numerical methods in real-world applications.
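To make the pattern concrete, the following is a minimal serial sketch of a stencil-collective alternate update, using the CG method named in the abstract. It is not the authors' formulation: the 1D three-point Laplacian stencil, the function names `stencil_apply`, `collective_dot`, and `cg`, and the tolerance are all assumptions chosen for illustration. Each iteration alternates one stencil stage (a neighbor-only update) with collective stages (global inner products, which would typically become all-reduce operations in a distributed run).

```python
# Illustrative sketch only: a serial CG loop whose iterations alternate a
# stencil stage with collective (global reduction) stages. The stencil and
# all names here are hypothetical, not taken from the paper.

def stencil_apply(x):
    """Apply the 1D three-point stencil y[i] = 2*x[i] - x[i-1] - x[i+1],
    treating out-of-range neighbors as zero."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def collective_dot(u, v):
    """Global inner product. In a distributed setting this is where a
    collective such as an all-reduce would typically occur."""
    return sum(a * b for a, b in zip(u, v))


def cg(b, tol=1e-10, max_iter=1000):
    """Solve A x = b, where A is the operator defined by stencil_apply."""
    x = [0.0] * len(b)
    r = list(b)                      # residual for the initial guess x = 0
    p = list(r)                      # initial search direction
    rr = collective_dot(r, r)        # collective stage
    iterations = 0
    while iterations < max_iter and rr > tol * tol:
        q = stencil_apply(p)                     # stencil stage
        alpha = rr / collective_dot(p, q)        # collective stage
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = collective_dot(r, r)            # collective stage
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations += 1
    return x, iterations
```

In a parallel version of this sketch, the stencil stage would exchange only boundary values between neighboring processors, while each inner product would involve all processors. That contrast is the reason the two kinds of stage contribute differently to communication and synchronization costs.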

Acknowledgements

This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200800 and the National Natural Science Foundation of China under Grant No. 61802369.

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Junmin Xiao
University of Chinese Academy of Sciences, Beijing, China
Junmin Xiao
Chongqing University, Chongqing, China
Jian Peng

Corresponding author

Correspondence to Junmin Xiao.

About this article

Cite this article

Xiao, J., Peng, J. Trade-offs between computation, communication, and synchronization in stencil-collective alternate update. CCF Trans. HPC 1, 144–160 (2019). https://doi.org/10.1007/s42514-019-00011-x

Received: 30 March 2019
Accepted: 17 July 2019
Published: 26 July 2019
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s42514-019-00011-x
