
Trade-offs between computation, communication, and synchronization in stencil-collective alternate update

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

On a computing platform composed of several homogeneous processors, any parallel schedule of an algorithm usually incurs three basic costs: arithmetic throughput on each processor, data movement between processors, and synchronization latency among processors. The trade-offs among these three costs yield realistic lower bounds on an algorithm's execution time. Trade-off analysis is therefore important for evaluating the optimality of a proposed schedule, and it often yields new insights into parallel optimization. In this paper, we focus on the trade-offs between computation, communication, and synchronization in the stencil-collective alternate update, which the multi-stage workflows of many numerical methods execute repeatedly. Examples include the conjugate gradient (CG) method and the nonlinear time integration method in the dynamical core of a global atmospheric general circulation model (AGCM). First, to formalize a workflow with multiple different stages, we propose a novel operator representation of parallel algorithms. Based on this representation, we find the minimum vertex separator of the dependency graph for a stencil-collective alternate update, which makes it possible to derive the cost lower bounds. Next, we establish a general trade-off theory for the stencil-collective alternate update, extending the recent trade-off theory to a more general theoretical context. Finally, by applying the general result to the CG method and the nonlinear time integration method in AGCM, we obtain the corresponding lower bounds on computational cost, communication throughput, and synchronization latency. The general theory can also be used to analyze other complex numerical methods in real-world applications.
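As a rough illustration of the pattern the abstract refers to, the sketch below alternates a nearest-neighbour stencil step with a global reduction on a single process. It is a minimal serial sketch, not the paper's formulation: the function names, the 3-point averaging stencil, the periodic boundary, and the normalization by a global sum are all assumptions chosen for brevity. In a distributed setting the stencil step would need only neighbour-to-neighbour communication, while the reduction would need a collective over all processors.

```python
# Hypothetical illustration of a stencil-collective alternate update.
# All names and the choice of stencil are assumptions, not taken from the paper.

def stencil_step(u):
    """Apply a 3-point averaging stencil with periodic boundaries.

    Each output entry depends only on its immediate neighbours, so a
    distributed version would exchange only boundary (halo) values.
    """
    n = len(u)
    return [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]


def collective_step(u):
    """Global reduction: the result depends on every entry of u.

    A distributed version would use a collective such as an all-reduce.
    """
    return sum(u)


def alternate_update(u, steps):
    """Alternate a stencil update with a collective-based normalization."""
    for _ in range(steps):
        u = stencil_step(u)          # local, neighbour-only dependencies
        total = collective_step(u)   # global dependency on all entries
        u = [x / total for x in u]   # every entry now depends on the collective
    return u


if __name__ == "__main__":
    print(alternate_update([1.0, 2.0, 3.0, 4.0], steps=3))
```

Because each normalization depends on the reduction of the whole vector, no entry of the next stencil step can be computed before the collective completes. This is the kind of dependency that forces synchronization between stages.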

Acknowledgements

The work is supported by the National Key Research and Development Program of China under Grant no. 2016YFB0200800 and National Natural Science Foundation of China under Grant no. 61802369.

Author information

Corresponding author

Correspondence to Junmin Xiao.

Cite this article

Xiao, J., Peng, J. Trade-offs between computation, communication, and synchronization in stencil-collective alternate update. CCF Trans. HPC 1, 144–160 (2019). https://doi.org/10.1007/s42514-019-00011-x

  • DOI: https://doi.org/10.1007/s42514-019-00011-x
