Abstract
On a computing platform composed of several homogeneous processors, any parallel schedule of an algorithm incurs three basic costs: arithmetic on each processor, data movement between processors, and synchronization latency among processors. The trade-offs between these three costs yield realistic lower bounds on the execution time of an algorithm. Trade-off analysis is therefore important for evaluating the optimality of a proposed schedule, and it often yields new insights for parallel optimization. In this paper, we focus on the trade-offs between computation, communication, and synchronization in the stencil-collective alternate update, a pattern that is executed repeatedly in the multi-stage workflows of many numerical methods, such as the conjugate gradient (CG) method and the nonlinear time integration method in the dynamical core of a global atmospheric general circulation model (AGCM). First, to formalize a workflow with multiple different stages, we propose a novel operator representation of parallel algorithms. Based on this representation, we find the minimum vertex separator of the dependency graph of a stencil-collective alternate update, which allows us to derive the cost lower bounds. Next, we establish a general trade-off theory for the stencil-collective alternate update, which extends the recent trade-off theory to a more general theoretical context. Finally, by applying the general result to two algorithms, namely the CG method and the nonlinear time integration method in AGCM, we obtain the corresponding lower bounds on computation, communication, and synchronization costs. The general theory can also be used to analyze other complex numerical methods in real-world applications.
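As a concrete picture of the stencil-collective alternate update, the following is a minimal serial sketch of CG on a 1D Laplacian. It is an illustration under stated assumptions, not the paper's formulation: the 3-point stencil, the function names (`apply_stencil`, `dot`, `conjugate_gradient`), and the `counts` bookkeeping are all hypothetical choices made for this example. In a distributed run, the stencil step would need only neighbor (halo) exchange, while each dot product would be a global reduction such as an all-reduce.

```python
def apply_stencil(x):
    """Stencil step: 3-point 1D Laplacian with zero boundaries (assumed operator).

    y[i] = 2*x[i] - x[i-1] - x[i+1]; each output depends only on neighbors.
    """
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y


def dot(u, v):
    """Collective step: a global reduction over all entries."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(b, tol=1e-10, max_iter=1000):
    """Solve A x = b, where A is the stencil operator above.

    Returns the solution and a count of stencil and collective steps,
    which makes the alternation between the two kinds of step visible.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual for the initial guess x = 0
    p = list(r)          # initial search direction
    rs_old = dot(r, r)   # collective
    counts = {"stencil": 0, "collective": 1}
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = apply_stencil(p)          # stencil
        counts["stencil"] += 1
        alpha = rs_old / dot(p, Ap)    # collective
        counts["collective"] += 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)             # collective
        counts["collective"] += 1
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, counts
```

Each iteration performs one stencil application followed by two reductions, so local neighbor-dependent work and global synchronization points strictly alternate. That alternation is the structure whose computation, communication, and synchronization costs the trade-off analysis bounds.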

Acknowledgements
This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200800 and the National Natural Science Foundation of China under Grant No. 61802369.
Cite this article
Xiao, J., Peng, J. Trade-offs between computation, communication, and synchronization in stencil-collective alternate update. CCF Trans. HPC 1, 144–160 (2019). https://doi.org/10.1007/s42514-019-00011-x
DOI: https://doi.org/10.1007/s42514-019-00011-x