Abstract
MPI communication optimization is essential for high-performance applications. Communication performance models have proven useful for improving the efficiency of collective algorithms and for optimizing communication scheduling. Instead of modeling communication with hardware-related parameters such as bandwidth and latency, recent studies have focused on software models, which simplify modeling by representing a transmission as a sequence of implicit transfers. As a state-of-the-art software model, \(\tau \)-Lop adopts the concept of concurrent transfers to model communication on multiple platforms. However, \(\tau \)-Lop models only the cost of the system as a whole, not that of a single MPI primitive, which makes it difficult to apply to systems in which processes incur different costs. As the demand for high-precision modeling of concurrent communication grows, we extend \(\tau \)-Lop to model individual MPI primitives, handling this situation and covering further cases such as asynchronous communication. Modeling accuracy improves once factors such as concurrent transfers, waiting time, the communication ends, channels, and protocols are taken into account. In tests of point-to-point and concurrent communication, the relative error of our model is below 40%, and in most cases its accuracy is more than twice that of the original \(\tau \)-Lop model, which means our work can be used for practical optimization.
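To make the transfer-based view concrete, the following is a minimal sketch of the kind of cost expression \(\tau \)-Lop uses for a blocking point-to-point transmission through shared memory, following the \(\tau \)-Lop literature; the exact notation and the additional terms introduced by our extension may differ:

\[ T_{p2p}(m) = o(m) + 2\,L(m,\tau ), \]

where \(o(m)\) is the software overhead of the primitive, \(L(m,\tau )\) is the cost of one transfer of an \(m\)-byte message when \(\tau \) transfers progress concurrently through the channel, and the factor 2 reflects the copy-in and copy-out through an intermediate shared buffer.

The point-to-point measurements against which such a model is validated are typically obtained with a ping-pong micro-benchmark. The sketch below is illustrative only: the message size, iteration count, and output format are assumptions, not the benchmark configuration used in the paper.

/* Minimal ping-pong sketch for measuring blocking point-to-point cost.
   Illustrative only; run with exactly 2 MPI ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int m = 1 << 20;            /* message size in bytes (assumed) */
    const int iters = 1000;           /* repetitions (assumed) */
    int rank;
    char *buf = malloc(m);
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);      /* align both ranks before timing */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {              /* rank 0: send, then wait for the echo */
            MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {       /* rank 1: receive, then echo back */
            MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)                    /* half a round trip = one-way cost */
        printf("avg one-way time: %.3f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Fitting an expression such as \(T_{p2p}(m)\) against measurements of this kind over a range of message sizes is the usual way the channel parameters of a transfer-based model are estimated.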
Acknowledgments
This work was supported by the National Key Research and Development Program of China under Grant number 2016YFB0200902.
Cite this article
Wang, Z., Chen, H., Dong, X. et al. Extending \(\tau \)-Lop to model MPI blocking primitives on shared memory. J Supercomput 78, 12046–12069 (2022). https://doi.org/10.1007/s11227-022-04352-3