
Extending \(\tau \)-Lop to model MPI blocking primitives on shared memory

The Journal of Supercomputing

Abstract

MPI communication optimization is essential for high-performance applications. Communication performance models have proven useful for improving the efficiency of collective algorithms and for optimizing communication scheduling. Rather than modeling communication with hardware-related parameters such as bandwidth and latency, recent studies have favored software models, which simplify modeling by representing a transmission as a sequence of implicit transfers. As a state-of-the-art software model, \(\tau \)-Lop uses the concept of concurrent transfers to model communication on multiple platforms. However, \(\tau \)-Lop models the cost of a communication pattern as a whole rather than the cost of individual MPI primitives, which makes it difficult to apply in systems where processes incur different costs. To meet the growing demand for high-precision modeling of concurrent communication, we extend \(\tau \)-Lop to model individual MPI primitives, which handles this situation and also covers additional cases such as asynchronous communication. Modeling accuracy improves once factors such as concurrent transmission, waiting time, communication endpoints, channels, and protocols are taken into account. In tests of point-to-point and concurrent communication, the relative error of our model stays below 40%, and in most cases its accuracy is more than 100% higher than that of the original \(\tau \)-Lop model, which means our work can be used for practical optimization.
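For readers unfamiliar with the notation, the following is a minimal sketch of the \(\tau \)-Lop cost expressions the abstract alludes to, as presented in the \(\tau \)-Lop literature; the exact symbols, and the extended per-primitive model itself, are given in the full text. A blocking shared-memory transmission of \(m\) bytes is modeled as a protocol overhead plus a sequence of transfers (two, when the message passes through an intermediate buffer), and \(\tau \) transfers progressing concurrently through a channel \(c\) are captured by the concurrency operator:

\[
T^{c}_{p2p}(m) \;=\; o^{c}(m) + 2\,L^{c}(m,1),
\qquad
\tau \parallel L^{c}(m,1) \;=\; L^{c}(m,\tau).
\]

As a companion, here is a minimal measurement sketch: a ping-pong that estimates the blocking point-to-point cost \(T(m)\) between two ranks placed on the same node. The repetition count and the eager/rendezvous threshold mentioned in the comments are illustrative assumptions, not parameters taken from the paper.

  /* Minimal MPI ping-pong sketch for estimating the blocking
   * point-to-point cost T(m) on shared memory.
   * Build:  mpicc pingpong.c -o pingpong
   * Run (both ranks on one node):  mpirun -n 2 ./pingpong 1024
   */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define REPS 1000  /* illustrative repetition count */

  int main(int argc, char **argv) {
      int rank;
      /* Message size in bytes. Small messages typically take the
       * eager protocol, large ones rendezvous; the switch point is
       * library-specific (an assumption here, not a paper value). */
      size_t m = (argc > 1) ? (size_t)atol(argv[1]) : 1024;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      char *buf = malloc(m);
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();

      for (int i = 0; i < REPS; i++) {
          if (rank == 0) {
              MPI_Send(buf, (int)m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, (int)m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, (int)m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, (int)m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }

      double t1 = MPI_Wtime();
      /* Half the round-trip time approximates one blocking send/recv. */
      if (rank == 0)
          printf("m = %zu bytes, T(m) ~= %.3f us\n",
                 m, (t1 - t0) / (2.0 * REPS) * 1e6);

      free(buf);
      MPI_Finalize();
      return 0;
  }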



Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant number 2016YFB0200902.

Author information

Corresponding author

Correspondence to Heng Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Z., Chen, H., Dong, X. et al. Extending \(\tau \)-Lop to model MPI blocking primitives on shared memory. J Supercomput 78, 12046–12069 (2022). https://doi.org/10.1007/s11227-022-04352-3
