
Extending \(\tau \)-Lop to model MPI blocking primitives on shared memory

The Journal of Supercomputing

Abstract

MPI communication optimization is essential for high-performance applications. Communication performance models have proven useful for improving the efficiency of collective algorithms and for optimizing communication scheduling. Rather than modeling communication with hardware-related parameters such as bandwidth and latency, recent studies have favored software models, which simplify modeling by representing a transmission as a sequence of implicit transfers. As a state-of-the-art software model, \(\tau \)-Lop uses the concept of concurrent transfers to model communication on multiple platforms. However, \(\tau \)-Lop models the cost of a communication pattern as a whole rather than the cost of individual MPI primitives, which makes it difficult to apply in systems where processes incur different costs. To meet the growing demand for high-precision modeling of concurrent communication, we extend \(\tau \)-Lop to model individual MPI primitives, which handles this situation and also covers additional cases such as asynchronous communication. Modeling accuracy improves once factors such as concurrent transmission, waiting time, communication endpoints, channels, and protocols are taken into account. In tests of point-to-point and concurrent communication, the relative error of our model stays below 40%, and in most cases its accuracy is more than 100% higher than that of the original \(\tau \)-Lop model, which means our work can be used for practical optimization.
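For readers unfamiliar with the notation, the following is a minimal sketch of the \(\tau \)-Lop cost expressions the abstract alludes to, as presented in the \(\tau \)-Lop literature; the exact symbols, and the extended per-primitive model itself, are given in the full text. A blocking shared-memory transmission of \(m\) bytes is modeled as a protocol overhead plus a sequence of transfers (two, when the message passes through an intermediate buffer), and \(\tau \) transfers progressing concurrently through a channel \(c\) are captured by the concurrency operator:

\[
T^{c}_{p2p}(m) \;=\; o^{c}(m) + 2\,L^{c}(m,1),
\qquad
\tau \parallel L^{c}(m,1) \;=\; L^{c}(m,\tau).
\]

As a companion, here is a minimal measurement sketch: a ping-pong that estimates the blocking point-to-point cost \(T(m)\) between two ranks placed on the same node. The repetition count and the eager/rendezvous threshold mentioned in the comments are illustrative assumptions, not parameters taken from the paper.

  /* Minimal MPI ping-pong sketch for estimating the blocking
   * point-to-point cost T(m) on shared memory.
   * Build:  mpicc pingpong.c -o pingpong
   * Run (both ranks on one node):  mpirun -n 2 ./pingpong 1024
   */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define REPS 1000  /* illustrative repetition count */

  int main(int argc, char **argv) {
      int rank;
      /* Message size in bytes. Small messages typically take the
       * eager protocol, large ones rendezvous; the switch point is
       * library-specific (an assumption here, not a paper value). */
      size_t m = (argc > 1) ? (size_t)atol(argv[1]) : 1024;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      char *buf = malloc(m);
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();

      for (int i = 0; i < REPS; i++) {
          if (rank == 0) {
              MPI_Send(buf, (int)m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, (int)m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, (int)m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, (int)m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }

      double t1 = MPI_Wtime();
      /* Half the round-trip time approximates one blocking send/recv. */
      if (rank == 0)
          printf("m = %zu bytes, T(m) ~= %.3f us\n",
                 m, (t1 - t0) / (2.0 * REPS) * 1e6);

      free(buf);
      MPI_Finalize();
      return 0;
  }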



Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant number 2016YFB0200902.

Author information

Corresponding author

Correspondence to Heng Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Z., Chen, H., Dong, X. et al. Extending \(\tau \)-Lop to model MPI blocking primitives on shared memory. J Supercomput 78, 12046–12069 (2022). https://doi.org/10.1007/s11227-022-04352-3
