Abstract
The MPI 4.0 standard introduces a chapter on partitioned point-to-point communication operations. Partitioned operations allow multiple actors within one MPI process (e.g., multiple threads) to contribute data to a single communication operation. They are designed to mitigate current problems in multithreaded MPI programs, with prior work reporting a substantial performance benefit (up to 26%) compared to their existing non-blocking counterparts.
In this work, we explore whether the compiler can automatically partition sending operations across multiple OpenMP threads. To this end, we developed an LLVM compiler pass that partitions MPI sending operations across the iterations of OpenMP for loops. We demonstrate the feasibility of this approach by applying it to 2D stencil codes, observing very little overhead while preserving the correctness of the codes. This approach thus facilitates the adoption of these new additions to the MPI standard in existing codes.
Our code is available on GitHub: https://github.com/tudasc/CommPart.
Notes
- 1. Meaning that modification of this partition of the sending operation is forbidden until the sending operation has completed locally.
- 2. Note that this only illustrates the transformation; as the transformation happens on the LLVM IR, no source code is actually output.
- 3. For example, a type created by MPI_Type_contiguous.
- 4. This is a valid implementation according to the MPI standard.
- 5. For a receive operation, however, both reading and writing are forbidden.
- 6. False positives in the MPI implementation or the application itself are not filtered.
- 7. On the Lichtenberg cluster equipped with Intel Xeon Platinum 9242 CPUs, the execution of the unaltered version compiled with Clang 11.1 took 614 s on average, while the automatically partitioned version took 619 s.
References
Ahmed, H., Skjellum, A., Bangalore, P., Pirkelbauer, P.: Transforming blocking MPI collectives to non-blocking and persistent operations. In: Proceedings of the 24th European MPI Users’ Group Meeting, pp. 1–11 (2017)
Danalis, A., Pollock, L., Swany, M.: Automatic MPI application transformation with ASPhALT. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–8. IEEE (2007)
Danalis, A., Pollock, L., Swany, M., Cavazos, J.: MPI-aware compiler optimizations for improving communication-computation overlap. In: Proceedings of the 23rd International Conference on Supercomputing, pp. 316–325 (2009)
Grant, R., Skjellum, A., Bangalore, P.V.: Lightweight threading with MPI using Persistent Communications Semantics. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States) (2015)
Grant, R.E., Dosanjh, M.G.F., Levenhagen, M.J., Brightwell, R., Skjellum, A.: Finepoints: partitioned multithreaded MPI communication. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) ISC High Performance 2019. LNCS, vol. 11501, pp. 330–350. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20656-7_17
Guo, J., Yi, Q., Meng, J., Zhang, J., Balaji, P.: Compiler-assisted overlapping of communication and computation in MPI applications. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 60–69. IEEE (2016)
Jammer, T., Iwainsky, C., Bischof, C.: Automatic detection of MPI assertions. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 34–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_3
Laguna, I., Marshall, R., Mohror, K., Ruefenacht, M., Skjellum, A., Sultana, N.: A large-scale study of MPI usage in open-source HPC applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. ACM (2019). https://doi.org/10.1145/3295500.3356176
Message Passing Interface Forum: MPI: A Message-Passing Interface Standard Version 4.0 (2021). https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf
Nguyen, V.M., Saillard, E., Jaeger, J., Barthou, D., Carribault, P.: Automatic code motion to extend MPI nonblocking overlap window. In: Jagode, H., Anzt, H., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12321, pp. 43–54. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59851-8_4
Schonbein, W., Dosanjh, M.G.F., Grant, R.E., Bridges, P.G.: Measuring multithreaded message matching misery. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par 2018. LNCS, vol. 11014, pp. 480–491. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-96983-1_34
Seward, J., et al.: Memcheck: a memory error detector (2020). https://valgrind.org/docs/manual/mc-manual.html
Squar, J., Jammer, T., Blesel, M., Kuhn, M., Ludwig, T.: Compiler assisted source transformation of OpenMP kernels. In: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 44–51 (2020). https://doi.org/10.1109/ISPDC51135.2020.00016
Acknowledgements
We especially want to thank Dr. Christian Iwainsky (TU Darmstadt) for fruitful discussions. This work was supported by the Hessian Ministry for Higher Education, Research and the Arts through the Hessian Competence Center for High-Performance Computing. Measurements for this work were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. Some of the code analyzing the OpenMP parallel regions originated from CATO [13] (https://github.com/JSquar/cato).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Jammer, T., Bischof, C. (2021). Automatic Partitioning of MPI Operations in MPI+OpenMP Applications. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science(), vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2