Vectorized Barrier and Reduction in LLVM OpenMP Runtime

Farooqi, Muhammad Nufail; Pericàs, Miquel

doi:10.1007/978-3-030-85262-7_2

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12870)

Included in the following conference series:

International Workshop on OpenMP

Abstract

Barrier synchronization is a well-known operation in parallel processing that can limit the performance of parallel programs, particularly at high thread counts. Similarly, reduction is a collective communication pattern that is frequently used in parallel applications and must be optimized for applications to achieve their best performance. With the introduction of multi-core and many-core processors, several new barrier and reduction implementations have been proposed. As the number of cores per node continues to grow, implementations of these primitives need to be revisited and adapted for upcoming architectures. We see an opportunity to improve synchronization by exploiting the vector units present in modern and future CPU designs based on vector ISAs such as ARM’s Scalable Vector Extension and the RISC-V Vector extension. In this work we propose vectorized barriers and reductions using the vector-length-agnostic paradigm and implement them in the LLVM OpenMP runtime. Our barrier implementation achieves up to 2.2× and 1.4× speedup over the default LLVM OpenMP implementation on Intel KNL and Fujitsu A64FX, respectively.
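The abstract does not describe the implementation itself, so the following is only a minimal sketch of the general idea of a barrier whose arrival flags are packed contiguously so that one thread can check many of them at once. It is not the authors' algorithm and not the LLVM OpenMP runtime's API: the names `all_arrived`, `PackedFlagBarrier`, and `run_demo` are hypothetical, and a plain scalar loop over one-byte flags stands in for the vector load and compare that an SVE or RISC-V Vector implementation would presumably use. The loop makes no assumption about the vector length, which is the property the vector-length-agnostic paradigm relies on.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: true when all n one-byte flags equal `epoch`.
// The branch-free OR-reduction is a scalar stand-in for a vector
// load + compare; it is written without any fixed vector width.
inline bool all_arrived(const std::atomic<std::uint8_t>* flags,
                        std::size_t n, std::uint8_t epoch) {
    std::uint8_t mismatch = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t f = flags[i].load(std::memory_order_acquire);
        mismatch = static_cast<std::uint8_t>(mismatch | (f ^ epoch));
    }
    return mismatch == 0;
}

// Hypothetical centralized barrier: each thread owns one byte in a packed
// array, thread 0 scans the whole array, then publishes the new epoch.
class PackedFlagBarrier {
public:
    explicit PackedFlagBarrier(std::size_t n) : flags_(n), release_(0) {
        for (auto& f : flags_) f.store(0, std::memory_order_relaxed);
    }

    void wait(std::size_t tid) {
        // release_ cannot change until this thread has arrived, so reading
        // it on entry gives every thread the same epoch for this round.
        const std::uint8_t epoch = static_cast<std::uint8_t>(
            release_.load(std::memory_order_acquire) + 1);
        flags_[tid].store(epoch, std::memory_order_release);  // arrive
        if (tid == 0) {
            while (!all_arrived(flags_.data(), flags_.size(), epoch))
                std::this_thread::yield();
            release_.store(epoch, std::memory_order_release);  // release all
        } else {
            while (release_.load(std::memory_order_acquire) != epoch)
                std::this_thread::yield();
        }
    }

private:
    std::vector<std::atomic<std::uint8_t>> flags_;
    std::atomic<std::uint8_t> release_;
};

// Runs `threads` workers through `rounds` barriers and returns true if,
// after each barrier, every worker had already reached that round.
inline bool run_demo(std::size_t threads, int rounds) {
    PackedFlagBarrier barrier(threads);
    std::vector<std::atomic<int>> progress(threads);
    for (auto& p : progress) p.store(0, std::memory_order_relaxed);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int r = 1; r <= rounds; ++r) {
                progress[t].store(r, std::memory_order_relaxed);
                barrier.wait(t);
                for (auto& p : progress)
                    if (p.load(std::memory_order_relaxed) < r) ok.store(false);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

Calling `run_demo(4, 300)` exercises the barrier with four threads for more rounds than the one-byte epoch can count, so the wrap-around of the epoch is covered as well. One byte per thread is a design choice of this sketch: it keeps the flags dense enough that a single vector register could cover many threads, at the cost of false sharing between arriving threads.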

Acknowledgements

This work was carried out as part of the European Processor Initiative project. The European Processor Initiative (EPI) (FPA: 800928) has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at HPC2N, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. We thank the Barcelona Supercomputing Center (BSC-CNS) for their support and for providing access to the CTE-ARM cluster.

Author information

Authors and Affiliations

Chalmers University of Technology, Gothenburg, Sweden
Muhammad Nufail Farooqi & Miquel Pericàs

Authors

Muhammad Nufail Farooqi
Miquel Pericàs

Corresponding author

Correspondence to Muhammad Nufail Farooqi.

Editor information

Editors and Affiliations

University of Bristol, Bristol, UK
Simon McIntosh-Smith
Lawrence Livermore National Laboratory, Livermore, CA, USA
Bronis R. de Supinski
RWTH Aachen University, Aachen, Germany
Jannis Klinkenberg

About this paper

Cite this paper

Farooqi, M.N., Pericàs, M. (2021). Vectorized Barrier and Reduction in LLVM OpenMP Runtime. In: McIntosh-Smith, S., de Supinski, B.R., Klinkenberg, J. (eds) OpenMP: Enabling Massive Node-Level Parallelism. IWOMP 2021. Lecture Notes in Computer Science, vol 12870. Springer, Cham. https://doi.org/10.1007/978-3-030-85262-7_2

DOI: https://doi.org/10.1007/978-3-030-85262-7_2
Published: 08 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85261-0
Online ISBN: 978-3-030-85262-7
eBook Packages: Computer Science, Computer Science (R0)
