Abstract
Barrier synchronization is a well-known operation in parallel processing that can become an obstacle to performance in parallel programs, particularly at high thread counts. Similarly, reduction is a collective communication pattern frequently used in parallel applications, and it needs to be optimized for applications to achieve their best performance. With the introduction of multi-core and many-core processors, several new barrier and reduction implementations have been proposed. As the number of cores per node continues to grow, implementations of these primitives need to be revisited and adapted for upcoming architectures. We see an opportunity to improve synchronization by exploiting the vector units present in modern and future CPU designs based on vector ISAs such as ARM’s Scalable Vector Extension and the RISC-V Vector extension. In this work we propose vectorized barriers and reductions using the vector-length-agnostic paradigm and implement them in the LLVM OpenMP runtime. Our barrier implementation achieves up to 2.2\(\times\) and 1.4\(\times\) speedup over the default LLVM OpenMP implementation on Intel KNL and Fujitsu A64FX, respectively.
Acknowledgements
This work was carried out as part of the European Processor Initiative project. The European Processor Initiative (EPI) (FPA: 800928) has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at HPC2N, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. We thank the Barcelona Supercomputing Center (BSC-CNS) for their support and for providing access to the CTE-ARM cluster.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Farooqi, M.N., Pericàs, M. (2021). Vectorized Barrier and Reduction in LLVM OpenMP Runtime. In: McIntosh-Smith, S., de Supinski, B.R., Klinkenberg, J. (eds) OpenMP: Enabling Massive Node-Level Parallelism. IWOMP 2021. Lecture Notes in Computer Science, vol 12870. Springer, Cham. https://doi.org/10.1007/978-3-030-85262-7_2
DOI: https://doi.org/10.1007/978-3-030-85262-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85261-0
Online ISBN: 978-3-030-85262-7
eBook Packages: Computer Science, Computer Science (R0)