Vectorized Barrier and Reduction in LLVM OpenMP Runtime

  • Conference paper
OpenMP: Enabling Massive Node-Level Parallelism (IWOMP 2021)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12870)

Included in the following conference series: IWOMP, International Workshop on OpenMP

Abstract

Barrier synchronization is a well-known operation in parallel processing that can be an obstacle to performance in parallel programs, particularly at high thread counts. Similarly, reduction is a collective communication pattern frequently used in parallel applications and must be optimized for applications to achieve their best performance. With the introduction of multi-core and many-core processors, several new barrier and reduction implementations have been proposed. As the number of cores per node continues to grow, implementations of these primitives need to be revisited and adapted to upcoming architectures. We see an opportunity to improve synchronization by exploiting the vector units present in modern and future CPU designs based on vector ISAs such as ARM's Scalable Vector Extension and the RISC-V Vector extension. In this work we propose vectorized barriers and reductions using the vector-length-agnostic paradigm and implement them in the LLVM OpenMP runtime. Our barrier implementation achieves up to 2.2× and 1.4× speedup over the default LLVM OpenMP implementation on Intel KNL and Fujitsu A64FX, respectively.
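To make the approach concrete, the following is a minimal sketch of a vector-length-agnostic barrier arrival check and partial-sum reduction written with ARM SVE ACLE intrinsics (compiled with, e.g., -march=armv8-a+sve). It illustrates the technique described above but is not the paper's runtime code: the data layout (one arrival byte per thread, a contiguous array of per-thread partial sums), the helper names all_threads_arrived and reduce_partials, and the omission of atomics and memory fences are assumptions made for brevity.

    /* Sketch only: not the authors' LLVM OpenMP runtime implementation. */
    #include <arm_sve.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Barrier "gather" check: each thread publishes the current sense value
     * into its own byte of arrived[]; the checking thread scans VL flags per
     * iteration. svcntb()/svwhilelt keep the loop vector-length agnostic. */
    static bool all_threads_arrived(const uint8_t *arrived, int64_t nthreads,
                                    uint8_t sense) {
        for (int64_t i = 0; i < nthreads; i += (int64_t)svcntb()) {
            svbool_t pg = svwhilelt_b8_s64(i, nthreads);  /* active lanes   */
            svuint8_t flags = svld1_u8(pg, arrived + i);  /* up to VL flags */
            /* Any active flag that differs from the current sense means at
             * least one thread has not reached the barrier yet. */
            if (svptest_any(pg, svcmpne_n_u8(pg, flags, sense)))
                return false;
        }
        return true;
    }

    /* Reduction: sum per-thread partial results, VL doubles at a time. */
    static double reduce_partials(const double *partials, int64_t nthreads) {
        double sum = 0.0;
        for (int64_t i = 0; i < nthreads; i += (int64_t)svcntd()) {
            svbool_t pg = svwhilelt_b64_s64(i, nthreads);
            sum += svaddv_f64(pg, svld1_f64(pg, partials + i));
        }
        return sum;
    }

In an actual runtime the waiting thread would spin on a check like all_threads_arrived() with acquire semantics and a back-off policy, and a RISC-V Vector version would follow the same structure, using vsetvl-driven strip-mining in place of SVE predication.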

Acknowledgements

This work was done as part of the European Processor Initiative (EPI) project. EPI (FPA: 800928) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at HPC2N, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. We thank the Barcelona Supercomputing Center (BSC-CNS) for their support and for providing access to the CTE-ARM cluster.

Author information

Corresponding author

Correspondence to Muhammad Nufail Farooqi.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Farooqi, M.N., Pericàs, M. (2021). Vectorized Barrier and Reduction in LLVM OpenMP Runtime. In: McIntosh-Smith, S., de Supinski, B.R., Klinkenberg, J. (eds.) OpenMP: Enabling Massive Node-Level Parallelism. IWOMP 2021. Lecture Notes in Computer Science, vol. 12870. Springer, Cham. https://doi.org/10.1007/978-3-030-85262-7_2

  • DOI: https://doi.org/10.1007/978-3-030-85262-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85261-0

  • Online ISBN: 978-3-030-85262-7

  • eBook Packages: Computer Science, Computer Science (R0)
