An OpenMP* Barrier Using SIMD Instructions for Intel® Xeon Phi™ Coprocessor

Caballero, Diego; Duran, Alejandro; Martorell, Xavier

doi:10.1007/978-3-642-40698-0_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8122)

Included in the following conference series:

International Workshop on OpenMP

Abstract

Barrier synchronisation has been widely studied since the supercomputer era because of its significant impact on the overall performance of parallel applications. With the current shift to many-core architectures, such as the Intel® Many Integrated Core Architecture, software barriers need to be revisited from an on-chip point of view so that they exploit the specific resources these architectures provide. In this paper, we propose a tree-based barrier that takes advantage of SIMD instructions and the inter-thread cache locality provided by the 4-way SMT of the Intel® Xeon Phi™ coprocessor. Our SIMD approach shows a speed-up of up to 2.84x over the default Intel OpenMP* barrier in the EPCC barrier microbenchmark. It also improves the performance of Livermore Loop kernel number six and the NAS MG benchmark by up to 60% and 21%, respectively.
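The general shape of a tree-based barrier that groups threads by core can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it is written in Python, which has no SIMD instructions, so the single comparison of a packed flag vector merely stands in for one SIMD compare over flags that share a cache line. The names `GroupedBarrier`, `GROUP_SIZE` and `demo` are hypothetical, and the two-level layout (groups of four threads, then one root) is an assumption chosen to mirror the 4-way SMT mentioned in the abstract.

```python
import threading
import time

GROUP_SIZE = 4  # assumption: one group per core, mirroring 4-way SMT


class GroupedBarrier:
    """Two-level sense-reversing barrier: threads -> groups of 4 -> root."""

    def __init__(self, num_threads):
        assert num_threads % GROUP_SIZE == 0
        self.num_groups = num_threads // GROUP_SIZE
        # One packed flag vector per group (stand-in for bytes in one cache line).
        self.arrive = [bytearray(GROUP_SIZE) for _ in range(self.num_groups)]
        # One packed flag vector at the root, one byte per group leader.
        self.root = bytearray(self.num_groups)
        self.release = 0  # written only by thread 0
        self.local_sense = [0] * num_threads

    def wait(self, tid):
        # Flip this thread's sense so the same flags can be reused next episode.
        sense = self.local_sense[tid] ^ 1
        self.local_sense[tid] = sense
        group, lane = divmod(tid, GROUP_SIZE)
        self.arrive[group][lane] = sense  # signal arrival in the group vector
        if lane == 0:
            # Group leader: check all sibling flags with one vector comparison.
            full = bytes([sense]) * GROUP_SIZE
            while self.arrive[group] != full:
                time.sleep(0)  # yield to other threads while spinning
            self.root[group] = sense
            if group == 0:
                # Root: check all group leaders with one vector comparison.
                full_root = bytes([sense]) * self.num_groups
                while self.root != full_root:
                    time.sleep(0)
                self.release = sense  # broadcast the release
        while self.release != sense:
            time.sleep(0)


def demo(num_threads=8, rounds=20):
    """Return a list of barrier violations; an empty list means none were seen."""
    barrier = GroupedBarrier(num_threads)
    progress = [0] * num_threads
    violations = []

    def worker(tid):
        for r in range(1, rounds + 1):
            progress[tid] = r
            barrier.wait(tid)
            # After the barrier, every thread must have reached round r.
            if min(progress) < r:
                violations.append((tid, r))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

Calling `demo()` runs eight threads through twenty barrier episodes and returns the list of observed violations. Keeping each group's flags in one small contiguous vector is the part of the design that a SIMD implementation could exploit, since a leader can then test all of its siblings at once instead of polling them one by one.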

Author information

Authors and Affiliations

Barcelona Supercomputing Center, Spain
Diego Caballero & Xavier Martorell
Universitat Politècnica de Catalunya, Spain
Diego Caballero & Xavier Martorell
Intel Corporation, USA
Alejandro Duran

Editor information

Editors and Affiliations

Research School of Computer Science, The Australian National University, Australia
Alistair P. Rendell
Dept. of Computer Science, University of Houston Oak Ridge National Laboratory, USA
Barbara M. Chapman
Lehrstuhl für Hochleistungsrechnen und Rechen- und Kommunikationszentrum, RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Matthias S. Müller

About this paper

Cite this paper

Caballero, D., Duran, A., Martorell, X. (2013). An OpenMP* Barrier Using SIMD Instructions for Intel® Xeon Phi™ Coprocessor. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds) OpenMP in the Era of Low Power Devices and Accelerators. IWOMP 2013. Lecture Notes in Computer Science, vol 8122. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40698-0_8

DOI: https://doi.org/10.1007/978-3-642-40698-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40697-3
Online ISBN: 978-3-642-40698-0
eBook Packages: Computer Science, Computer Science (R0)
