Hardware Support for OpenMP Collective Operations

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5898)

Abstract

Efficient implementation of OpenMP collective operations (e.g., barriers and reductions) is essential for good performance from OpenMP programs. State-of-the-art on-chip networks and block-based cache coherence protocols used in shared-memory Chip MultiProcessors (CMPs) are inefficient for implementing these collective operations. The performance of CMPs can be seriously degraded by the multitude of memory requests and coherence messages required to implement collective operations. To provide efficient support for OpenMP collective operations, this paper presents a CMP-AFN architecture and Instruction Set Architecture (ISA) extensions that augment a conventional shared-memory CMP with a tightly integrated Aggregate Function Network (AFN) that implements low-latency collectives without using or interfering with the memory hierarchy. For a modest increase in circuit complexity, traffic within a CMP’s internal network is dramatically reduced, improving the performance of caches and reducing power consumption. Full-system simulations of 16-core CMPs show that a CMP-AFN significantly outperforms the reference design, eliminating more than 60% of memory accesses and more than 70% of private L1 data cache misses in both the EPCC OpenMP microbenchmarks and the SPEC OMP benchmarks.
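
For readers unfamiliar with the two collectives named in the abstract, the following is a minimal sketch of how they look at the OpenMP source level. It is not taken from the paper: the function name `sum_of_squares` and the two-phase structure are assumptions chosen purely for illustration. On a conventional shared-memory CMP, a runtime would typically implement both the barrier and the reduction through shared variables, which is the kind of memory and coherence traffic the abstract describes.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the paper): returns 1^2 + 2^2 + ... + n^2,
   or -1 if the scratch buffer cannot be allocated. */
long sum_of_squares(int n)
{
    if (n <= 0) {
        return 0;
    }

    long total = 0;
    long *squares = malloc((size_t)n * sizeof *squares);
    if (squares == NULL) {
        return -1;
    }

    #pragma omp parallel
    {
        /* Phase 1: threads fill disjoint parts of the array.
           nowait drops the implicit barrier at the end of this loop. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            squares[i] = (long)(i + 1) * (i + 1);
        }

        /* Barrier collective: no thread enters phase 2 until every
           thread has finished writing its part of the array. */
        #pragma omp barrier

        /* Reduction collective: each thread sums into a private copy of
           total, and the runtime combines the copies at the end. */
        #pragma omp for reduction(+:total)
        for (int i = 0; i < n; i++) {
            total += squares[i];
        }
    }

    free(squares);
    return total;
}
```

Built with an OpenMP-enabled compiler (for example `gcc -fopenmp`), the pragmas create a thread team; built without OpenMP support, the pragmas are ignored and the function runs serially with the same result. Calling `sum_of_squares(4)` yields 30 in either case.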


Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, S.P., Midkiff, S.P., Dietz, H.G. (2010). Hardware Support for OpenMP Collective Operations. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_3

  • DOI: https://doi.org/10.1007/978-3-642-13374-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13373-2

  • Online ISBN: 978-3-642-13374-9

  • eBook Packages: Computer Science, Computer Science (R0)
