Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing

International Journal of Parallel Programming

Abstract

Over the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of computational science teams using these new systems. In this regard, Global Address Space (GAS) programming models provide a straightforward and easy-to-use addressing model, which can lead to improved productivity. However, the scalability of GAS depends directly on the design and implementation of the runtime system on the target petascale distributed-memory architecture. In this paper, we describe the design, implementation, and optimization of the Aggregate Remote Memory Copy Interface (ARMCI) runtime library on the Cray XT5 2.3 PetaFLOPs computer at Oak Ridge National Laboratory. We optimize our implementation with the flow intimation technique, which we introduce in this paper. Our optimized ARMCI implementation improves the scalability of both the Global Arrays programming model and a real-world chemistry application, NWChem, from small jobs up through 180,000 cores.



Author information

Corresponding author

Correspondence to Vinod Tipparaju.

Additional information

This paper was authored by at least one employee of UT-Battelle, LLC, under contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the paper for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this contribution, or allow others to do so, for United States Government purposes.

About this article

Cite this article

Tipparaju, V., Apra, E., Yu, W. et al. Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing. Int J Parallel Prog 40, 633–655 (2012). https://doi.org/10.1007/s10766-012-0214-9

  • DOI: https://doi.org/10.1007/s10766-012-0214-9
