Abstract
Partitioned Global Address Space (PGAS) programming models combine shared and distributed memory features, and provide a foundation for high-productivity parallel programming using lightweight one-sided communications. The OpenSHMEM programming interface has recently begun gaining popularity as a lightweight library-based approach for developing PGAS applications, in part through its use of a symmetric heap to realize more efficient implementations of global pointers than in other PGAS systems. However, current approaches to hybrid inter-node and intra-node parallel programming in OpenSHMEM rely on the use of multithreaded programming models (e.g., pthreads, OpenMP) that harness intra-node parallelism but are opaque to the OpenSHMEM runtime. This OpenSHMEM+X approach can encounter performance challenges such as bottlenecks on shared resources, long pause times due to load imbalances, and poor data locality. Furthermore, OpenSHMEM+X requires the expertise of hero-level programmers, compared to the use of just OpenSHMEM. All of these are hard challenges to mitigate with incremental changes. This situation will worsen as computing nodes increase their use of accelerators and heterogeneous memories.
In this paper, we introduce the AsyncSHMEM PGAS library which supports a tighter integration of shared and distributed memory parallelism than past OpenSHMEM implementations. AsyncSHMEM integrates the existing OpenSHMEM reference implementation with a thread-pool-based, intra-node, work-stealing runtime. It aims to prepare OpenSHMEM for future generations of HPC systems by enabling the use of asynchronous computation to hide data transfer latencies, supporting tight interoperability of OpenSHMEM with task parallel programming, improving load balance (both of communication and computation), and enhancing locality. In this paper we present the design of AsyncSHMEM, and demonstrate the performance of our initial AsyncSHMEM implementation by performing a scalability analysis of two benchmarks on the Titan supercomputer. These early results are promising, and demonstrate that AsyncSHMEM is more programmable than the OpenSHMEM+OpenMP model, while delivering comparable performance for a regular benchmark (ISx) and superior performance for an irregular benchmark (UTS).
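The thread-pool-based work-stealing idea mentioned above can be illustrated with a toy sketch. It is not AsyncSHMEM's implementation and makes no OpenSHMEM calls. Every name in it (`WorkStealingPool`, `spawn`, `visit`, `child_count`) is hypothetical, and Python is used only for brevity. Each worker owns a deque, runs its newest task first, and steals the oldest task from another worker when idle. The workload is a small, deterministic irregular tree, loosely in the spirit of an unbalanced tree search.

```python
import threading
import time
from collections import deque


class WorkStealingPool:
    """Toy work-stealing pool: one deque per worker; idle workers steal."""

    def __init__(self, num_workers):
        self.deques = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self.pending = 0                   # tasks spawned but not yet finished
        self.pending_lock = threading.Lock()
        self.executed = [0] * num_workers  # tasks completed by each worker

    def spawn(self, worker_id, fn, *args):
        # Count the task before it becomes visible, so pending == 0
        # really means "nothing queued and nothing running".
        with self.pending_lock:
            self.pending += 1
        with self.locks[worker_id]:
            self.deques[worker_id].append((fn, args))

    def _take(self, worker_id):
        # Own work first: pop the newest task (LIFO end).
        with self.locks[worker_id]:
            if self.deques[worker_id]:
                return self.deques[worker_id].pop()
        # Otherwise steal the oldest task (FIFO end) from another worker.
        n = len(self.deques)
        for offset in range(1, n):
            victim = (worker_id + offset) % n
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None

    def _worker(self, worker_id):
        while True:
            task = self._take(worker_id)
            if task is None:
                with self.pending_lock:
                    if self.pending == 0:
                        return             # all work is done
                time.sleep(0)              # yield, then look for work again
                continue
            fn, args = task
            fn(self, worker_id, *args)     # may spawn child tasks
            self.executed[worker_id] += 1
            with self.pending_lock:
                self.pending -= 1

    def run(self, fn, *args):
        self.spawn(0, fn, *args)           # all work starts on worker 0
        threads = [threading.Thread(target=self._worker, args=(i,))
                   for i in range(len(self.deques))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(self.executed)


def child_count(node_id, depth):
    # Hypothetical rule producing an irregular tree of bounded depth.
    if depth >= 8:
        return 0
    return (node_id * 7 + depth) % 4


def visit(pool, worker_id, node_id, depth):
    # Visiting a node spawns one task per child on the current worker.
    for i in range(child_count(node_id, depth)):
        pool.spawn(worker_id, visit, node_id * 4 + i + 1, depth + 1)


def count_sequential(node_id=1, depth=0):
    # Reference traversal used to check the parallel node count.
    return 1 + sum(count_sequential(node_id * 4 + i + 1, depth + 1)
                   for i in range(child_count(node_id, depth)))


if __name__ == "__main__":
    pool = WorkStealingPool(4)
    total = pool.run(visit, 1, 0)
    print("nodes visited:", total, "per worker:", pool.executed)
```

All work starts on worker 0, so any tasks completed by the other workers were obtained by stealing. The per-worker split varies from run to run, while the total node count does not.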
Acknowledgments
This research was funded in part by the United States Department of Defense, and was supported by resources at Los Alamos National Laboratory.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Grossman, M., Kumar, V., Budimlić, Z., Sarkar, V. (2016). Integrating Asynchronous Task Parallelism with OpenSHMEM. In: Gorentla Venkata, M., Imam, N., Pophale, S., Mintz, T. (eds) OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. OpenSHMEM 2016. Lecture Notes in Computer Science(), vol 10007. Springer, Cham. https://doi.org/10.1007/978-3-319-50995-2_1
DOI: https://doi.org/10.1007/978-3-319-50995-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50994-5
Online ISBN: 978-3-319-50995-2
eBook Packages: Computer Science, Computer Science (R0)