Asynchronous Nested Parallelism for Dynamic Applications in Distributed Memory

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9519)

Abstract

Nested parallelism is of increasing interest for both expressivity and performance. Many problems are naturally expressed in this divide-and-conquer style. In addition, programmers with knowledge of the target architecture employ nested parallelism for performance, imposing a hierarchy on the application to increase locality and resource utilization, often at the cost of implementation complexity.

While dynamic applications are a natural fit for the approach, support for nested parallelism on distributed systems is generally limited to well-structured applications engineered with distinct phases of intra-node computation and inter-node communication. This model makes irregular applications difficult to express and hurts performance by introducing unnecessary latency and synchronization. In this paper we describe an approach to asynchronous nested parallelism that treats nested computation uniformly across distributed memory. This approach allows efficient execution while supporting dynamic applications that cannot be mapped onto the machine in the rigid manner of regular applications. We use several graph algorithms as examples to demonstrate our library’s expressivity, flexibility, and performance.
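The pattern the abstract describes can be sketched with a small divide-and-conquer reduction in which every level of the recursion may spawn further asynchronous subtasks, with no global phase barriers. This is an illustrative shared-memory analogy only: the paper's approach targets distributed memory via the STAPL runtime, and the function and parameter names below are hypothetical, not the authors' API.

```python
# Illustrative sketch only: a shared-memory analogy (Python futures) for
# asynchronous nested parallelism. The paper's actual system (STAPL-RTS)
# targets distributed memory; names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def nested_sum(pool, data, cutoff=8):
    """Divide-and-conquer reduction where every level may spawn
    asynchronous subtasks, with no global synchronization phases."""
    if len(data) <= cutoff:          # base case: compute locally
        return sum(data)
    mid = len(data) // 2
    # Spawn the left half as an asynchronous task; that task may itself
    # spawn further nested tasks before anyone waits on it.
    left = pool.submit(nested_sum, pool, data[:mid], cutoff)
    right = nested_sum(pool, data[mid:], cutoff)   # recurse inline
    return left.result() + right    # synchronize only this subtree

# max_workers must exceed the number of tasks blocked in result(),
# or nested blocking waits can deadlock a fixed-size thread pool.
with ThreadPoolExecutor(max_workers=32) as pool:
    print(nested_sum(pool, list(range(100))))  # prints 4950
```

Note that each `result()` call waits only on its own subtree, unlike phase-based models where all nodes synchronize between computation and communication steps.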



Acknowledgments

This research is supported in part by NSF awards CNS-0551685, CCF-0702765, CCF-0833199, CCF-1439145, CCF-1423111, CCF-0830753, IIS-0916053, IIS-0917266, EFRI-1240483, RI-1217991, by NIH NCI R25 CA090301-11, by DOE awards DE-AC02-06CH11357, DE-NA0002376, B575363, by Samsung, IBM, Intel, and by Award KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Correspondence to Ioannis Papadopoulos.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Papadopoulos, I., Thomas, N., Fidel, A., Hoxha, D., Amato, N.M., Rauchwerger, L. (2016). Asynchronous Nested Parallelism for Dynamic Applications in Distributed Memory. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science(), vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_7

  • DOI: https://doi.org/10.1007/978-3-319-29778-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29777-4

  • Online ISBN: 978-3-319-29778-1
