Abstract
Conventional programming practices for multicore processors in high performance computing architectures are not universally efficient or scalable across scientific computing algorithms. One possible way to improve efficiency and scalability for applications on this class of machines is a many-tasking runtime system that employs many lightweight, concurrent threads. Yet the performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is difficult to estimate a priori and is not well understood. In this work, we present a case study of a BSP particle-in-cell benchmark code that has been ported to a many-tasking runtime system. The 3-D Gyrokinetic Toroidal Code (GTC) is examined in its original MPI form and compared with a port to the High Performance ParalleX 3 (HPX-3) runtime system. Phase overlap, oversubscription behavior, and work rebalancing in the implementation are explored. Results for GTC obtained with the SST/macro simulator complement the implementation results. Finally, an analytic performance model for GTC is presented to guide future implementation efforts.
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Anderson, M., Brodowicz, M., Kulkarni, A., Sterling, T. (2014). Performance Modeling of Gyrokinetic Toroidal Simulations for a Many-Tasking Runtime System. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_7
DOI: https://doi.org/10.1007/978-3-319-10214-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10213-9
Online ISBN: 978-3-319-10214-6
eBook Packages: Computer Science (R0)