Abstract
Programmers face many challenges in obtaining performance on machines with increasingly capable, yet increasingly complex, hardware. A trend toward task-parallel and asynchronous many-task programming models aims to alleviate the burden of parallel programming on a vast array of current and future platforms. One such model, Concurrent Collections (CnC), provides a programming paradigm that emphasizes separation of concerns: domain experts concentrate on their algorithms and correctness, while performance experts handle mapping and tuning to a target platform. With the CnC model, a deep understanding of parallel constructs and behavior is not necessary to write parallel applications that run on various multi-threaded and multi-core platforms. However, performance can vary greatly depending on the granularity of the tasks and data declared by the programmer. These program-specific decisions are not part of CnC's tuning capabilities and must be tuned within the program itself. We analyze performance behavior for the LULESH application in CnC by tuning various elements in each collection, and we demonstrate the effects of different techniques for modifying task and data granularity in CnC collections. Our fully tiled CnC implementation outperforms its OpenMP counterpart by 3× on 48 processors. Finally, we propose guidelines for emulating these techniques to obtain high performance while improving programmability.
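The coarsening idea behind the abstract's "fully tiled" variant can be illustrated with a minimal sketch. This is not the CnC API or the paper's actual implementation; it is a hypothetical C++ example in which a `std::map` stands in for an item collection, contrasting one logical task per element against one task per tile of elements, the granularity change the paper tunes:

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// A tile of values; in the tiled scheme one collection entry holds a
// whole tile rather than a single element.
using Tile = std::vector<double>;

// Fine-grained view: each loop iteration models one step instance
// operating on one element (high per-task scheduling overhead).
double fine_grained_sum(const std::vector<double>& data) {
    double sum = 0.0;
    for (double x : data)
        sum += x;  // one "task" per element
    return sum;
}

// Coarsened view: elements are grouped into tiles of `tile_size`,
// and each tile is processed by one "task", amortizing overhead.
double tiled_sum(const std::vector<double>& data, std::size_t tile_size) {
    std::map<std::size_t, Tile> items;  // stand-in for an item collection
    for (std::size_t i = 0; i < data.size(); ++i)
        items[i / tile_size].push_back(data[i]);  // key = tile id
    double sum = 0.0;
    for (const auto& kv : items)  // one step instance per tile
        sum += std::accumulate(kv.second.begin(), kv.second.end(), 0.0);
    return sum;
}
```

Both versions compute the same result; only the unit of work handed to the scheduler changes, which is why granularity can be tuned without altering program semantics.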
Acknowledgments
This research is supported by the Department of Energy under contract DE-FC02-12ER26104. We would also like to thank Ellen Porter, Kath Knobe, Nick Vrvilo, and Zoran Budimlić for their comments and feedback during discussions regarding CnC.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Liu, C., Kulkarni, M. (2017). Evaluating Performance of Task and Data Coarsening in Concurrent Collections. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science(), vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_24
Print ISBN: 978-3-319-52708-6
Online ISBN: 978-3-319-52709-3