research-article

A framework for load balancing of tensor contraction expressions via dynamic task partitioning

Authors: Samyam Rajbhandari, Sriram Krishnamoorthy, P. Sadayappan

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 13, Pages 1 - 10

https://doi.org/10.1145/2503210.2503290

Published: 17 November 2013

Abstract

In this paper, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a domain-specific library for efficient task-parallel execution of tensor contraction expressions, a class of computation encountered in quantum chemistry and physics. Our framework decomposes each contraction into smaller units of work, or tasks, represented by an abstraction referred to as iterators. We exploit an extra level of parallelism by executing tasks from independent contractions concurrently through a dynamic load-balancing runtime. We demonstrate improved performance, scalability, and flexibility for the computation of tensor contraction expressions on parallel computers using examples from Coupled Cluster (CC) methods.
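The idea in the abstract can be illustrated with a minimal sketch. It is not DLTC's API: the names `tile_tasks`, `run_task`, and `execute` are hypothetical, the contraction is reduced to a plain matrix product C[i, j] = sum_k A[i, k] * B[k, j], and a shared in-process queue stands in for the distributed dynamic load-balancing runtime. A generator plays the role of the iterator abstraction, yielding one task per output tile, and tasks from two independent contractions are pooled so that idle workers can pick up work from either.

```python
import queue
import threading


def tile_tasks(name, n_rows, n_cols, tile):
    """Yield one task per output tile of contraction `name` (hypothetical 'iterator')."""
    for i0 in range(0, n_rows, tile):
        for j0 in range(0, n_cols, tile):
            yield (name, i0, min(i0 + tile, n_rows), j0, min(j0 + tile, n_cols))


def run_task(task, operands, results):
    """Compute C[i][j] = sum_k A[i][k] * B[k][j] for a single output tile."""
    name, i0, i1, j0, j1 = task
    a, b = operands[name]
    c = results[name]
    for i in range(i0, i1):
        for j in range(j0, j1):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(len(b)))


def execute(operands, tile=2, n_workers=4):
    """Run the tasks of all independent contractions through one shared queue."""
    results = {
        name: [[0] * len(b[0]) for _ in a] for name, (a, b) in operands.items()
    }
    work = queue.Queue()
    for name, (a, b) in operands.items():
        for task in tile_tasks(name, len(a), len(b[0]), tile):
            work.put(task)

    def worker():
        # Each worker pulls the next available task, whichever contraction it
        # belongs to; this is the (assumed) stand-in for dynamic load balancing.
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return
            run_task(task, operands, results)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each output tile is written by exactly one task, so the sketch needs no locking on the result arrays. Calling `execute({"X": (a, b), "Y": (b, a)})` with nested-list matrices runs both contractions through the same worker pool.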

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN: 9781450323789

DOI: 10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC13

Sponsors: SIGHPC, SIGARCH, IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
