
Exploiting GPUs with the Super Instruction Architecture

International Journal of Parallel Programming

Abstract

The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays, which are typically very large. The SIA consists of a domain-specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, the Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedup factors ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the “programming-with-blocks” approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.
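As a minimal illustration of the programming-with-blocks idea (a sketch in plain C++, not SIAL; the names Block and contract_blocked are hypothetical), the loops below range over block indices, so each block-block product is a single coarse-grained operation that a runtime such as the SIA can dispatch to a CPU DGEMM or, as this paper describes, to a GPU:

    #include <vector>

    /* Sketch only: a matrix held as a 2-D grid of blocks,
       column-major inside each block. */
    struct Block {
        int rows = 0, cols = 0;
        std::vector<double> data;                       /* rows x cols */
        double &at(int r, int c) { return data[r + c * rows]; }
        double at(int r, int c) const { return data[r + c * rows]; }
    };

    /* C[I][K] += sum over J of A[I][J] * B[J][K], where I, J, K index
       blocks rather than scalars; each block product is one unit of
       work (a "super instruction"). */
    void contract_blocked(const std::vector<std::vector<Block>> &A,
                          const std::vector<std::vector<Block>> &B,
                          std::vector<std::vector<Block>> &C)
    {
        for (size_t I = 0; I < A.size(); ++I)
            for (size_t K = 0; K < B.front().size(); ++K)
                for (size_t J = 0; J < B.size(); ++J) {
                    const Block &a = A[I][J], &b = B[J][K];
                    Block &c = C[I][K];
                    for (int i = 0; i < a.rows; ++i)
                        for (int k = 0; k < b.cols; ++k)
                            for (int j = 0; j < a.cols; ++j)
                                c.at(i, k) += a.at(i, j) * b.at(j, k);
                }
    }

The point of the block granularity is that the inner triple loop above is exactly the part the runtime is free to replace with an optimized or accelerated kernel.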



Notes

  1. Tensor contraction operations occur frequently in the domain and are defined as follows: let \(\alpha, \beta, \gamma\) be mutually disjoint, possibly empty lists of indices of multidimensional arrays representing the tensors. Then the contraction of \(A[\alpha,\beta]\) with \(B[\beta,\gamma]\) yields \(C[\alpha,\gamma] = \sum_{\beta} A[\alpha,\beta] * B[\beta,\gamma]\). Typically, contractions are implemented by (possibly) permuting one of the arrays and then performing a DGEMM; see the sketch following these notes.

  2. We refer to “the” segment size for convenience; it is not required that all segments within a rank be the same size. The way an index is segmented is part of its type and is fixed during program initialization. There are several segment index types corresponding to domain-specific concepts: for example, aoindex and moindex represent atomic orbital and molecular orbital indices, respectively. This allows the type system to perform useful checks on the consistent use of index variables; a sketch of such index metadata appears after these notes.

  3. The syntax has been slightly simplified.

  4. As can be seen from the SIAL code fragment in Fig. 2, data transfers between nodes do not overlap with GPU instructions.
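To make the permute-then-DGEMM strategy of note 1 concrete, here is a minimal sketch, not the SIA implementation: it contracts two matrix-shaped blocks on the GPU with a single cuBLAS DGEMM call. The function name contract_block and the segment extents Na, Nb, Nc are hypothetical; the blocks are assumed to be column-major and already resident on the device.

    #include <cublas_v2.h>

    /* Minimal sketch (not the SIA code): the contraction
       C[a,c] = sum_b A[a,b] * B[b,c] as one DGEMM on the GPU. */
    void contract_block(cublasHandle_t handle,
                        const double *dA,   /* block A[a,b], Na x Nb */
                        const double *dB,   /* block B[b,c], Nb x Nc */
                        double *dC,         /* block C[a,c], Na x Nc */
                        int Na, int Nb, int Nc)
    {
        const double one = 1.0, zero = 0.0;
        /* m = Na, n = Nc, k = Nb; the leading dimensions equal the row
           counts because each block is stored contiguously. */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    Na, Nc, Nb,
                    &one, dA, Na, dB, Nb, &zero, dC, Na);
    }

When the contraction index \(\beta\) is not already contiguous in memory, one operand is permuted first; for two-index blocks the permutation reduces to a transpose and can instead be folded into the call via CUBLAS_OP_T.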
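And here is a minimal sketch of the segmented-index metadata described in note 2, assuming hypothetical names (IndexKind, SegmentedIndex, check_compatible); it only illustrates why fixing the segmentation and kind at initialization enables the consistency checks mentioned there.

    #include <stdexcept>
    #include <vector>

    /* Hypothetical illustration (not SIA source): metadata for a
       segmented index. Segment sizes within a rank need not be
       uniform; the segmentation and the domain-specific kind are
       fixed when the program is initialized. */
    enum class IndexKind { AOIndex, MOIndex };  /* mirrors SIAL's aoindex/moindex */

    struct SegmentedIndex {
        IndexKind kind;
        std::vector<int> segment_sizes;  /* per-segment sizes, possibly unequal */
    };

    /* One check the type system can then perform: two index variables
       used interchangeably must agree in kind and segmentation. */
    void check_compatible(const SegmentedIndex &a, const SegmentedIndex &b)
    {
        if (a.kind != b.kind || a.segment_sizes != b.segment_sizes)
            throw std::runtime_error("inconsistent use of index variables");
    }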


Acknowledgments

Shawn McDowell provided the CUDA implementation of the contraction operator. This work was supported by the National Science Foundation Grant OCI-0725070 and by the Office of Science of the U.S. Department of Energy under Grant DE-SC0002565. The development of the SIA and ACES III has also been supported by the US Department of Defense’s High Performance Computing Modernization Program (HPCMP) under two programs: the Common High Performance Computing Software Initiative (CHSSI), Project CBD-03, and User Productivity Enhancement and Technology Transfer (PET). We also thank the University of Florida High Performance Computing Center for the use of its facilities.

Author information

Corresponding author

Correspondence to Beverly A. Sanders.


About this article


Cite this article

Jindal, N., Lotrich, V., Deumens, E. et al. Exploiting GPUs with the Super Instruction Architecture. Int J Parallel Prog 44, 309–324 (2016). https://doi.org/10.1007/s10766-014-0319-4
