
Exploiting GPUs with the Super Instruction Architecture

International Journal of Parallel Programming

Abstract

The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays, which are typically very large. The SIA consists of a domain-specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, the Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedup factors ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the “programming-with-blocks” approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.
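As a minimal illustration of the programming-with-blocks idea (a sketch in plain C++, not SIAL; the names Block and contract_blocked are hypothetical), the loops below range over block indices, so each block-block product is a single coarse-grained operation that a runtime such as the SIA can dispatch to a CPU DGEMM or, as this paper describes, to a GPU:

    #include <vector>

    /* Sketch only: a matrix held as a 2-D grid of blocks,
       column-major inside each block. */
    struct Block {
        int rows = 0, cols = 0;
        std::vector<double> data;                       /* rows x cols */
        double &at(int r, int c) { return data[r + c * rows]; }
        double at(int r, int c) const { return data[r + c * rows]; }
    };

    /* C[I][K] += sum over J of A[I][J] * B[J][K], where I, J, K index
       blocks rather than scalars; each block product is one unit of
       work (a "super instruction"). */
    void contract_blocked(const std::vector<std::vector<Block>> &A,
                          const std::vector<std::vector<Block>> &B,
                          std::vector<std::vector<Block>> &C)
    {
        for (size_t I = 0; I < A.size(); ++I)
            for (size_t K = 0; K < B.front().size(); ++K)
                for (size_t J = 0; J < B.size(); ++J) {
                    const Block &a = A[I][J], &b = B[J][K];
                    Block &c = C[I][K];
                    for (int i = 0; i < a.rows; ++i)
                        for (int k = 0; k < b.cols; ++k)
                            for (int j = 0; j < a.cols; ++j)
                                c.at(i, k) += a.at(i, j) * b.at(j, k);
                }
    }

The point of the block granularity is that the inner triple loop above is exactly the part the runtime is free to replace with an optimized or accelerated kernel.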



Notes

  1. Tensor contraction operations occur frequently in the domain and are defined as follows: let \(\alpha, \beta, \gamma\) be mutually disjoint, possibly empty lists of indices of multidimensional arrays representing the tensors. Then the contraction of \(A[\alpha,\beta]\) with \(B[\beta,\gamma]\) yields \(C[\alpha,\gamma] = \sum_{\beta} A[\alpha,\beta] * B[\beta,\gamma]\). Typically, contractions are implemented by (possibly) permuting one of the arrays and then performing a DGEMM; see the sketch following these notes.

  2. We refer to “the” segment size for convenience; it is not required that all segments within a rank be the same size. The way an index is segmented is part of its type and is fixed during program initialization. There are several segment index types corresponding to domain-specific concepts: for example, aoindex and moindex represent atomic orbital and molecular orbital indices, respectively. This allows the type system to perform useful checks on the consistent use of index variables; a sketch of such index metadata appears after these notes.

  3. The syntax has been slightly simplified.

  4. As can be seen from the SIAL code fragment in Fig. 2, data transfers between nodes do not overlap with GPU instructions.
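To make the permute-then-DGEMM strategy of note 1 concrete, here is a minimal sketch, not the SIA implementation: it contracts two matrix-shaped blocks on the GPU with a single cuBLAS DGEMM call. The function name contract_block and the segment extents Na, Nb, Nc are hypothetical; the blocks are assumed to be column-major and already resident on the device.

    #include <cublas_v2.h>

    /* Minimal sketch (not the SIA code): the contraction
       C[a,c] = sum_b A[a,b] * B[b,c] as one DGEMM on the GPU. */
    void contract_block(cublasHandle_t handle,
                        const double *dA,   /* block A[a,b], Na x Nb */
                        const double *dB,   /* block B[b,c], Nb x Nc */
                        double *dC,         /* block C[a,c], Na x Nc */
                        int Na, int Nb, int Nc)
    {
        const double one = 1.0, zero = 0.0;
        /* m = Na, n = Nc, k = Nb; the leading dimensions equal the row
           counts because each block is stored contiguously. */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    Na, Nc, Nb,
                    &one, dA, Na, dB, Nb, &zero, dC, Na);
    }

When the contraction index \(\beta\) is not already contiguous in memory, one operand is permuted first; for two-index blocks the permutation reduces to a transpose and can instead be folded into the call via CUBLAS_OP_T.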
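And here is a minimal sketch of the segmented-index metadata described in note 2, assuming hypothetical names (IndexKind, SegmentedIndex, check_compatible); it only illustrates why fixing the segmentation and kind at initialization enables the consistency checks mentioned there.

    #include <stdexcept>
    #include <vector>

    /* Hypothetical illustration (not SIA source): metadata for a
       segmented index. Segment sizes within a rank need not be
       uniform; the segmentation and the domain-specific kind are
       fixed when the program is initialized. */
    enum class IndexKind { AOIndex, MOIndex };  /* mirrors SIAL's aoindex/moindex */

    struct SegmentedIndex {
        IndexKind kind;
        std::vector<int> segment_sizes;  /* per-segment sizes, possibly unequal */
    };

    /* One check the type system can then perform: two index variables
       used interchangeably must agree in kind and segmentation. */
    void check_compatible(const SegmentedIndex &a, const SegmentedIndex &b)
    {
        if (a.kind != b.kind || a.segment_sizes != b.segment_sizes)
            throw std::runtime_error("inconsistent use of index variables");
    }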


Acknowledgments

Shawn McDowell provided the CUDA implementation of the contraction operator. This work was supported by the National Science Foundation Grant OCI-0725070 and by the Office of Science of the U.S. Department of Energy under Grant DE-SC0002565. The development of the SIA and ACES III has also been supported by the US Department of Defense’s High Performance Computing Modernization Program (HPCMP) under two programs: the Common High Performance Computing Software Initiative (CHSSI), Project CBD-03, and User Productivity Enhancement and Technology Transfer (PET). We also thank the University of Florida High Performance Computing Center for the use of its facilities.

Author information

Corresponding author

Correspondence to Beverly A. Sanders.


About this article


Cite this article

Jindal, N., Lotrich, V., Deumens, E. et al. Exploiting GPUs with the Super Instruction Architecture. Int J Parallel Prog 44, 309–324 (2016). https://doi.org/10.1007/s10766-014-0319-4
