
Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

Published in: International Journal of Parallel Programming

Abstract

The possibility of porting algorithms to graphics processing units (GPUs) has raised significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may then limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, appropriate use of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for 2D Laplace's equation. Experiments carried out with various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on an NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing almost linear speed-up even when communication is carried out over relatively slow networks.
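To illustrate the underlying method, the following is a minimal pure-Python sketch (not the authors' CUDA implementation): a Jacobi sweep for 2D Laplace's equation, together with a model of the row-wise domain decomposition in which each strip is padded with one halo row copied from its neighbour — the data a multi-GPU solver must exchange every iteration. The function names and the two-strip split are illustrative assumptions.

```python
def jacobi_sweep(u):
    """One Jacobi sweep for Laplace's equation on a 2D grid (list of rows).
    Boundary values stay fixed; each interior point becomes the average
    of its four neighbours."""
    new = [row[:] for row in u]
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def jacobi_sweep_split(u, cut):
    """The same sweep, with the grid split into two horizontal strips at
    row `cut`, each padded with one halo row from its neighbour."""
    top = [row[:] for row in u[:cut + 1]]   # strip 0 plus halo row `cut`
    bot = [row[:] for row in u[cut - 1:]]   # halo row `cut - 1` plus strip 1
    # With halo rows in place, both strips can be swept independently
    # (e.g. on separate GPUs, concurrently with the next halo transfer).
    new_top = jacobi_sweep(top)
    new_bot = jacobi_sweep(bot)
    return new_top[:cut] + new_bot[1:]      # drop halo rows when stitching
```

Because each strip sees a one-row halo of its neighbour's current values, the stitched result is identical to a sweep over the undivided grid. In the paper's setting these halo rows are the only data that must cross the PCIe bus or the network each iteration, which is what makes overlapping their transfer with computation on the interior rows profitable.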



Notes

  1. Compute Capability defines the hardware configuration of a GPU, e.g. the amount of shared memory and registers, the presence of implicit caching, etc.


Acknowledgments

The authors wish to thank Dr. Mark Stillwell for proof-reading the original manuscript and his valuable and constructive comments.

Author information

Correspondence to Michał Czapiński.


Cite this article

Czapiński, M., Thompson, C. & Barnes, S. Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation. Int J Parallel Prog 42, 1032–1047 (2014). https://doi.org/10.1007/s10766-013-0293-2
