
Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

The Journal of Supercomputing

Abstract

Heterogeneous computing is consolidating its place in parallel computing, with architectures such as the Cell Broadband Engine (Cell BE) and Graphics Processing Units (GPUs) now present in a myriad of high performance computing developments. These platforms provide a Software Development Kit (SDK) to maximize performance, at the expense of dealing with complex, low-level architectural details that make software development a daunting task. This paper explores stencil computations in several heterogeneous programming models, such as Cell SDK, CellSs, ALF and CUDA, to optimize the Jacobi method for solving Laplace’s differential equation. We describe the programming techniques used to extract maximum performance on the Cell BE and the GPU, and compare their computing paradigms. Experimental results are shown on two Nvidia Teslas and one IBM BladeCenter QS20 blade, which incorporates two 3.2 GHz Cell BEs v 5.1. The speed-up factor for our set of GPU optimizations reaches 3–4×, and the GPU execution times are an order of magnitude lower than those of the Cell BE. The GPU also shows great scalability when moving towards newer GPU generations and/or more demanding problem sizes.
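To make the computation concrete, here is a minimal sequential sketch of a Jacobi sweep for Laplace's equation. It is not the authors' Cell BE or CUDA code. It assumes a 2D square grid, the standard five-point stencil, fixed (Dirichlet) boundary values and a simple maximum-change stopping test, none of which is specified in the abstract. The names `jacobi_laplace` and `make_grid` are hypothetical.

```python
def jacobi_laplace(grid, max_iters=10_000, tol=1e-6):
    """Run Jacobi sweeps on a 2D grid and return (solution, sweeps_done).

    Assumption: border cells hold fixed boundary values and only the
    interior cells are updated, each to the mean of its four neighbours.
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    sweeps = 0
    for _ in range(max_iters):
        # Jacobi reads only the previous sweep, so write into a second buffer.
        nxt = [row[:] for row in current]
        max_change = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                new = 0.25 * (current[i - 1][j] + current[i + 1][j]
                              + current[i][j - 1] + current[i][j + 1])
                max_change = max(max_change, abs(new - current[i][j]))
                nxt[i][j] = new
        current = nxt
        sweeps += 1
        if max_change < tol:
            break
    return current, sweeps


def make_grid(n, top=100.0):
    """Hypothetical test problem: top edge held at `top`, other edges at 0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [top] * n
    return grid


solution, sweeps = jacobi_laplace(make_grid(16))
print(f"stopped after {sweeps} sweeps")
```

Every interior update in a sweep depends only on values from the previous sweep, so all points can be updated independently. That independence is what makes the stencil a natural fit for data-parallel platforms such as the ones compared in the paper.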





Acknowledgements

This work has been jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under projects 00001/CS/2007 and 15290/PI/2010 and under fellowship 12461/FPI/09, and by the Spanish MICINN and European Commission FEDER funds under projects Consolider Ingenio-2010 CSD2006-00046 and TIN2009-14475-C04. We also thank NVIDIA for the hardware donation under the Professor Partnership 2008–2010 and the CUDA Teaching Center Award 2011–2012.

Author information

Corresponding author

Correspondence to José M. Cecilia.

About this article

Cite this article

Cecilia, J.M., Abellán, J.L., Fernández, J. et al. Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE. J Supercomput 62, 787–803 (2012). https://doi.org/10.1007/s11227-012-0749-y

  • DOI: https://doi.org/10.1007/s11227-012-0749-y
