An (almost) direct deployment of the Fast Multipole Method on the Cell processor

The Journal of Supercomputing

Abstract

This paper presents the first deployment of the Fast Multipole Method on the Cell processor (PowerXCell 8i). We rely on the matrix formulation with BLAS routines of the FMB code (Fast Multipole with BLAS) to directly and efficiently offload the most time-consuming operators of both the far-field and near-field computations onto the heterogeneous cores of the Cell. We first detail the difficulties that had to be solved, and then present a deployment in single and double precision that scales linearly on several Cell blades and handles both uniform and non-uniform distributions of particles. We also present performance results and comparisons with multicore CPUs, as well as the limitations of our deployment on the Cell processor.

Acknowledgements

This work was carried out with partial support from HPC@LR, a Competence Center in High-Performance Computing from the Languedoc-Roussillon region, funded by the Languedoc-Roussillon region, the European Union, and the Université Montpellier 2 Sciences et Techniques. The authors would like to cordially thank the system teams at HPC@LR and at Polytech’Paris-UPMC, as well as B. Cirou at CINES, for helpful assistance during the performance tests.

Author information

Correspondence to Pierre Fortin.

Cite this article

Fortin, P., Lamotte, JL. An (almost) direct deployment of the Fast Multipole Method on the Cell processor. J Supercomput 65, 1205–1222 (2013). https://doi.org/10.1007/s11227-013-0877-z

  • DOI: https://doi.org/10.1007/s11227-013-0877-z
