Abstract
This paper presents the first deployment of the Fast Multipole Method on the Cell processor (PowerXCell 8i). We rely on the BLAS-based matrix formulation of the FMB code (Fast Multipole with BLAS) to offload, directly and efficiently, the most time-consuming operators of both the far-field and near-field computations to the heterogeneous cores of the Cell. We first detail the difficulties that had to be overcome. The resulting deployment supports single and double precision, scales linearly across several Cell blades, and handles both uniform and non-uniform particle distributions. We also present performance results, comparisons with multicore CPUs, and the limitations of our deployment on the Cell processor.







Acknowledgements
This work was carried out with partial support from HPC@LR, a Competence Center in High-Performance Computing from the Languedoc-Roussillon region, funded by the Languedoc-Roussillon region, the European Union, and the Université Montpellier 2 Sciences et Techniques. The authors would like to cordially thank the system teams at HPC@LR and at Polytech’Paris-UPMC, as well as B. Cirou at CINES, for helpful assistance during the performance tests.
Cite this article
Fortin, P., Lamotte, JL. An (almost) direct deployment of the Fast Multipole Method on the Cell processor. J Supercomput 65, 1205–1222 (2013). https://doi.org/10.1007/s11227-013-0877-z