Abstract
Heterogeneous multicore chipsets with many levels of parallelism are becoming increasingly common in high-performance computing systems. Effective use of parallelism in these new chipsets constitutes the challenge facing a new generation of large-scale scientific computing applications. This study examines methods for improving the performance of two-dimensional and three-dimensional atmospheric constituent transport simulation on the Cell Broadband Engine Architecture (CBEA). A function offloading approach is used in a 2D transport module, and a vector stream processing approach is used in a 3D transport module. Two methods for transferring noncontiguous data between main memory and accelerator local storage are compared. By leveraging the heterogeneous parallelism of the CBEA, the 3D transport module achieves performance comparable to two nodes of an IBM BlueGene/P, or eight Intel Xeon cores, on a single PowerXCell 8i chip. Module performance is reported on two CBEA systems, an IBM BlueGene/P, and an eight-core shared-memory Intel Xeon workstation.
Linford, J.C., Sandu, A. Scalable heterogeneous parallelism for atmospheric modeling and simulation. J Supercomput 56, 300–327 (2011). https://doi.org/10.1007/s11227-010-0380-8
DOI: https://doi.org/10.1007/s11227-010-0380-8