Abstract
Computational Fluid Dynamics (CFD) applications place heavy demands on parallel computing. Many such applications have moved from expensive massively parallel processing (MPP) machines to cost-effective Networks of Workstations (NOW). Auto-CFD-NOW is a pre-compiler that transforms sequential Fortran CFD programs into efficient message-passing parallel programs that run on NOW. Our work makes three contributions. First, the pre-compiler is highly automatic and requires a minimal number of user directives for parallelization. Second, we apply a dependency analysis technique for CFD applications, called analysis after partitioning, and propose a mirror-image decomposition technique to parallelize self-dependent field loops that are hard to parallelize with existing methods. Finally, whereas traditional communication optimizations focus on eliminating redundant synchronizations, we have developed an optimization scheme that combines all the non-redundant synchronizations in CFD programs to further reduce communication overhead. Auto-CFD-NOW has been implemented on networks of workstations and has been used successfully to parallelize structured CFD application programs automatically. Our experiments show its effectiveness and scalability for parallelizing large CFD applications.
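The idea of combining non-redundant synchronizations can be illustrated with a toy model. The sketch below is not the scheme implemented in Auto-CFD-NOW, whose details the abstract does not give; it only assumes that two neighbouring subdomains must exchange boundary values of several field arrays, and it compares one message per array with one packed message per direction. The `Channel` class, the function names, and the sample field values are all hypothetical.

```python
# Toy model: two subdomains exchange boundary values of several field arrays.
# All names are illustrative; this is not Auto-CFD-NOW's generated code.

class Channel:
    """Simulated link between two processes that counts messages sent."""

    def __init__(self):
        self.messages = 0

    def send(self, payload):
        self.messages += 1
        return list(payload)  # the receiver gets its own copy


def exchange_separately(channel, left_fields, right_fields):
    """Send one message per field array and per direction."""
    halos_for_left, halos_for_right = [], []
    for lf, rf in zip(left_fields, right_fields):
        halos_for_right.append(channel.send([lf[-1]])[0])
        halos_for_left.append(channel.send([rf[0]])[0])
    return halos_for_left, halos_for_right


def exchange_combined(channel, left_fields, right_fields):
    """Send one message per direction carrying the boundaries of all fields."""
    halos_for_right = channel.send([lf[-1] for lf in left_fields])
    halos_for_left = channel.send([rf[0] for rf in right_fields])
    return halos_for_left, halos_for_right


# Three hypothetical field arrays (say u, v, p) on each subdomain.
left = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
right = [[3.0, 4.0], [30.0, 40.0], [300.0, 400.0]]

separate, combined = Channel(), Channel()
result_separate = exchange_separately(separate, left, right)
result_combined = exchange_combined(combined, left, right)

print(separate.messages, combined.messages)
```

Both variants deliver the same boundary values, but the combined variant pays the per-message startup cost once per direction rather than once per field array, which is the kind of saving that matters on a workstation network with high message latency.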
Additional information
This work is supported in part by the China National Aerospace Science Foundation, and by the U.S. National Science Foundation under grants CCR-9812187, CCR-0098055, CCF-0325760, CCF-0514078, and CNS-0549006.
About this article
Cite this article
Xiao, L., Zhang, X., Kuang, Z. et al. Auto-CFD-NOW: A pre-compiler for effectively parallelizing CFD applications on networks of workstations. J Supercomput 38, 189–217 (2006). https://doi.org/10.1007/s11227-006-8324-z
DOI: https://doi.org/10.1007/s11227-006-8324-z