ABSTRACT
To facilitate full-chip capacitance extraction, field solvers are typically deployed to characterize capacitance libraries for various interconnect structures and configurations. Over the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents FMMGpu, the first massively parallel capacitance extraction algorithm that accelerates the well-known fast multipole method (FMM) on modern graphics processing units (GPUs). We propose GPU-friendly data structures and SIMD parallel algorithm flows to facilitate FMM-based 3-D capacitance extraction on GPUs. We also propose GPU performance modeling methods that balance the workload of each critical kernel in our FMMGpu implementation by taking advantage of the concurrent kernel execution that the latest Fermi GPUs support on their streaming multiprocessors (SMs). Our experimental results show that FMMGpu achieves 22X to 30X speedups in capacitance extraction across various test cases. We also show that, even for small test cases that may not fully utilize the GPU's hardware resources, the proposed cube clustering and workload balancing techniques bring an extra 20% to 60% performance improvement.
Index Terms
- Fast multipole method on GPU: tackling 3-D capacitance extraction on massively parallel SIMD platforms