Abstract
The rapidly evolving CUDA environment is well suited for numerical integration of high-dimensional integrals, in particular by Monte Carlo or quasi-Monte Carlo methods. With some care, near-peak performance can be obtained on important applications. A basis for efficient numerical integration using CUDA kernels on NVIDIA GPUs is presented, showing several ways to use CUDA features, provide automatic error control, and prevent or detect roundoff errors. This framework allows easy extension to multiple GPUs, clusters and clouds for addressing problems that were impractical to attack in the past, and is the basis for an update to the ParInt numerical integration package.
References
CUDA Library. http://www.nvidia.com/getcuda (last accessed May 2014)
Brown, R.: DIEHARDER. http://www.phy.duke.edu/~rgb/General/dieharder.php (last accessed May 2014)
Chan, T.F., Golub, G.H., LeVeque, R.J.: Updating formulae and a pairwise algorithm for computing sample variances. Technical Report STAN-CS-79-773, Stanford University ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf (1979)
Davis, P.J., Rabinowitz, P.: Methods of Numerical Integration. Academic, New York (1975)
de Doncker, E., Assaf, R.: GPU integral computations in stochastic geometry. In: VII Workshop Computational Geometry and Applications (CGA). Lecture Notes in Computer Science, vol. 7972, pp. 129–139 (2013)
de Doncker, E., Kapenga, J., Liou, W.W.: Open source software for Monte Carlo/DSMC applications. In: 55th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, The American Institute of Aeronautics and Astronautics (AIAA) (2014). doi:10.2514/6.2014-0348
de Doncker, E., Yuasa, F.: Distributed and multi-core computation of 2-loop integrals. In: 15th International Workshop on Advanced Computing and Analysis Techniques in Physics (ACAT 2013), Journal of Physics: Conference Series. To appear (2014)
Demmel, J., Nguyen, H.D.: Fast reproducible floating-point summation. In: 2013 21st IEEE Symposium on Computer Arithmetic (ARITH), pp. 163–172 (2013)
Genz, A.: MVNPACK. http://www.math.wsu.edu/faculty/genz/software/fort77/mvnpack.f (2010)
Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991)
Higham, N.J.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia (2002). ISBN 978-0-89871-521-7
IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985. Institute of Electrical and Electronics Engineers, New York (1985). Reprinted in SIGPLAN Notices 22(2), 9–25 (1987)
IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-2008. Institute of Electrical and Electronics Engineers, New York (2008)
Kapenga, J., de Doncker, E.: Compensated summation on multiple NVIDIA GPUs. HPCS Technical Report HPCS-2014-1, Western Michigan University (2014)
Knuth, D.E.: The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 3rd edn. Addison-Wesley (1998)
L’Ecuyer, P.: Combined multiple recursive random number generators. Oper. Res. 44, 816–822 (1996)
Laporta, S.: High-precision calculation of multi-loop Feynman integrals by difference equations. Int. J. Mod. Phys. A 15, 5087–5159 (2000). arXiv:hep-ph/0102033v1
L’Ecuyer, P., Simard, R.: A C library for empirical testing of random number generators. ACM Trans. Math. Softw. 33, 22 (2007)
Manssen, M., Weigel, M., Hartmann, A.K.: Random number generators for massively parallel simulations on GPU (2012). arXiv:1204.6193v1 [physics.comp-ph] 27 April 2012
Marsaglia, G.: DIEHARD: a battery of tests of randomness. http://www.stat.fsu.edu/pub/diehard
Marsaglia, G.: Xorshift RNGs. J. Stat. Softw. 8, 1–6 (2003)
Matsumoto, M., Nishimura, T.: Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8(1), 3–30 (1998)
Muller, J.-M., Brisebarre, N., de Dinechin, F., Jeannerod, C.-P., Lefèvre, V., Melquiond, G., Revol, N., Stehlé, D., Torres, S.: Handbook of Floating-Point Arithmetic. Birkhäuser, Boston (2010). ISBN 978-0-8176-4704-9
NVIDIA. Tesla Product Literature. http://www.nvidia.com/object/tesla_product_literature.html (last accessed May 2014)
NVIDIA. http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf (last accessed May 2014)
Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation part i: Faithful rounding. SIAM J. Sci. Comput. 31(1), 189–224 (2008)
Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation part ii: Sign, k-fold faithful and rounding to nearest. SIAM J. Sci. Comput. 31(2), 1269–1302 (2008)
Saito, M., Matsumoto, M.: Variants of Mersenne twister suitable for graphics processors. ACM Trans. Math. Softw. 39(2), Article 12 (2013)
Salmon, J.K., Moraes, M.A.: Random123: a library of counter-based random number generators. http://deshawresearch.com/resources_random123.html, and Random123-1.06 Documentation, http://www.thesalmons.org/john/random123/releases/1.06/docs (last accessed May 2014)
Salmon, J.K., Moraes, M.A., Dror, R.O., Shaw, D.E.: Parallel random numbers: as easy as 1, 2, 3. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC11) (2011)
Sanders, J., Kandrot, E.: CUDA by Example - An Introduction to General-Purpose GPU Programming. Addison-Wesley, Reading (2011). ISBN: 978-0-13-138768-3
SPRNG: The scalable parallel random number generators library. http://www.sprng.org (last accessed May 2014)
Whitehead, N., Fit-Floreas, A.: Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf Nvidia developers (2011)
Acknowledgements
We acknowledge the support from the National Science Foundation under Award Number 1126438, and from NVIDIA for the award of our CUDA Teaching Center.
Appendix
As an application we treated a Feynman (two-)loop integral, which was previously considered in [7, 17]. This type of problem is important for the calculation of the cross section of particle interactions in high energy physics.
Table 13.2 lists the integral approximation, absolute error and error estimate, and the parallel time of a GPU computation on Kepler K20, using the MC kernel with Random123 as the PRNG, and Kahan summation on the GPU. The results in the bottom part of the table (for N ≥ 4 × 108) are obtained using the horizontal dynamic parallelism strategy with a chunk size of 200 million. For the smaller values of N in the top part of the table, the chunk size is set equal to N, so no child kernels are launched. For these values of N, the execution time is compared to that of a corresponding sequential calculation where erand48() is called to generate the pseudo-random sequence on the CPU. Substantial speedups, approaching peak performance, are observed. Note that the error decreases to 9.9e−08 at N = 4 × 1010. The integrand function is given below.
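The Kahan (compensated) summation and chunking strategy described above can be sketched as follows. This is a host-side C++ illustration rather than the chapter's CUDA kernel: on the GPU each thread would keep its own running sum and compensation term, and child kernels would process the chunks. The function names (`mc_kahan`, `fxy`), the toy 2-D integrand, and the use of `std::mt19937_64` in place of Random123 are assumptions introduced for this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Toy integrand over the unit square; the Feynman loop integrand of the
// appendix would take its place. Exact integral of x*y over [0,1]^2 is 0.25.
static double fxy(double x, double y) { return x * y; }

// Chunked Monte Carlo estimate with Kahan compensated summation:
// the samples are processed in chunks (mirroring the horizontal dynamic
// parallelism strategy), and each addition recovers the low-order bits
// lost to rounding so that roundoff does not accumulate over large N.
double mc_kahan(double (*f)(double, double), long long n, long long chunk,
                unsigned long long seed) {
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double sum = 0.0, c = 0.0;  // running sum and Kahan compensation
    for (long long done = 0; done < n; done += chunk) {
        long long m = std::min(chunk, n - done);  // size of this chunk
        for (long long i = 0; i < m; ++i) {
            double y = f(u(gen), u(gen)) - c;  // corrected addend
            double t = sum + y;                // new (rounded) running sum
            c = (t - sum) - y;                 // bits lost in the addition
            sum = t;
        }
    }
    return sum / (double)n;  // Monte Carlo estimate of the integral
}
```

With, e.g., one million samples in chunks of 200,000, the estimate lands close to the exact value 0.25; the same compensation step, applied per thread on the GPU, is what keeps the error of very long sums (N up to 4 × 1010 in Table 13.2) under control.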
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
de Doncker, E., Kapenga, J., Assaf, R. (2014). Monte Carlo Automatic Integration with Dynamic Parallelism in CUDA. In: Kindratenko, V. (eds) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_13
DOI: https://doi.org/10.1007/978-3-319-06548-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06547-2
Online ISBN: 978-3-319-06548-9