ABSTRACT
Interest in Graphics Processing Units (GPUs) is skyrocketing due to their potential to yield spectacular performance on many important computing applications. Unfortunately, writing efficient GPU kernels requires painstaking manual optimization, which is highly error-prone. We contribute the first comprehensive symbolic verifier for kernels written in CUDA C. Called the 'Prover of User GPU programs (PUG),' our tool efficiently and automatically analyzes real-world kernels using Satisfiability Modulo Theories (SMT) tools, detecting bugs such as data races, incorrectly synchronized barriers, bank conflicts, and wrong results. PUG's innovative ideas include a novel approach to symbolically encoding thread interleavings, exact analysis for correct barrier placement, special methods for avoiding interleaving generation, dividing up the analysis over barrier intervals, and handling loops through three approaches: loop normalization, overapproximation, and invariant finding. PUG has analyzed over a hundred CUDA kernels from public distributions and in-house projects, finding bugs as well as subtle undocumented assumptions.
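To make the data-race property concrete, the following is a minimal sketch of a race check over a single barrier interval. It is not PUG's encoding: it enumerates pairs of thread ids by brute force where an SMT solver would reason symbolically, and the function `find_race` and the two example access lists are hypothetical names introduced only for illustration. The property checked is the usual one: two distinct threads touch the same address between barriers and at least one of them writes.

```python
from itertools import combinations


def find_race(accesses, num_threads):
    """Return (t1, t2, address) for the first conflicting pair, or None.

    accesses: list of (kind, address_fn) pairs within one barrier interval,
    where kind is "r" or "w" and address_fn maps a thread id to an index.
    Brute-force enumeration stands in for a symbolic (SMT) query here.
    """
    for t1, t2 in combinations(range(num_threads), 2):
        for kind1, addr1 in accesses:
            for kind2, addr2 in accesses:
                # Two reads never conflict; at least one side must write.
                if kind1 == "r" and kind2 == "r":
                    continue
                if addr1(t1) == addr2(t2):
                    return (t1, t2, addr1(t1))
    return None


# Hypothetical kernel body: a[tid] = a[tid + 1], with no barrier between
# the read and the write.
racy = [("r", lambda tid: tid + 1), ("w", lambda tid: tid)]

# Hypothetical kernel body: each thread reads and writes only its own slot.
safe = [("r", lambda tid: tid), ("w", lambda tid: tid)]

print(find_race(racy, 4))  # thread 0 reads a[1] while thread 1 writes a[1]
print(find_race(safe, 4))  # no two distinct threads share an address
```

Enumeration scales with the number of threads, which is one reason a symbolic treatment of thread ids, as the abstract describes, is attractive for realistic kernels.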
Index Terms
- Scalable SMT-based verification of GPU kernel functions