ABSTRACT
Graphics processors (GPUs) are highly parallel devices that promise high performance, and they are now flexible enough to be used for general-purpose computing. A programming language based on implicitly data-parallel collective array operations can permit high-level, effective programming of GPUs. I describe three optimizations for such a language: automatic use of the GPU's shared-memory cache, array fusion, and hoisting of nested parallel constructs. Because of the design of the language to which they are applied, these optimizations are simple to implement, yet they can yield large run-time speedups.
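As a rough illustration of what array fusion means in general, the sketch below contrasts two ways of computing a dot product from collective array operations. It is written in plain Python rather than the language the abstract refers to, and the function names `dot_unfused` and `dot_fused` are hypothetical, chosen only for this example. It makes no claim about how the described compiler performs fusion.

```python
from typing import Sequence


def dot_unfused(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two collective operations with an intermediate array between them."""
    # Step 1: elementwise multiply, materializing a temporary array.
    products = [x * y for x, y in zip(xs, ys)]
    # Step 2: reduce the temporary array to a single value.
    total = 0.0
    for p in products:
        total += p
    return total


def dot_fused(xs: Sequence[float], ys: Sequence[float]) -> float:
    """The same computation after fusion: one pass, no temporary array."""
    total = 0.0
    for x, y in zip(xs, ys):
        # The multiply is performed inside the reduction loop,
        # so the intermediate array is never built.
        total += x * y
    return total


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [4.0, 5.0, 6.0]
    print(dot_unfused(a, b), dot_fused(a, b))
```

Both functions return the same result. The fused version avoids writing and re-reading the temporary array, which is the kind of memory traffic that fusion is generally meant to eliminate.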
Index Terms
- Simple optimizations for an applicative array language for graphics processors