Abstract
With advances in manycore and accelerator architectures, the high-performance and embedded computing spaces are rapidly converging. Emerging architectures feature different forms of parallelism. Polyhedral Process Networks (PPNs) are a proven model for the automated generation of pipeline- and task-parallel programs from sequential source code; however, they do not address data parallelism. In this paper, we present a systematic approach for the identification and extraction of fine-grain data parallelism from a PPN specification. The approach is implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs). First experiments indicate that the generated applications have the potential to exploit the different forms of parallelism provided by the architecture, and that the kernels feature a highly regular structure amenable to subsequent optimizations.
Index Terms
- KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks