
PRODA: improving parallel programs on GPUs through dependency analysis

Published in: Cluster Computing

Abstract

The GPU's powerful parallel processing capability has been widely recognized throughout the industry; however, GPU computing environments have not yet been widely adopted in the field of parallel computing. In this study, we develop a method for parallelizing serial programs for GPU computing. In particular, we propose an approach called PRODA to speed up parallel programs on GPUs through dependency analysis. PRODA provides the theoretical underpinnings of task partitioning for parallel programs running in GPU computing environments. At the heart of PRODA is an analyzer for program workflows as well as data and function dependencies in a GPU program. With this dependency analysis in place, PRODA assigns computing tasks to multiple GPU cores in a way that speeds up the performance of parallel programs on GPUs. An overarching goal of PRODA is to minimize the data-communication cost between the GPUs and the main memory of the host CPU. PRODA achieves this goal by deploying two strategies. First, PRODA assigns functions that process the same data to a single GPU core. Second, PRODA runs multiple independent functions on separate GPU cores. In doing so, PRODA improves the parallelism of parallel programs. We evaluate the performance of PRODA by running two popular benchmarks (i.e., AES and T26) on a 256-core system, where the key length is set to 256 bits. The experimental results show that the speedup ratio of AES governed by PRODA is 5.2. Specifically, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we test PRODA and the alternative solutions by running AES with a 256-bit key length on 128 cores. The cost of parallel computing in PRODA is 524.8 ms, which is 61.2% lower than that of the existing SA solution. The parallel efficiency of PRODA is 2.08, an improvement over the PDM algorithm by a factor of 2.08.
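The two strategies above can be sketched as a grouping heuristic: treat functions as nodes of a data-sharing graph, merge functions that touch the same data into one group (so they land on the same GPU core and the data moves to the device once), and spread independent groups across cores. The following is a minimal, hypothetical illustration of this idea, not the paper's actual algorithm; all names (`partition_tasks`, `func_reads`) are illustrative.

```python
# Hypothetical sketch of PRODA-style task partitioning (illustrative
# names, not the paper's API). Functions sharing a data item are merged
# into one group via union-find; each group is placed on one GPU core,
# and independent groups are spread round-robin over the cores.
from collections import defaultdict

def partition_tasks(func_reads, num_cores):
    """func_reads: function name -> set of data items it accesses.
    Returns core_id -> list of functions assigned to that core."""
    parent = {f: f for f in func_reads}

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # invert the access map: data item -> functions that touch it
    by_data = defaultdict(list)
    for f, items in func_reads.items():
        for d in items:
            by_data[d].append(f)

    # merge all functions that share any data item
    for funcs in by_data.values():
        for g in funcs[1:]:
            parent[find(g)] = find(funcs[0])

    # collect connected components (groups of data-dependent functions)
    groups = defaultdict(list)
    for f in func_reads:
        groups[find(f)].append(f)

    # independent groups go to separate cores, round-robin
    assignment = defaultdict(list)
    for i, group in enumerate(groups.values()):
        assignment[i % num_cores].extend(sorted(group))
    return dict(assignment)
```

For example, if `f1` and `f2` both read data item `A` while `f3` only reads `C`, the sketch keeps `f1` and `f2` together on one core and places `f3` on another, avoiding a redundant transfer of `A`.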



Acknowledgements

The authors would like to sincerely thank all the anonymous reviewers for their constructive comments in reviewing this paper. Xiong Wei's work is supported by the Postdoctoral Science Foundation of China (Nos. 2013T61007, 2013M542468). Ming Hu's work is supported in part by the Research Program on Underlying Technologies of Textile Internet-of-Things of Wuhan Textile University under Grant No. 52300100314. Tao Peng's work is supported by the Hubei National Science Foundation under Grant No. 2014CFB764. Xiao Qin's work was supported by the U.S. National Science Foundation under Grants CCF-0845257, CNS-0917137, and CCF-0742187.

Author information

Correspondence to Ming Hu or Tao Peng.


Cite this article

Wei, X., Hu, M., Peng, T. et al. PRODA: improving parallel programs on GPUs through dependency analysis. Cluster Comput 22 (Suppl 1), 2129–2144 (2019). https://doi.org/10.1007/s10586-017-1295-4
