
PRODA: improving parallel programs on GPUs through dependency analysis

Published in: Cluster Computing

Abstract

The GPU's powerful parallel processing capability has been widely recognized throughout the industry; however, GPU computing environments have not yet been widely adopted in the field of parallel computing. In this study, we develop a method for parallelizing serial programs for GPU computing. In particular, we propose an approach called PRODA to speed up parallel programs on GPUs through dependency analysis. PRODA provides the theoretical underpinnings of task partitioning for parallel programs running in GPU computing environments. At the heart of PRODA is an analyzer for program workflows as well as data and function dependencies in a GPU program. With this dependency analysis in place, PRODA assigns computing tasks to multiple GPU cores in a way that speeds up the performance of parallel programs on GPUs. An overarching goal of PRODA is to minimize the data-communication cost between the GPUs and the main memory of the host CPU. PRODA achieves this goal by deploying two strategies. First, PRODA assigns functions that process the same data to a single GPU core. Second, PRODA runs multiple independent functions on separate GPU cores. In doing so, PRODA improves the parallelism of parallel programs. We evaluate the performance of PRODA by running two popular benchmarks (i.e., AES and T26) on a 256-core system, where the key length is set to 256 bits. The experimental results show that the speedup ratio of AES governed by PRODA is 5.2. Specifically, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we test PRODA and the alternative solutions by running AES with a 256-bit key length on 128 cores. The cost of parallel computing in PRODA is 524.8 ms, which is 61.2% lower than that of the existing SA solution. The parallel efficiency of PRODA is 2.08, an improvement over the PDM algorithm by a factor of 2.08.
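The two strategies above can be sketched as a grouping heuristic: treat functions as nodes of a data-sharing graph, merge functions that touch the same data into one group (so they land on the same GPU core and the data moves to the device once), and spread independent groups across cores. The following is a minimal, hypothetical illustration of this idea, not the paper's actual algorithm; all names (`partition_tasks`, `func_reads`) are illustrative.

```python
# Hypothetical sketch of PRODA-style task partitioning (illustrative
# names, not the paper's API). Functions sharing a data item are merged
# into one group via union-find; each group is placed on one GPU core,
# and independent groups are spread round-robin over the cores.
from collections import defaultdict

def partition_tasks(func_reads, num_cores):
    """func_reads: function name -> set of data items it accesses.
    Returns core_id -> list of functions assigned to that core."""
    parent = {f: f for f in func_reads}

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # invert the access map: data item -> functions that touch it
    by_data = defaultdict(list)
    for f, items in func_reads.items():
        for d in items:
            by_data[d].append(f)

    # merge all functions that share any data item
    for funcs in by_data.values():
        for g in funcs[1:]:
            parent[find(g)] = find(funcs[0])

    # collect connected components (groups of data-dependent functions)
    groups = defaultdict(list)
    for f in func_reads:
        groups[find(f)].append(f)

    # independent groups go to separate cores, round-robin
    assignment = defaultdict(list)
    for i, group in enumerate(groups.values()):
        assignment[i % num_cores].extend(sorted(group))
    return dict(assignment)
```

For example, if `f1` and `f2` both read data item `A` while `f3` only reads `C`, the sketch keeps `f1` and `f2` together on one core and places `f3` on another, avoiding a redundant transfer of `A`.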



Acknowledgements

The authors would like to sincerely thank all the anonymous reviewers for their constructive comments in reviewing this paper. Xiong Wei's work is supported by the Postdoctoral Science Foundation of China (Nos. 2013T61007, 2013M542468). Ming Hu's work is supported in part by the Research Program on Underlying Technologies of Textile Internet-of-Things of Wuhan Textile University under Grant No. 52300100314. Tao Peng's work is supported by the Hubei National Science Foundation under Grant No. 2014CFB764. Xiao Qin's work was supported by the U.S. National Science Foundation under Grants CCF-0845257, CNS-0917137, and CCF-0742187.

Author information

Correspondence to Ming Hu or Tao Peng.


Cite this article

Wei, X., Hu, M., Peng, T. et al. PRODA: improving parallel programs on GPUs through dependency analysis. Cluster Comput 22 (Suppl 1), 2129–2144 (2019). https://doi.org/10.1007/s10586-017-1295-4
