
Memory-Optimized Wavefront Parallelism on GPUs

International Journal of Parallel Programming

Abstract

Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. Tiling is a popular way to balance the workload, but hyperplane tiles increase the cost of synchronization and lead to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture: a balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache; a well-tailored mapping of the operations to shared memory improves both spatial and temporal locality. We evaluate the performance of our technique on four applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The results demonstrate speedups of up to six times over the previous best-known hyperplane tiling-based GPU implementation.
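To make the synchronization scheme concrete, the sketch below shows one common way to run a wavefront computation in a single kernel launch with flag-based inter-block synchronization. It is a minimal illustration in CUDA, not the implementation from the paper: the tile width, the `progress` flag array, and the per-band thread mapping are all assumptions made for the example. Each block owns a horizontal band of rows and walks the matrix left to right in column tiles; a block may start tile t only after the block above it has published tile t, so the blocks advance in a software pipeline while the threads inside a block sweep anti-diagonals to respect the north, west, and north-west dependencies.

```cuda
// Minimal sketch (assumed names, not the paper's code): single-launch
// wavefront with flag-based inter-block synchronization.
#include <cuda_runtime.h>

#define TILE_W 32  // columns per synchronization step (illustrative choice)

__global__ void wavefront_band(float *m, int rows, int cols,
                               volatile int *progress)
{
    // Each block owns blockDim.x consecutive rows; row 0 is boundary data.
    int row = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int t = threadIdx.x;
    int numTiles = (cols - 2 + TILE_W) / TILE_W;  // interior columns 1..cols-1

    for (int tile = 0; tile < numTiles; ++tile) {
        // Inter-block lock: one thread spins until the block above has
        // published this tile (the flag is written after a __threadfence
        // below and read here through a volatile pointer).
        if (blockIdx.x > 0 && t == 0)
            while (progress[blockIdx.x - 1] <= tile) { /* spin */ }
        __syncthreads();

        int c0 = 1 + tile * TILE_W;
        // Skewed inner sweep: on step d, thread t updates column c0 + d - t,
        // so the north neighbour (owned by thread t-1) is always one step
        // ahead and the west neighbour was written by this thread last step.
        for (int d = 0; d < TILE_W + blockDim.x - 1; ++d) {
            int col = c0 + d - t;
            if (row < rows && col >= c0 && col < min(c0 + TILE_W, cols))
                m[row * cols + col] = 0.25f * (m[row * cols + col - 1] +
                                               m[(row - 1) * cols + col] +
                                               m[(row - 1) * cols + col - 1]);
            __syncthreads();
        }

        // Publish the finished tile so the block below may proceed.
        if (t == 0) {
            __threadfence();  // make this tile's writes visible device-wide
            progress[blockIdx.x] = tile + 1;
        }
    }
}
```

On the host, `progress` would be a zero-initialized device array with one counter per block (e.g. via cudaMemset), and the grid is sized so the bands cover all interior rows. The pattern is safe from deadlock because block b only ever waits on block b-1, which is scheduled no later than b in practice; a persistent-thread layout removes even that assumption. The paper's contribution lies in choosing the kernel configuration and shared-memory mapping so that far fewer such synchronizations are needed and each one is cheaper; this sketch shows only the basic mechanism.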



Author information

Correspondence to Yuanzhe Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Li, Y., Schwiebert, L. Memory-Optimized Wavefront Parallelism on GPUs. Int J Parallel Prog 48, 1008–1031 (2020). https://doi.org/10.1007/s10766-020-00658-y

