Fast filter bank convolution for three-dimensional wavelet transform by shared memory on mobile GPU computing

The Journal of Supercomputing

Abstract

Mobile GPU applications are usually constrained by real-time requirements. However, the FLOPS of a mobile GPU are limited by the size and power supply of the SoC. Like desktop GPUs, a mobile GPU has an on-chip memory hierarchy, and proper use of that hierarchy accelerates mobile GPU applications such as the Discrete Wavelet Transform (DWT) enough to satisfy the real-time requirement. In this paper, by taking advantage of GPU shared memory in the Tegra K1, a mobile GPU from Nvidia, we develop a Bank Conflict Free Shared Memory Parallel DWT for mobile GPU applications. Computational results show that, at a display resolution of \(640 \times 350\) (EGA), Bank Conflict Free Shared Memory Parallel DWT is significantly faster than SoC CPU-based DWT. The results also show that, at display resolutions of \(320\times 200\) (CGA), \(640\times 480\) (VGA), \(800\times 600\) (SVGA) and \(1024\times 768\) (XGA), Bank Conflict Free Shared Memory Parallel DWT can generally satisfy the real-time requirement.
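
To illustrate what "bank conflict free" means in practice, the following sketch counts shared-memory bank conflicts for one warp under a simplified model. It is not the method of the paper, whose kernel is not shown in this preview. The model assumes 32 banks of 4-byte words and a 32-thread warp, with thread t reading element [t][column] of a row-major tile, as a column-direction filter pass might. The function name conflict_degree and the padded stride of 33 are illustrative assumptions; padding each row by one word is a common way to avoid such conflicts.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared-memory banks, each one 4-byte word wide
WARP_SIZE = 32   # assumption: 32 threads per warp


def conflict_degree(row_stride, column=0):
    """Return the largest number of distinct words that fall into one bank
    when thread t of a warp reads tile[t][column] from a row-major tile
    whose rows are row_stride words apart. 1 means conflict free."""
    words_per_bank = defaultdict(set)
    for t in range(WARP_SIZE):
        word = t * row_stride + column            # word index read by thread t
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    print("stride 32:", conflict_degree(32))   # unpadded tile
    print("stride 33:", conflict_degree(33))   # tile padded by one word per row
```

Under this model, an unpadded stride of 32 sends every thread of the warp to the same bank, so the accesses are serialized, while a stride of 33 spreads the 32 reads over 32 different banks.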

Acknowledgments

We thank Nvidia for providing the Jetson TK1 development board through the Tegra K1 CUDA Vision Challenge 2014–2015.

Author information

Corresponding author

Correspondence to Di Zhao.

Cite this article

Zhao, D. Fast filter bank convolution for three-dimensional wavelet transform by shared memory on mobile GPU computing. J Supercomput 71, 3440–3455 (2015). https://doi.org/10.1007/s11227-015-1443-7

  • DOI: https://doi.org/10.1007/s11227-015-1443-7
