Abstract
Mobile GPU applications are usually constrained by real-time requirements. However, the FLOPS of a mobile GPU are limited by the size and power supply of the SoC system. Like desktop GPUs, a mobile GPU has an on-chip memory hierarchy, and proper use of this hierarchy accelerates mobile GPU applications such as the Discrete Wavelet Transform (DWT) so that they can satisfy the real-time requirement. In this paper, by taking advantage of GPU shared memory in the Tegra K1, a mobile GPU from Nvidia, we develop a Bank Conflict Free Shared Memory Parallel DWT for mobile GPU applications. Computational results show that, at a display resolution of \(640 \times 350\) (EGA), the Bank Conflict Free Shared Memory Parallel DWT is significantly faster than the SoC CPU-based DWT. Computational results also show that, at display resolutions of \(320\times 200\) (CGA), \(640\times 480\) (VGA), \(800\times 600\) (SVGA) and \(1024\times 768\) (XGA), the Bank Conflict Free Shared Memory Parallel DWT can generally satisfy the real-time requirement.
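The abstract does not spell out how bank conflicts are avoided, so the following is only a minimal sketch of the general idea and not the paper's implementation. It assumes a shared memory with 32 banks of one 4-byte word each and a warp of 32 threads, and it models a row-major tile in which each thread reads one element of the same column. The names `conflict_degree`, `column_access` and `row_pitch` are hypothetical, and padding each row by one word is a commonly used technique chosen here for illustration.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 shared memory banks, each one 4-byte word wide
WARP_SIZE = 32   # assumption: 32 threads access shared memory together


def conflict_degree(word_indices):
    """Largest number of distinct words that fall into the same bank.

    1 means the access is conflict free; N means an N-way conflict.
    Threads reading the same word are counted once (broadcast).
    """
    per_bank = Counter(idx % NUM_BANKS for idx in set(word_indices))
    return max(per_bank.values())


def column_access(row_pitch, column=0):
    """Word indices read when thread t loads element (t, column) of a
    row-major tile whose rows are row_pitch words apart."""
    return [t * row_pitch + column for t in range(WARP_SIZE)]


# Rows exactly 32 words apart: every thread hits the same bank.
unpadded = conflict_degree(column_access(row_pitch=32))

# Rows padded to 33 words: consecutive rows shift by one bank.
padded = conflict_degree(column_access(row_pitch=33))

print(unpadded, padded)
```

Under these assumptions the unpadded column read serializes into a 32-way conflict, while the padded layout spreads the same read across all banks. A column-wise filtering pass over a tile is the kind of access pattern where this distinction would matter.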
Acknowledgments
We thank Nvidia for providing the Jetson TK1 development board through the Tegra K1 CUDA Vision Challenge 2014–2015.
About this article
Cite this article
Zhao, D. Fast filter bank convolution for three-dimensional wavelet transform by shared memory on mobile GPU computing. J Supercomput 71, 3440–3455 (2015). https://doi.org/10.1007/s11227-015-1443-7
DOI: https://doi.org/10.1007/s11227-015-1443-7