
Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms

  • Original Research Paper
Journal of Real-Time Image Processing

Abstract

Given the prevalent use of multimedia applications on commodity computers equipped with both CPU and GPU devices, the possibility of simultaneously exploiting all the parallelization capabilities of such hybrid platforms for high-performance video encoding has been highly sought after. Accordingly, this manuscript proposes a method to concurrently implement the H.264/advanced video coding (AVC) inter-loop on hybrid GPU + CPU platforms. The method comprises dynamic, dependency-aware task distribution and real-time computational load balancing over both the CPU and the GPU, based on an efficient dynamic performance model. With this balance, the set of optimized parallel algorithms conceived for video coding on both the CPU and the GPU is dynamically instantiated on any of the available processing devices to minimize the overall encoding time. The proposed model not only provides efficient task scheduling and load balancing for the H.264/AVC inter-loop, but also introduces no significant computational burden to the time-limited video coding application. Furthermore, according to the presented experimental results, the proposed scheme provides speedups as high as 2.5 when compared with highly optimized GPU-only encoding solutions or even other state-of-the-art algorithms. Moreover, by simply using the computational resources that usually equip most commodity computers, the proposed scheme achieves inter-loop encoding rates as high as 40 fps at HD 1920 × 1080 resolution.
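To make the idea of load balancing driven by a performance model more concrete, the following is a minimal sketch, not the authors' scheduling method (which the abstract does not detail). It assumes the work of one frame can be divided into macroblock rows, that each device has a measured per-row processing time, and that the estimates are refreshed after every frame with an exponential moving average. The function names `split_rows` and `update_estimate` and the smoothing factor `alpha` are hypothetical.

```python
def split_rows(total_rows, cpu_time_per_row, gpu_time_per_row):
    """Split macroblock rows so both devices finish at about the same time.

    Assumption: each device's time grows linearly with its number of rows,
    so rows are assigned in proportion to device speed (1 / time per row).
    Returns a (cpu_rows, gpu_rows) tuple that always sums to total_rows.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if cpu_time_per_row <= 0 or gpu_time_per_row <= 0:
        raise ValueError("per-row times must be positive")
    cpu_speed = 1.0 / cpu_time_per_row
    gpu_speed = 1.0 / gpu_time_per_row
    gpu_rows = round(total_rows * gpu_speed / (cpu_speed + gpu_speed))
    return total_rows - gpu_rows, gpu_rows


def update_estimate(previous, rows, elapsed, alpha=0.5):
    """Blend the latest measurement into the per-row time estimate.

    Hypothetical model update: an exponential moving average with weight
    alpha on the newest measurement. If the device processed no rows in
    this frame, the previous estimate is kept unchanged.
    """
    if rows == 0:
        return previous
    measured = elapsed / rows
    return (1 - alpha) * previous + alpha * measured


if __name__ == "__main__":
    # Illustrative numbers only: 68 macroblock rows, GPU three times faster.
    cpu_rows, gpu_rows = split_rows(68, cpu_time_per_row=3.0, gpu_time_per_row=1.0)
    print(cpu_rows, gpu_rows)
```

In this sketch the split is recomputed before each frame from the latest estimates, so a device that slows down receives fewer rows in the next frame. A real encoder would also have to respect the data dependencies between inter-loop modules, which this example ignores.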



Acknowledgments

This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project PEst-OE/EEI/LA0021/2013.

Author information

Corresponding author

Correspondence to Svetislav Momcilovic.


About this article

Cite this article

Momcilovic, S., Roma, N. & Sousa, L. Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms. J Real-Time Image Proc 11, 571–587 (2016). https://doi.org/10.1007/s11554-013-0357-y


  • DOI: https://doi.org/10.1007/s11554-013-0357-y
