Abstract
Beyond the three rotational degrees of freedom provided by VR (rotation about the X, Y and Z axes), six-degrees-of-freedom (6DoF) video also lets the viewer freely control the viewing position. When watching a sports game, the audience is no longer limited by the position of the camera and can freely choose the viewing angle and position, just as in the real world, which greatly improves the sense of immersion. However, the major barrier to industrializing live 6DoF video is its extremely high computational complexity; in particular, multi-view depth estimation and Depth Image Based Rendering (DIBR) are difficult to realize. In addition, existing devices lack hardware interfaces that support multi-view coding. New techniques for depth estimation and virtual view synthesis are therefore needed, and existing hardware encoding/decoding interfaces should be reused to reduce power consumption. In this paper, we present a 6DoF live video system that comprises a multi-view depth estimation technique based on unsupervised learning, a real-time virtual viewpoint rendering technique, and a 6DoF video coding method. Experimental results demonstrate that our proposed acceleration method speeds up the original depth estimation algorithm by more than 34x and the original DIBR algorithm by more than 168x. Experimental results also show that our 6DoF video coding method achieves average bitrate savings of 70%, 64%, 33%, 60% and 66% for the AVC, HEVC, AV1, AVS3 and VVC codec standards, respectively.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (61672063, 61902008) and by Shenzhen Research Projects JCYJ20180503182128089 and 201806080921419290. It is also partially supported by the project "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications" (LZC0019). We thank Hisense for providing the experimental platform and data evaluation.
About this article
Cite this article
Cai, Y., Gao, X., Chen, W. et al. Towards 6DoF live video streaming system for immersive media. Multimed Tools Appl 81, 35875–35898 (2022). https://doi.org/10.1007/s11042-021-11589-2
DOI: https://doi.org/10.1007/s11042-021-11589-2