Overview of the MVC + D 3D video coding standard

https://doi.org/10.1016/j.jvcir.2013.03.013

Highlights

  • MVC + D supports depth-image-based rendering for advanced 3D video use cases.

  • MVC + D decoder can reuse H.264/AVC or MVC hardware decoder implementation modules.

  • MVC + D supports packet-level adaptation to bandwidth and device decoding capability.

  • MVC + D allows asymmetric coding, in which depth has a different spatial resolution than texture.

  • MVC + D typically requires about twice the bitrate of 2D video coded with H.264/AVC.

Abstract

3D video services are emerging in various application domains, including cinema, TV broadcasting, Blu-ray discs, streaming and smartphones. The majority of 3D video content on the market is still based on stereo video, which is typically coded with the multiview video coding (MVC) extension of the Advanced Video Coding (H.264/AVC) standard or as frame-compatible stereoscopic video. However, 3D video technologies face challenges as well as opportunities in supporting more demanding application scenarios, such as immersive 3D telepresence with numerous views and 3D perception adaptation for heterogeneous 3D devices and/or user preferences. The Multiview Video plus Depth (MVD) format enables depth-image-based rendering (DIBR) of additional viewpoints at the decoding side and hence helps in such advanced application scenarios. This paper reviews the MVC + D standard, which specifies an MVC-compatible MVD coding format.

Introduction

3D video is becoming popular, for example, in cinemas, high-definition (HD) TVs and smartphones, with different delivery media, such as Blu-ray discs, content delivery networks (CDN) and TV broadcasting channels. Interoperability of 3D video content across different applications and devices requires standardized 3D video data and compression formats.

In this section, we first review the existing 3D video data and compression formats, intending to explain the relation of the MVC-based 3D video coding standard (MVC + D) to earlier coding standards and formats for 3D video. Then, in Section 1.2, the target applications that may require MVC + D are described. Finally, we outline the contributions of this paper and introduce the structure of the paper in Section 1.3.

Stereoscopic 3D perception is achieved by presenting different views to the left and right eyes, which can be facilitated either with stereoscopic display technology, which requires specific viewing glasses, or with auto-stereoscopic display technology, which requires no viewing glasses. Consequently, coding of two video views is sufficient to obtain 3D perception with stereoscopic displays. Thus, data and compression formats addressing stereoscopic (2-view) and multiview video coding have been developed; they can be coarsely categorized into two families, frame-compatible stereoscopic video and inter-view-predicted multiview video coding, which are described in the next paragraphs.

In frame-compatible stereoscopic video, a picture of the left view and the corresponding picture of the right view, jointly referred to as the constituent pictures, are spatially packed into a single frame or temporally interleaved prior to encoding. Then, a regular 2D video codec is used to encode the frame-packed video signal. At the receiving end, the video bitstream is first decoded conventionally and then the pictures of the left and right views are extracted from the output pictures of the decoder to be rendered correctly on a stereoscopic display. Different types of frame packing have been proposed and are supported by various standards. As a common feature among most types of frame packing, the spatial packing typically reduces the resolution of the constituent pictures (e.g., horizontally or vertically) to half of the respective original resolution, and hence keeps the resolution of the packed frames the same as that of a single view prior to encoding. The most commonly used frame packing formats are the side-by-side and top–bottom arrangements, where the constituent pictures appear next to each other either horizontally or vertically within the packed frame. The applied frame packing type can be signaled by various means, such as with the frame packing arrangement supplemental enhancement information (SEI) message of H.264/AVC [1].
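
As a minimal illustration of side-by-side packing, consider the sketch below. It uses naive column decimation, whereas a real system would typically apply anti-alias filtering before decimation and signal the arrangement with the SEI message mentioned above; the function names and array conventions are ours:

```python
import numpy as np

def pack_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Pack two H x W constituent pictures into one H x W frame by
    halving their horizontal resolution (naive column decimation)."""
    assert left.shape == right.shape
    return np.concatenate([left[:, ::2], right[:, ::2]], axis=1)

def unpack_side_by_side(packed: np.ndarray):
    """Receiver side: split the decoded frame back into the two
    half-width constituent pictures; upsampling back to full width
    happens before display."""
    half = packed.shape[1] // 2
    return packed[:, :half], packed[:, half:]
```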

In inter-view-predicted multiview video coding, the pictures of each view are coded as separate entities, which are referred to as view components in MVC [1], [2]. One of the views is designated as the base view, which can be coded and decoded independently of the other view or views, referred to as non-base views. A picture within a non-base view may use the respective picture in the base view as a reference for inter-view prediction, which improves the compression efficiency of multiview video coding compared to frame-compatible stereoscopic video, where inter-view prediction (e.g., between the views of the same time instance) typically cannot be used [3].
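
The following sketch conveys the reference-selection idea in simplified form; the `Picture` class and the selection logic are illustrative only and do not reproduce the actual MVC reference picture list construction process:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Picture:
    view_id: int
    poc: int  # picture order count, i.e., the time instance

def build_reference_list(current: Picture, decoded: list[Picture],
                         base_view_id: int = 0) -> list[Picture]:
    # Temporal references: previously decoded pictures of the same view.
    temporal = [p for p in decoded
                if p.view_id == current.view_id and p.poc < current.poc]
    # Inter-view reference: the base-view picture of the same access
    # unit -- the prediction that frame-compatible coding cannot exploit.
    inter_view = [p for p in decoded
                  if p.view_id == base_view_id and p.poc == current.poc]
    return temporal + inter_view
```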

Multiview video assumes fixed view positions and does not support viewpoint adjustment or additional viewpoint generation on the receiver side, and it is thus not sufficient for achieving more flexible 3D perception. Consequently, data formats including depth or disparity information associated with the coded video views have been studied, because they enable view synthesis through the depth-image-based rendering (DIBR) process [4]. A sequence of pictures representing depth, disparity or other ranging information for a particular viewpoint is called a depth view, while a regular color image sequence of the view is called a texture view. The pictures of a depth view may be referred to as depth maps regardless of the type of ranging information actually represented by the depth view. In a format often referred to as 2D + Z, one depth view is coded with one coded texture view. The depth view can be regarded as monochromatic video and coded with any regular 2D video codec. The characteristics of the coded depth view, such as the represented depth range, must be indicated in order to use DIBR. The signaling of the depth view characteristics can be done using MPEG-C Part 3 [5], for example. The 2D + Z format provides no means for handling disocclusions, which are typical in synthesized views, and hence it facilitates view synthesis only within a limited viewpoint angle around the single coded view.
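
As a rough illustration of the DIBR principle for a rectified (parallel-camera) setup, the sketch below forward-warps one texture view using per-pixel metric depth. The parameter names are hypothetical, and practical renderers add sub-pixel warping, filtering and hole inpainting:

```python
import numpy as np

def dibr_forward_warp(texture, depth_z, focal_px, baseline):
    """Warp `texture` (H x W x 3) to a virtual camera shifted by
    `baseline` along the x-axis, given metric depth `depth_z` (H x W).
    Returns the synthesized view and a mask of disocclusion holes."""
    h, w = depth_z.shape
    synth = np.zeros_like(texture)
    zbuf = np.full((h, w), np.inf)
    # Horizontal disparity for rectified cameras: d = f * b / Z.
    disparity = np.round(focal_px * baseline / depth_z).astype(int)
    for y in range(h):
        for x in range(w):
            xt = x - disparity[y, x]
            # Z-buffering: the nearest surface wins when pixels collide.
            if 0 <= xt < w and depth_z[y, x] < zbuf[y, xt]:
                synth[y, xt] = texture[y, x]
                zbuf[y, xt] = depth_z[y, x]
    holes = np.isinf(zbuf)  # regions visible only from the new viewpoint
    return synth, holes
```

The `holes` mask marks exactly the disocclusions discussed above; with a single 2D + Z view they can only be filled heuristically, which is why the format supports only a limited viewing angle.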

In order to facilitate view synthesis more flexibly than what 2D + Z can support, the Multiview Video plus Depth (MVD) format has been proposed [6], [7]. In MVD, each texture view is accompanied by a respective depth view. The texture views can be encoded as one bitstream and the respective depth views as another bitstream, each with, e.g., an MVC codec, but then methods for encapsulating the texture bitstream and the depth bitstream into the same container, synchronizing the transmission and decoding of texture and depth views, and buffer management, among other things, would have to be dealt with in the application layer.

While coding of two views can be considered sufficient for achieving 3D perception on stereoscopic displays, it has been discovered that the optimal camera baseline depends on the display size and viewing distance, among other things [8]. Furthermore, it has been found that there are relatively large differences in viewers' stereoscopic disparity preferences [8]. Moreover, auto-stereoscopic display technology, where a relatively large number of views have to be displayed simultaneously, is emerging. 3D video specifications, including MVC + D, are designed to address new application demands and technology trends that cannot be supported by existing standards.

As creation of 3D video content is typically expensive, it is preferable that the same 3D video content can be experienced by users with heterogeneous devices; thus, adaptation of the depth perception to different screen sizes or user preferences is required. Various application scenarios where adaptation of the 3D content is required are illustrated in Fig. 1 and described in the following paragraphs with reference to Fig. 1. Scenario (a) shows depth perception adaptation for a stereoscopic display; Scenario (b) illustrates the use case involving auto-stereoscopic displays; Scenarios (c) and (d) present backward compatibility with MVC and H.264/AVC decoders, respectively; Scenario (e) relates to free-viewpoint video. These scenarios are described below.

The same acquisition and post-production process is often used for 3D content production regardless of the target consumption platform or environment, which may vary from handheld 3D devices to movie theaters. Furthermore, particularly for broadcasting or streaming applications, the same encoded 3D video content is transmitted to any potential end-user devices, which may include for example high-definition 3D television sets (3D HDTVs) and 3D smartphones. However, 3D video content may have been produced in such a manner that the acquisition and post-production parameters, such as camera separation, were optimized for a particular consumption environment, such as 3D HDTVs. Therefore, the depth perception might be sub-optimal for other types of 3D screens or viewing environments.

To provide the best 3D perception across all types of devices, the 3D video content should allow adjusting the 3D perception, e.g., by reducing or increasing the disparity between the two stereoscopic views for a display whose width differs from the display width most suitable for the stereoscopic content. This capability is efficiently supported by coding the 3D video content with MVD.
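
As a toy first-order example of such adjustment (ignoring viewing-distance differences and assuming both displays show the same number of pixels per row), keeping the physical on-screen disparity constant amounts to scaling pixel disparities, or equivalently the synthesized-view baseline, by the ratio of the display widths:

```python
def baseline_scale(ref_display_width_m: float,
                   target_display_width_m: float) -> float:
    """Scale factor for pixel disparities (or the virtual-view
    baseline) so that the physical on-screen disparity of content
    mastered for the reference display is preserved on the target."""
    return ref_display_width_m / target_display_width_m

# Hypothetical numbers: content mastered for a 1.2 m wide 3D HDTV,
# replayed on a 0.11 m wide smartphone screen.
print(baseline_scale(1.2, 0.11))  # ~10.9: disparities must grow ~11x
```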

As shown in Fig. 1, Scenario (a), after receiving the coded 3D video content and decoding the texture and depth pictures with an MVC + D decoder, the smartphone can perform DIBR to generate a view at a desirable location. The generated view, together with one of the transmitted views, can provide a more suitable viewing experience for the smartphone. This scenario can also be extended so that the user interactively selects the best 3D perception and the position of the virtual view changes in real time.

If a conventional multiview video codec, such as MVC, were used to transmit 3D video content for multiview auto-stereoscopic displays, a substantial bandwidth, proportional to the number of views in the display, would be required. As tens of views, such as 28 for the Dimenco glasses-free 3D 55” display [9], may be displayed simultaneously, 3D video content coded with, e.g., MVC for auto-stereoscopic displays may easily require more than 10 times the bandwidth of 2D video based on, e.g., H.264/AVC coding. Hence, bandwidth reduction for 3D video content covering a relatively large viewing range/angle is key to delivering 3D video content in real time without degrading the 3D perception.
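
A back-of-the-envelope calculation, using assumed numbers, shows the scale of the problem; the bitrate figures and the per-view cost factor below are our rough assumptions, not measured results:

```python
base_2d_mbps = 5.0         # assumed H.264/AVC bitrate for one 2D view
n_views = 28               # e.g., the glasses-free display cited above
extra_view_cost = 0.5      # assumed relative cost per non-base MVC view
mvc_mbps = base_2d_mbps * (1 + extra_view_cost * (n_views - 1))
mvcd_mbps = 2.0 * base_2d_mbps  # MVC + D rule of thumb from Section 5
print(f"MVC, {n_views} coded views: ~{mvc_mbps:.0f} Mbit/s "
      f"({mvc_mbps / base_2d_mbps:.1f}x 2D); "
      f"MVC + D plus DIBR: ~{mvcd_mbps:.0f} Mbit/s (2.0x 2D)")
```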

The MVD representation facilitates transmission of a small number of texture and depth views and rendering of more views on an auto-stereoscopic display at the decoding end. For example, up to three views may be transmitted and a number of views in between can be rendered based on the transmitted views, since DIBR makes it possible to interpolate a view at any location between two horizontally neighboring views [6]. As shown in Fig. 1, Scenario (b), a view synthesis module can take advantage of the MVD representation and generate as many virtual views as needed for the auto-stereoscopic display.
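
A sketch of the interpolation step, reusing the forward-warp sketch shown earlier: the two neighboring coded views are both warped to the target position and blended, with each view filling the other's disocclusions where possible (floating-point images are assumed):

```python
import numpy as np

def interpolate_view(warp_l, holes_l, warp_r, holes_r, alpha):
    """Blend two views already DIBR-warped to a position at fraction
    `alpha` (0 = left neighbor, 1 = right neighbor) between them."""
    out = (1.0 - alpha) * warp_l + alpha * warp_r
    only_r = holes_l & ~holes_r   # visible only in the right view
    only_l = holes_r & ~holes_l   # visible only in the left view
    out[only_r] = warp_r[only_r]
    out[only_l] = warp_l[only_l]
    # Pixels occluded in both views remain holes for inpainting.
    return out, holes_l & holes_r
```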

The simulation results in Section 5 indicate that MVC + D requires only roughly twice the bit rate of the 2D video coded with H.264/AVC (i.e., the bit rate of the base view) to transmit 2–3 views, each accompanied by depth, so that any number of views can be rendered on an auto-stereoscopic display.

In Scenarios (c)–(e) of Fig. 1, the coded 3D video bitstream is pruned to contain a subset of the texture and depth views according to the capabilities of the receiving device or according to user controls. For these scenarios, the 3D video scene is first captured and then encoded by an MVC + D encoder. Either 2 or 3 views may be coded, depending on whether any receivers are expected to have multiview auto-stereoscopic displays. A server transmits the coded bitstream(s) to clients with different capabilities, possibly through media gateways. Possible client devices include 2D HDTVs, 2D/3D switchable stereoscopic HDTVs, 3D-capable smartphones and HD auto-stereoscopic TVs. A media gateway is an intelligent device, also referred to as a media-aware network element (MANE), which may selectively forward the incoming video packets. At the final stage, the coded video is decoded and rendered by different means according to the application scenario and the capabilities of the receiver.
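
An illustrative sketch of the MANE pruning logic follows. The packet fields mirror the view identification carried in MVC/MVC + D NAL unit headers, but the class and field names are hypothetical simplifications, not the actual bitstream syntax:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    view_id: int     # which view the NAL unit belongs to
    is_depth: bool   # depth view component rather than texture
    payload: bytes = b""

def prune(packets, keep_views, keep_depth):
    """Selectively forward packets for one client. For example,
    keep_views={1, 3} with keep_depth=False yields the stereo MVC
    sub-bitstream of Scenario (c); a single base view without depth
    yields the H.264/AVC 2D stream of Scenario (d)."""
    return [p for p in packets
            if p.view_id in keep_views and (keep_depth or not p.is_depth)]
```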

Prior to displaying the 3D video content, the reconstructed MVD representation of the 3D video scene may be processed by a 3D renderer, as part of the client device, to generate one or more virtual views in real time. This process, namely view synthesis, is, like other post-processing operations, outside the scope of any existing video coding standard. Different devices may have different proprietary view synthesis solutions.

Currently the most popular 3D TVs are stereoscopic TVs, which display just two views. As shown in Fig. 1, Scenario (c), the MANE may forward only the packets belonging to the texture of two views. After receiving the MVC sub-bitstream, a stereoscopic TV with an MVC decoder decodes and displays just views 1 and 3.

Furthermore, as shown in Fig. 1, Scenario (d), an H.264/AVC 2D bitstream, decodable by the majority of 2D TVs, can be extracted from the MVC + D bitstream, so that a 2D version of the content can be enjoyed.

In Fig. 1, Scenario (e), a regular 2D HDTV is equipped with MVC + D decoding capability and a user interface (UI) that enables viewpoint selection. The UI can be based, for example, on a remote control, through which a user can manually select a viewpoint, or on head tracking, whereby the viewpoint is automatically selected based on the user’s position relative to the screen. If the viewpoint determined by the UI does not correspond to one of the transmitted views, the view synthesis module provides the virtual view based on the MVC + D content. The virtual view can have a viewing angle or horizontal position between those of any two transmitted views, and thus free-viewpoint video is realized [10].

This paper reviews the MVC + D standard, which specifies the encapsulation of texture views and depth views into the same bitstream. The mechanisms that extend the MVC features to depth views in MVC + D, as well as the operation points involving depth views, are described in this paper. The rest of this paper is organized as follows. Section 2 introduces MVC + D and the other currently ongoing standardization efforts for depth-enhanced multiview video coding and explains their differences and finalization timelines. In Section 3, we describe the fundamental normative features of MVC + D, such as the bitstream structure, while the optional features of the standard are presented in Section 4. Simulation results on the compression performance of MVC + D are presented in Section 5. Finally, Section 6 concludes this paper.

Section snippets

MPEG call for proposals and requirements of 3D video coding that enables decoder-side view synthesis

The Moving Picture Experts Group (MPEG) explored technologies related to the MVD format for several years and, after receiving evidence of sufficient technology advances, issued the final call for proposals (CfP) on 3D video (3DV) technology in March 2011 [11]. The requirements associated with this CfP are described in [12]. In this MPEG output document, several application scenarios, as described above, are listed. The requirements are categorized into three aspects: data format,

Structure of MVC + D bitstreams

The description of the MVC + D bitstream structure assumes that each view contains both a texture portion and a depth portion.

Similar to MVC, each coded representation of a view within one access unit (typically equivalent to one time instance) is a view component. However, in MVC + D, a view (identified by a unique view_id) may contain a texture view or a depth view or both of them with the same view_id; therefore, each view component consists of a texture view component and/or a depth view component.
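
In data-structure terms, the containment just described might be sketched as follows; the class names are illustrative only, since the standard defines these as bitstream concepts rather than data structures:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ViewComponent:
    view_id: int
    texture: Optional[bytes] = None  # coded texture view component
    depth: Optional[bytes] = None    # coded depth view component

@dataclass
class AccessUnit:
    """All view components of one time instance, in decoding order."""
    view_components: list = field(default_factory=list)
```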

Metadata on depth representation format and acquisition parameters

As explained above, the depth view components are coded as monochromatic pictures. In order to utilize the decoded depth views in view synthesis, the characteristics of the depth view components should be indicated similarly to MPEG-C Part 3 (see Section 1.1) and other metadata necessary or useful for view synthesis should be provided.
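
For illustration, a common convention in the 3DV community maps an 8-bit depth sample to metric depth by uniform quantization of inverse depth between the signaled near and far planes. The sketch below assumes that convention; it does not reproduce the exact SEI syntax, and z_near/z_far stand for the kind of depth-range metadata discussed above:

```python
def depth_sample_to_z(v: int, z_near: float, z_far: float) -> float:
    """Map a depth sample v in [0, 255] (255 = nearest) to metric depth
    Z, assuming uniform quantization of 1/Z between the signaled near
    and far planes."""
    inv_z = (v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

# depth_sample_to_z(255, z_near=2.0, z_far=10.0) ->  2.0 (nearest plane)
# depth_sample_to_z(0,   z_near=2.0, z_far=10.0) -> 10.0 (farthest plane)
```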

H.264/AVC bitstreams can include SEI messages containing metadata that is not required for the decoding process but may be useful for other purposes. The syntax

Compression performance

Similar to MVC, one advantage of MVC + D is that the standard does not introduce any new macroblock-level coding tools; thus, the hardware modules for H.264/AVC and/or MVC can be exploited to implement MVC + D without modifications. Hence, it is not necessary to compare the coding performance of MVC + D with, e.g., MVC. However, as it may be interesting to learn how the coded depth views affect the bitrate, we provide some simulation results in Table 1. According to the simulation results, MVC +

Conclusion

In this paper, we reviewed the key features of the MVC + D standard, which specifies a format for including coded depth views in the same bitstream as coded texture views and hence can be used to synthesize new views through depth-image-based rendering for optimized 3D viewing and/or advanced 3D video use cases. MVC + D is compatible with the MVC extension of the H.264/AVC standard and therefore MVC decoders can successfully decode the texture views of MVC + D bitstreams. Furthermore, MVC + D does not

References (29)

  • Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 and ISO/IEC 14496–10 (MPEG-4 AVC),...
  • Y. Chen, Y.-K. Wang, K. Ugur, M.M. Hannuksela, J. Lainema, M. Gabbouj, The emerging MVC standard for 3D video services,...
  • K. Müller, P. Merkle, H. Schwarz, T. Hinz, A. Smolic, T. Oelbaum, T. Wiegand, Multi-view video coding based on...
  • C. Fehn, Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, in: Proceedings...
  • Text of ISO/IEC FDIS 23002–3 Representation of Auxiliary Video and Supplemental Information, ISO/IEC JTC 1/SC 29/WG 11,...
  • P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, R. Tanger, Depth map creation and image based...
  • A. Smolic, K. Müller, P. Merkle, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, R. Tanger, P. Kauff, T. Wiegand, Z....
  • T. Shibata, J. Kim, D.M. Hoffman, M.S. Banks, The zone of comfort: predicting visual discomfort with stereo displays,...
  • ...
  • A. Smolic, P. Kauff, Interactive 3D video representation and coding technologies, in: Proceedings of IEEE, Special...
  • Call for Proposals on 3D Video Coding Technology, N12036, MPEG of ISO/IEC JTC1/SC29/WG11, Geneva, Switzerland, March...
  • Applications and Requirements on 3D Video Coding, N12035, MPEG of ISO/IEC JTC1/SC29/WG11, Geneva, Switzerland, March...
  • Requirements on Multi-view Video Coding v. 5, N7539, MPEG of ISO/IEC JTC1/SC29/WG11, Nice, France, October...
  • K. Müller, A. Vetro, V. Baroncini, AHG Report on 3D Video Coding, m21469, MPEG of ISO/IEC JTC1/SC29/WG11, Geneva,...