Multi-view video based multiple objects segmentation using graph cut and spatiotemporal projections

doi:10.1016/j.jvcir.2009.09.005

Journal of Visual Communication and Image Representation

Volume 21, Issues 5–6, July–August 2010, Pages 453-461

https://doi.org/10.1016/j.jvcir.2009.09.005

Abstract

In this paper, we present an automatic algorithm to segment multiple objects from multi-view video. The Initial Interested Objects (IIOs) are automatically extracted in the key view of the initial frame based on a saliency model. Multiple-object segmentation is decomposed into several sub-segmentation problems, each solved by minimizing an energy function using binary-label graph cut. In the proposed energy function, color and depth cues are integrated into the data term, which is then modified by a background penalty with occlusion reasoning. In the smoothness term, foreground contrast enhancement is developed to strengthen the boundaries of moving objects while attenuating the background contrast. To segment the multi-view video, coarse predictions for the other views and for the successive frame are projected by pixel-based disparity compensation and motion compensation, respectively, which exploits the inherent spatiotemporal consistency. An uncertain band along the object boundary is shaped based on an activity measure and refined with graph cut, resulting in a more accurate Interested Objects (IOs) layer across all views and frames. The experiments are carried out on a couple of multi-view videos with real and complex scenes. Excellent subjective results show the robustness and efficiency of the proposed algorithm.

Introduction

In recent decades, image/video segmentation has become an active research topic in image processing, computer vision and computer graphics, leading to significant breakthroughs in the development of its theories and technologies. Robust and accurate separation of foreground objects from the background has turned out to be a crucial prerequisite for many applications, such as face segmentation in videotelephony [1], video object cut for pasting [2], and 3D modeling and reconstruction by joint segmentation [3]. Current segmentation methods can be categorized into two groups: region-based segmentation and boundary-based segmentation. Region-based segmentation methods aim to directly construct the region itself, while boundary-based segmentation methods represent each region by its boundary. Classical region-based segmentation methods include mean-shift [4], region growing [5], and graph partition (graph cut [6], grab cut [7]), as well as popular image cutout tools such as Magic Wand in Photoshop. Active contour (snake) [8], level set [9] and GVF [10] are the representative approaches for boundary-based segmentation. Lazy snapping [11] designs a novel user interface for image cutout by inheriting the advantages of both region-based and boundary-based methods.

Most of the interest has been focused on single-view segmentation, and many advanced algorithms have emerged [12], [13], [14], [15]. In contrast, multiple-view segmentation has not attracted much attention, due to the limitations of image capturing technology and the difficulty of segmenting all the images simultaneously in real time. However, multi-view images capturing the real-world environment from arbitrary viewpoints are capable of describing a dynamic scene from different angles and can provide the observer with a more vivid and extensive viewing experience than a single-view image, resulting in a more realistic and exciting visual effect. Additionally, depth information in the 3D scene can be reconstructed from multi-view images and assists in characterizing the visual objects more efficiently than the conventional 2D representation. Furthermore, efficient segmentation of IOs plays an important role in many multi-view applications, such as image-based rendering and 3D object model reconstruction. In image-based rendering, multi-view images are available for good visual rendering quality. The end-users may desire to render only the IOs instead of the whole scene, which makes accurate segmentation of the objects desirable. For 3D object model reconstruction, integrating the 2D images captured from different views to reconstruct the 3D object model is a challenging problem. The first task is the efficient removal of the background from these objects.

With the recent growing capability of capturing devices, multi-view capturing systems with dense or sparse camera arrays [16], [17] can be built with ease, which motivates the development of multi-view techniques and their related applications. A multi-view image segmentation algorithm proposed in [18] aims to segment the foreground object from a collection of 2D images taken from different viewpoints for 3D object reconstruction. It incorporates some useful and well-known algorithms, including graph cut image segmentation, volumetric graph cut and learning shape priors. Quan et al. [19] investigated the issue of image-based plant modeling. They proposed a plant modeling system for generating 3D models of natural-looking plants from a number of images captured by a hand-held camera from different views. Segmenting the leaves of a plant is a tough problem because of the occlusion and the similarity of color between different overlapping leaves. In their approach, the leaf segmentation problem is formulated as a graph-based optimization aided by 3D and 2D information. To reconstruct the 3D geometry of a static scene, an algorithm in [20] simultaneously deals with depth map estimation and background separation in a multi-view setting with several calibrated cameras. By exploiting the strong interdependency of the two problems and minimizing a discrete energy functional using graph cut, this combined approach yields more accurate depth estimates and better background separation on both real-world and synthetic scenes. The state-of-the-art work for bi-layer segmentation of stereo video sequences is presented in [21]. By probabilistic fusion of stereo, color and contrast cues, it efficiently separates the foreground from the background layer in real time, and is successfully applied to background substitution.

Section snippets

Overview of the proposed framework

In this paper, we propose an automatic and efficient algorithm to segment multiple objects from multi-view video. Fig. 1 shows the algorithm framework, composed of three components: data pre-processing, offline operations and online segmentation. We built a five-view camera system to capture the multi-view video data. Given the multi-view image sets $I_{t}^{v}$ captured at time instances $t$ from five different views $v \in \{0, 1, 2, 3, 4\}$, the objective is to obtain the labeling field $f_{t}^{v}$. After data acquisition,

Multiple objects segmentation for key view

In computer vision, image segmentation can generally be formulated as an energy minimization problem. Graph cut, as a powerful energy minimization tool, has been widely used for solving many related vision and graphics problems with great success, such as stereo matching [23], multi-view reconstruction [24] and texture synthesis [25]. With its efficiency in segmentation as demonstrated by Boykov and Jolly [6], graph cut has generated extensive interest for image segmentation and spawned many
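To illustrate the general idea of binary-label energy minimization by graph cut, the following is a minimal sketch and not the energy function proposed in the paper. It assumes a generic energy of the form $E(f) = \sum_p D_p(f_p) + \lambda \sum_{(p,q)} [f_p \neq f_q]$ on a one-dimensional row of pixels, with hand-picked data costs; the color, depth, background penalty and contrast terms of the paper are not modeled. The names `min_cut_labels`, `cost_bg`, `cost_fg` and `smooth` are hypothetical, and the maximum flow is computed with a plain breadth-first augmenting-path search rather than a specialized solver.

```python
from collections import deque


def min_cut_labels(cost_bg, cost_fg, smooth):
    """Label a 1-D row of pixels as foreground (1) or background (0).

    cost_bg[p] / cost_fg[p]: assumed data cost of labelling pixel p as
    background / foreground. smooth: penalty for each pair of neighbouring
    pixels that receive different labels. Returns (labels, minimum energy).
    """
    n = len(cost_bg)
    s, t = n, n + 1                      # source = foreground, sink = background
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)          # reverse residual edge

    for p in range(n):
        add_edge(s, p, cost_bg[p])       # cut when p ends up background
        add_edge(p, t, cost_fg[p])       # cut when p ends up foreground
        if p + 1 < n:                    # cut when neighbours disagree
            add_edge(p, p + 1, smooth)
            add_edge(p + 1, p, smooth)

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            break                        # no augmenting path left
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Pixels still connected to the source are on the foreground side.
    labels = [1 if p in parent else 0 for p in range(n)]
    return labels, flow


def energy(labels, cost_bg, cost_fg, smooth):
    """Evaluate the assumed energy for a given labelling."""
    data = sum(cost_fg[p] if f else cost_bg[p] for p, f in enumerate(labels))
    pairwise = sum(smooth for a, b in zip(labels, labels[1:]) if a != b)
    return data + pairwise


cost_fg = [1, 1, 5, 1, 1]   # pixel 2 looks slightly like background
cost_bg = [9, 9, 4, 9, 9]
labels_smooth, e_smooth = min_cut_labels(cost_bg, cost_fg, smooth=2)
labels_plain, e_plain = min_cut_labels(cost_bg, cost_fg, smooth=0)
print(labels_smooth, e_smooth)
print(labels_plain, e_plain)
```

With the smoothness penalty switched off, the ambiguous middle pixel follows its data cost and is labelled background; with the penalty enabled, the isolated pixel is absorbed into the foreground because two label changes cost more than the small data-cost difference. The value of the maximum flow equals the minimum of the assumed energy, which `energy` can be used to confirm.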

Multi-view video segmentation

In the above work, we have dealt with segmentation in a single key view of the initial frame. In many applications, accurate object segmentation is required for all views and all frames of a video. In this section, we extend the segmentation algorithm to multi-view video.
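As a rough illustration of how a segmentation can be carried over to another view or to the next frame, the sketch below forward-warps a binary object mask by a per-pixel integer displacement field. It is a simplified assumption-laden example, not the paper's procedure: the function name `project_mask`, the integer-valued fields, and the sign convention (a displacement is added to the source coordinate) are all hypothetical, and holes or occlusions in the warped mask are not handled. Such a projection would only yield a coarse prediction that still needs refinement along the object boundary.

```python
def project_mask(mask, shift_x, shift_y=None):
    """Forward-warp a binary mask by per-pixel integer displacements.

    mask:    2-D list of 0/1 values (1 = object pixel).
    shift_x: 2-D list of horizontal displacements; for a neighbouring view
             this plays the role of the disparity map (assumed sign: +x).
    shift_y: optional 2-D list of vertical displacements; together with
             shift_x it plays the role of a motion field between frames.
    Pixels that land outside the image are dropped.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            nx = x + shift_x[y][x]
            ny = y + (shift_y[y][x] if shift_y is not None else 0)
            if 0 <= nx < w and 0 <= ny < h:
                out[ny][nx] = 1
    return out


# Object mask in the key view of the current frame (toy 3 x 5 example).
key_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# Spatial projection: constant disparity of 2 pixels towards another view.
disparity = [[2] * 5 for _ in range(3)]
other_view = project_mask(key_mask, disparity)

# Temporal projection: motion of (-1, +1) pixels towards the next frame.
motion_x = [[-1] * 5 for _ in range(3)]
motion_y = [[1] * 5 for _ in range(3)]
next_frame = project_mask(key_mask, motion_x, motion_y)

for row in other_view:
    print(row)
for row in next_frame:
    print(row)
```

In this toy example the object block moves two columns to the right in the other view, and one column left and one row down in the next frame. With real disparity and motion fields the displacements vary per pixel, so the warped mask can contain gaps and overlaps near depth discontinuities.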

Experimental results

The efficiency and robustness of the proposed algorithm are demonstrated on two types of multi-view videos simulating different scenarios, which were captured by our five-view camera system in indoor scenes, with a resolution of 640 × 480 at a frame rate of 30 frames per second (fps).

Conclusions

In this paper, we propose an automatic segmentation algorithm for multiple objects from multi-view video. After data pre-processing, offline operations are carried out to yield the motion and disparity information that facilitates the online segmentation. IIOs are extracted in an unsupervised manner in the key view of the initial frame based on the saliency model, where a single topological saliency map is calculated by combining motion and depth information. Multiple objects segmentation is decomposed

Acknowledgment

This work was supported in part by the Research Grants Council of the Hong Kong SAR (Project CUHK415707).

References (37)

S. Osher et al., Fronts propagating with curvature-dependent speed: algorithms based on Hamilton–Jacobi formulations, Journal of Computational Physics (1988)

Dongbo Min et al., 2D/3D freeview video generation for 3DTV system, Signal Processing: Image Communication (2009)

D. Chai et al., Face segmentation using skin color map in videophone applications, IEEE Transactions on Circuits and Systems for Video Technology (1999)

Y. Li et al., Video object cut and paste, ACM Transactions on Graphics (2005)

L. Quan et al., Image-based modeling by joint segmentation, International Journal of Computer Vision (2007)

D. Comaniciu et al., Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)

L.G. Shapiro et al., Computer Vision (2001)

Y. Boykov, M.P. Jolly, Interactive graph cuts for optimal boundary and region segmentation of objects in N–D images,...

C. Rother et al., Grabcut: interactive foreground extraction using iterated graph cuts, ACM Transactions on Graphics (2004)

M. Kass et al., Snakes: active contour models, International Journal of Computer Vision (1987)

C. Xu et al., Snakes, shapes, and gradient vector flow, IEEE Transactions on Image Processing (1998)

Y. Li et al., Lazy snapping, ACM Transactions on Graphics (2004)

G. Sfikas, C. Nikou, N. Galatsanos, Edge preserving spatially varying mixtures for image segmentation, in: Proceedings...

V. Lempitsky, C. Rother, A. Blake, LogCut – efficient graph cut optimization for Markov random fields, in: Proceedings...

Y.W. Tai et al., Soft color segmentation and its applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)

Y.C. Huang, Q.S. Liu, D. Metaxas, Video object segmentation by hypergraph cut, in: Proceedings of the IEEE Conference...

A. Kubota et al., Multi-view imaging and 3DTV, IEEE Signal Processing Magazine (2007)

C.L. Zitnick et al., High-quality video view interpolation using a layered representation, ACM Transactions on Graphics (2004)

Cited by (19)

Unsupervised visual hull extraction in space, time and light domains
2014, Computer Vision and Image Understanding
Citation Excerpt :
For the case where only a single image is available, an initialization procedure such as tri-map or rectangle is normally used [16,20,21]. In the case where several images or a video sequence are available, initialization based on fixation point, background subtraction or stereo is used [18,22–24]. This initialization step allows the construction of object and background prior likelihoods based on a color model such as a Gaussian Mixture Model (GMM) or intensity histograms.
This paper presents an unsupervised image segmentation approach for obtaining a set of silhouettes along with the visual hull (VH) of an object observed from multiple viewpoints. The proposed approach can deal with mostly any type of appearance characteristics such as texture, similar background color, shininess, transparency besides other phenomena such as shadows and color bleeding. Compared to more classical methods for silhouette extraction from multiple views, for which certain assumptions are made on the object or scene, neither the background nor the object appearance properties are modeled. The only assumption is the constancy of the unknown background for a given camera viewpoint while the object is under motion. The principal idea of the method is the estimation of the temporal evolution of each pixel over time which provides a stability measurement and leads to its associated background likelihood. In order to cope with shadows and self-shadows, an object is captured under different lighting conditions. Furthermore, the information from the space, time and lighting domains is exploited and merged based on a MRF framework and the constructed energy function is minimized via graph cut. Experiments are performed on a light stage where the object is set on a turntable and is observed from calibrated viewpoints on a hemisphere around the object. Real data experiments show that the proposed approach allows for robust and efficient VH reconstruction of a variety of challenging objects.
Saliency detection using joint spatial-color constraint and multi-scale segmentation
2013, Journal of Visual Communication and Image Representation
Citation Excerpt :
Two mechanisms are believed for attention deployment: the bottom-up, rapid, pre-attentive and stimulus-driven manner as well as the top-down, slower, attentive and task-dependent manner [3–5]. Visual attention is of widespread interest due to a large number of applications, including adaptive image/video compression [6–9], object-of-attention image segmentation [10–13], object recognition [14,15], surveillance [16], smart image retargeting [17,18], image/video retrieval and summary [18–20]. Visual saliency is the perceptual quality that makes an object visually different to its neighborhoods and grabs our attention [21].
In this paper, a novel method is proposed to detect salient regions in images. To measure pixel-level saliency, joint spatial-color constraint is defined, i.e., spatial constraint (SC), color double-opponent (CD) constraint and similarity distribution (SD) constraint. The SC constraint is designed to produce global contrast with ability to distinguish the difference between “center and surround”. The CD constraint is introduced to extract intensive contrast of red-green and blue-yellow double opponency. The SD constraint is developed to detect the salient object and its background. A two-layer structure is adopted to merge the SC, CD and SD saliency into a saliency map. In order to obtain a consistent saliency map, the region-based saliency detection is performed by incorporating a multi-scale segmentation technique. The proposed method is evaluated on two image datasets. Experimental results show that the proposed method outperforms the state-of-the-art methods on salient region detection as well as human fixation prediction.
On multi-view video segmentation for object-based coding
2012, Digital Signal Processing: A Review Journal
A novel scheme for multi-view segmentation and tracking is proposed aiming to acquire perceptually consistent results for object-based coding. Firstly, a classic image segmentation technique is employed to perform initial segmentation to divide the whole image into spatially homogeneous regions. Secondly, the motion information is extracted based on frame differences and the disparity information is derived by employing a classic disparity estimation technique. Thirdly, a novel scheme is proposed to perform merging of the initial segmentation results based on both motion and disparity information to remove over-segmented regions and extract perceptually consistent semantic objects. Finally, a contour-based tracking algorithm is proposed to implement accurate and robust object tracking along both temporal and view directions. Experiments are conducted and the results demonstrate that the proposed scheme is effective and, compared with the existing technique, it can acquire more perceptually consistent results.
Automatic body segmentation with graph cut and self-adaptive initialization level set (SAILS)
2011, Journal of Visual Communication and Image Representation
Citation Excerpt :
For the multi-view video scenario, object segmentation works very accurately and efficiently for a probabilistic fusion of multiple cues, i.e., depth, color, and contrast. Using the multi-camera, Zhang [13] employs a visual attention saliency map which is generated from depth, motion and wavelet features to initialize graph cut segmentation. In [14], a stereo-matching technology is proposed to enhance the segmented result and quantitatively evaluated by comparison with the ground-truth.
In this paper, we propose an automatic human body segmentation system which mainly consists of human body detection and object segmentation. Firstly, an automatic human body detector is designed to provide hard constraints on the object and background for segmentation. And a coarse-to-fine segmentation strategy is employed to deal with the situation of partly detected object. Secondly, background contrast removal (BCR) and self-adaptive initialization level set (SAILS) are proposed to solve the tough segmentation problems of the high contrast at object boundary and/or similar colors existing in the object and background. Finally, an object updating scheme is proposed to detect and segment new object when it appears in the scene. Experimental results demonstrate that our body segmentation system works very well in the live video and standard sequences with complex background.
Automatic moving foreground extraction using random walks
2019, Indonesian Journal of Electrical Engineering and Computer Science
Automatic motion segmentation using random walks
2017, ACM International Conference Proceeding Series
