Multi-view video based multiple objects segmentation using graph cut and spatiotemporal projections

https://doi.org/10.1016/j.jvcir.2009.09.005

Abstract

In this paper, we present an automatic algorithm to segment multiple objects from multi-view video. The Initial Interested Objects (IIOs) are automatically extracted in the key view of the initial frame based on the saliency model. Multiple-object segmentation is decomposed into several sub-segmentation problems, each solved by minimizing an energy function using binary-label graph cut. In the proposed energy function, color and depth cues are integrated into the data term, which is then modified by a background penalty with occlusion reasoning. In the smoothness term, foreground contrast enhancement is developed to strengthen the boundaries of moving objects while attenuating the background contrast. To segment the multi-view video, the coarse predictions for the other views and for the successive frame are projected by pixel-based disparity and motion compensation, respectively, exploiting the inherent spatiotemporal consistency. An uncertain band along the object boundary is shaped based on an activity measure and refined with graph cut, resulting in a more accurate Interested Objects (IOs) layer across all views of the frames. The experiments are carried out on several multi-view videos with real and complex scenes. The subjective results demonstrate the robustness and efficiency of the proposed algorithm.

Introduction

In recent decades, image/video segmentation has become an active research topic in image processing, computer vision and computer graphics, leading to significant breakthroughs in the development of its theories and technologies. Robust and accurate separation of foreground objects from the background has turned out to be a crucial prerequisite for many applications such as face segmentation in videotelephony [1], video object cut for pasting [2], and 3D modeling and reconstruction by joint segmentation [3]. Current segmentation methods can be categorized into two groups: region-based segmentation and boundary-based segmentation. Region-based segmentation methods aim to directly construct the region itself, while boundary-based segmentation methods represent each region by its boundary. Classical region-based segmentation methods include mean-shift [4], region growing [5], and graph partition (graph cut [6], grab cut [7]), as well as popular image cutout tools such as Magic Wand in Photoshop. Active contour (snake) [8], level set [9] and GVF [10] are the representative approaches for boundary-based segmentation. Lazy snapping [11] provides a novel user interface for image cutout that inherits the advantages of both region-based and boundary-based methods.

Most of the interest has been focused on single-view segmentation, for which many advanced algorithms have emerged [12], [13], [14], [15]. In contrast, multiple-view segmentation has not attracted much attention, owing to the limitations of image capturing technology and the difficulty of segmenting all the images simultaneously in real-time. However, multi-view images capturing the real-world environment from arbitrary viewpoints are capable of describing a dynamic scene from different angles and can provide the observer with a more vivid and extensive viewing experience than a single-view image, resulting in a more realistic and exciting visual effect. Additionally, depth information in the 3D scene can be reconstructed from multi-view images and assists in characterizing the visual objects more efficiently than the conventional 2D representation. Furthermore, efficient segmentation of IOs plays an important role in many multi-view applications, such as image-based rendering and 3D object model reconstruction. In image-based rendering, multi-view images are available for good visual rendering quality. The end-users may wish to render only the IOs instead of the whole scene, which makes accurate segmentation of the objects desirable. For 3D object model reconstruction, integrating the 2D images captured from different views to reconstruct the 3D object model is a challenging problem, and the first task is the efficient removal of the background from these objects.

With the recent growing capability of capturing devices, multi-view capturing systems with dense or sparse camera arrays [16], [17] can be built with ease, which motivates the development of multi-view techniques and their related applications. A multi-view image segmentation algorithm proposed in [18] aims to segment the foreground object from a collection of 2D images taken from different viewpoints for 3D object reconstruction. It incorporates several useful and well-known algorithms including graph cut image segmentation, volumetric graph cut and learning shape priors. Quan et al. [19] investigated the issue of image-based plant modeling. They proposed a plant modeling system for generating 3D models of natural-looking plants from a number of images captured by a hand-held camera from different views. Segmenting the leaves of a plant is a tough problem because of occlusion and the similarity of color between different overlapping leaves. In their approach, the leaf segmentation problem is formulated as a graph-based optimization aided by 3D and 2D information. To reconstruct the 3D geometry of a static scene, an algorithm in [20] simultaneously deals with depth map estimation and background separation in a multi-view setting with several calibrated cameras. By exploiting the strong interdependency of the two problems and minimizing a discrete energy functional using graph cut, this combined approach yields more accurate depth estimates and better background separation on both real-world and synthetic scenes. The state-of-the-art work for bi-layer segmentation of stereo video sequences is presented in [21]. By probabilistic fusion of stereo, color and contrast cues, it efficiently separates the foreground from the background layer in real-time, and is successfully applied to background substitution.

Overview of the proposed framework

In this paper, we propose an automatic and efficient algorithm to segment multiple objects from multi-view video. Fig. 1 shows the algorithm framework, which is composed of three components: data pre-processing, offline operations and online segmentation. We built a five-view camera system to capture the multi-view video data. Given the multi-view image sets $I_t^v$ captured at time instant $t$ from five different views $v \in \{0,1,2,3,4\}$, the objective is to obtain the labeling field $f_t^v$. After data acquisition,

Multiple objects segmentation for key view

In computer vision, image segmentation can generally be formulated as an energy minimization problem. Graph cut, as a powerful energy minimization tool, has been widely used for solving many related vision and graphics problems with great success, such as stereo matching [23], multi-view reconstruction [24] and texture synthesis [25]. With its efficiency in segmentation as demonstrated by Boykov and Jolly [6], graph cut has generated extensive interest for image segmentation and spawned many
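As an illustration of the energy minimization formulation mentioned above, the following minimal sketch labels a one-dimensional row of pixels as foreground or background by solving an s-t minimum cut. It is not the energy function proposed in this paper: the function name `min_cut_labels`, the integer cost values, the constant smoothness weights and the simple Edmonds-Karp max-flow routine are all assumptions chosen for illustration. The energy being minimized is the generic sum of per-pixel data costs plus a penalty for each pair of neighbouring pixels that receive different labels.

```python
from collections import deque


def min_cut_labels(data_cost, smooth_weight):
    """Binary labeling of a 1-D pixel row by s-t min cut (Edmonds-Karp).

    data_cost[p]     = (cost of labeling p background, cost of labeling p foreground)
    smooth_weight[p] = penalty paid if pixels p and p+1 receive different labels
    Returns a list of labels: 1 = foreground, 0 = background.
    All names and conventions here are illustrative assumptions.
    """
    n = len(data_cost)
    s, t = n, n + 1                       # source = foreground, sink = background
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)           # make sure the reverse edge exists

    for p, (cost_bg, cost_fg) in enumerate(data_cost):
        add_edge(s, p, cost_bg)  # cut (paid) when p ends on the background side
        add_edge(p, t, cost_fg)  # cut (paid) when p stays on the foreground side
    for p, w in enumerate(smooth_weight):
        add_edge(p, p + 1, w)    # paid when neighbours are separated by the cut
        add_edge(p + 1, p, w)

    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Pixels still reachable from the source lie on the foreground side.
    return [1 if p in parent else 0 for p in range(n)]


if __name__ == "__main__":
    costs = [(0, 9), (1, 8), (5, 4), (9, 1), (9, 0)]
    print(min_cut_labels(costs, [2, 2, 2, 2]))
```

In this toy setting the smoothness weights control how readily an isolated, weakly supported pixel is absorbed by its neighbours; a practical implementation on full images would use a dedicated max-flow library rather than this small routine.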

Multi-view video segmentation

In the above work, we have dealt with segmentation in the single key view of the initial frame. In many applications, accurate object segmentation is required for all views and all frames of a video. In this section, we extend the segmentation algorithm to multi-view video.
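The abstract describes projecting a coarse prediction from the key view to the other views by pixel-based disparity and then refining an uncertain band along the object boundary. The sketch below illustrates only the general idea under strong simplifying assumptions: the views are taken to be rectified so that disparity is a purely horizontal integer shift, the sign convention (target column = source column minus disparity) is assumed, and the band has a fixed width instead of being shaped by the activity measure used in the paper. The function names `project_mask` and `uncertain_band` are hypothetical.

```python
def project_mask(mask, disparity):
    """Forward-warp a binary key-view mask into a neighbouring view.

    mask[y][x] is 1 for object pixels; disparity[y][x] is the assumed
    integer horizontal shift of that pixel in the target view.
    """
    height, width = len(mask), len(mask[0])
    projected = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                tx = x - disparity[y][x]  # assumed sign convention
                if 0 <= tx < width:       # drop pixels leaving the image
                    projected[y][tx] = 1
    return projected


def uncertain_band(mask, radius=1):
    """Mark pixels whose (2*radius+1)-wide window contains both labels.

    This fixed-width band is a stand-in for the activity-based band
    described in the paper; only these pixels would be re-labeled
    in a subsequent refinement step.
    """
    height, width = len(mask), len(mask[0])
    band = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                mask[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            if min(window) != max(window):
                band[y][x] = 1
    return band


if __name__ == "__main__":
    key_mask = [[0, 0, 1, 1, 0, 0]]
    disp = [[0, 0, 2, 2, 0, 0]]
    coarse = project_mask(key_mask, disp)
    print(coarse)
    print(uncertain_band(coarse))
```

Restricting refinement to the band keeps the per-view optimization small, since pixels far from the projected boundary keep their projected labels.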

Experimental results

The efficiency and robustness of the proposed algorithm are demonstrated on two types of multi-view videos simulating different scenarios, which were captured by our five-view camera system in indoor scenes, with a resolution of 640 × 480 at a frame rate of 30 frames per second (fps).

Conclusions

In this paper, we propose an automatic segmentation algorithm for multiple objects from multi-view video. After data pre-processing, offline operations are carried out to yield motion and disparity information that facilitates the online segmentation. IIOs are extracted in an unsupervised manner in the key view of the initial frame based on the saliency model, where a single topological saliency map is calculated by combining motion and depth information. Multiple objects segmentation is decomposed

Acknowledgment

This work was supported in part by the Research Grants Council of the Hong Kong SAR (Project CUHK415707).

References (37)

  • S. Osher et al., Fronts propagating with curvature-dependent speed: algorithms based on Hamilton–Jacobi formulations, Journal of Computational Physics (1988)
  • D. Min et al., 2D/3D freeview video generation for 3DTV system, Signal Processing: Image Communication (2009)
  • D. Chai et al., Face segmentation using skin color map in videophone applications, IEEE Transactions on Circuits and Systems for Video Technology (1999)
  • Y. Li et al., Video object cut and paste, ACM Transactions on Graphics (2005)
  • L. Quan et al., Image-based modeling by joint segmentation, International Journal of Computer Vision (2007)
  • D. Comaniciu et al., Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
  • L.G. Shapiro et al., Computer Vision (2001)
  • Y. Boykov, M.P. Jolly, Interactive graph cuts for optimal boundary and region segmentation of objects in N–D images,...
  • C. Rother et al., Grabcut: interactive foreground extraction using iterated graph cuts, ACM Transactions on Graphics (2004)
  • M. Kass et al., Snakes: active contour models, International Journal of Computer Vision (1987)
  • C. Xu et al., Snakes, shapes, and gradient vector flow, IEEE Transactions on Image Processing (1998)
  • Y. Li et al., Lazy snapping, ACM Transactions on Graphics (2004)
  • G. Sfikas, C. Nikou, N. Galatsanos, Edge preserving spatially varying mixtures for image segmentation, in: Proceedings...
  • V. Lempitsky, C. Rother, A. Blake, LogCut – efficient graph cut optimization for Markov random fields, in: Proceedings...
  • Y.W. Tai et al., Soft color segmentation and its applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
  • Y.C. Huang, Q.S. Liu, D. Metaxas, Video object segmentation by hypergraph cut, in: Proceedings of the IEEE Conference...
  • A. Kubota et al., Multi-view imaging and 3DTV, IEEE Signal Processing Magazine (2007)
  • C.L. Zitnick et al., High-quality video view interpolation using a layered representation, ACM Transactions on Graphics (2004)