2-D mesh-based video object segmentation and tracking with occlusion resolution☆
Introduction
Object-based video representation requires segmenting the scene into video objects through robust boundary tracking, together with classification of uncovered regions as either parts of existing objects or new objects. MPEG-4 has recently emerged as a popular object-based video standard, but it specifies no normative segmentation method [5], [13]. Clearly, the utility of the standard's object-based tools for compression and manipulation of traditional video depends on the quality of the segmentation and discrimination of video objects.
Video object segmentation methods can be classified as either two-frame motion/object segmentation methods or multi-frame spatio-temporal segmentation/tracking methods. Among the former are region-based parametric motion segmentation methods [1], [2] and clustering of pixel eigenfeature vectors using the fuzzy c-means method [3]. Among the latter are blob-tracking algorithms such as P-finder [15], contour-tracking methods such as the condensation algorithm [7] and the occlusion-adaptive motion snake [6], and methods based on finding best matches of object models in the contour maps of successive frames [8]. Several 2-D mesh-based object tracking methods [4], [11], [12], [14] have also been proposed; however, they assumed that the initial object boundary was marked interactively. Region- or mesh-based methods have in general been shown to be more robust than pixel-based segmentation methods [1], [11]. In this paper, we propose a unified 2-D mesh-based approach for fully automatic video object segmentation and tracking, which fuses node-based motion and triangle-based color information instead of using a pixel-based approach.
At the first frame, a number of feature points are selected as nodes of a coarse 2-D content-based mesh. These points are classified as foreground and background nodes based on node motion analysis over the next N frames, yielding a coarse estimate of the foreground object boundary. Color differences across triangles near the coarse boundary are exploited in a maximum contrast path search, subject to search control constraints, along the edges of the 2-D mesh to refine the boundary of the video object. Next, we propagate this refined boundary to the subsequent frame by using the motion vectors of the node points to form the coarse boundary at the next frame, which is then refined by the same maximum contrast path search. Because motion estimation cannot be perfect for all nodes and there may be occlusion regions, the 2-D mesh topology needs to be updated [4] at certain locations. The boundaries of newly uncovered regions are then refined using the same 2-D mesh topology and search mechanism. These regions are re-meshed and either appended to the foreground object or tracked as new objects. The segmentation procedure is re-initialized when the detected occlusion regions exceed a given percentage of the video object area.
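The node classification step above can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: it assumes node trajectories over the N analysis frames are available (e.g., from block matching at the node points) and uses accumulated displacement as the motion-activity measure; the threshold value is a hypothetical parameter.

```python
import numpy as np

def classify_nodes(trajectories, motion_threshold=2.0):
    """Classify mesh nodes as foreground (moving) or background (static)
    from their accumulated displacement over N frames.

    trajectories : array of shape (num_nodes, N+1, 2), the (x, y) position
                   of each node in frames 0..N.
    Returns a boolean array, True for foreground nodes.
    """
    # Frame-to-frame displacement vectors for each node.
    displacements = np.diff(trajectories, axis=1)            # (nodes, N, 2)
    # Accumulated path length over N frames acts as a simple multi-frame
    # motion filter, suppressing spurious single-frame matching noise.
    path_length = np.linalg.norm(displacements, axis=2).sum(axis=1)
    return path_length > motion_threshold
```

The convex hull (or boundary triangles) of the nodes classified as foreground then provides the coarse object boundary to be refined.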
The organization of the paper is as follows. The proposed 2-D mesh-based segmentation method, formulated as a constrained maximum contrast path search problem, is discussed in Section 2. Section 3 presents a 2-D mesh tracking method with occlusion detection and mesh update. Experimental results are given in Section 4. Conclusions and future directions are discussed in Section 5.
2-D mesh-based video object segmentation with multi-frame motion filtering
This section presents a coarse-to-fine hierarchical 2-D mesh-based video object segmentation algorithm. First, a coarse boundary of the video object is estimated based on feature (node) point selection and multi-frame node motion analysis, as discussed in Section 2.1. Next, refinement of this coarse boundary is formulated as a constrained maximum contrast path search based on node point motion vectors and the colors within the triangles, as explained in Section 2.2.
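One simple realization of the maximum contrast path search is a shortest-path search over the mesh graph in which each edge is weighted inversely to the color contrast between the two triangles it separates, so that a minimum-cost path favors high-contrast edges. The sketch below (an assumption, not the paper's exact constrained formulation, which additionally restricts the search region near the coarse boundary) uses plain Dijkstra:

```python
import heapq

def max_contrast_path(edges, contrast, start, goal, eps=1e-6):
    """Find a boundary segment from start to goal along mesh edges that
    favors high color contrast, via Dijkstra with edge cost 1/(contrast+eps).

    edges    : dict node -> list of neighboring nodes in the mesh
    contrast : dict frozenset({u, v}) -> color contrast across edge (u, v)
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v in edges.get(u, []):
            # Low cost for high-contrast edges; eps avoids division by zero.
            nd = d + 1.0 / (contrast[frozenset((u, v))] + eps)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the node sequence of the refined boundary segment.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

In the paper's setting, the search control constraints would restrict `edges` to mesh edges within a band around the coarse boundary.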
2-D mesh tracking with occlusion and new object detection
This section deals with the tracking of the refined boundary from the previous frame to the current frame in the presence of self-occlusion (out-of-plane rotation or articulated motion) and object-to-object occlusion. The tracking algorithm includes three main steps: uncovered-region detection, classification of occlusion type, and boundary refinement. We consider only occlusions that relate to the object boundary; that is, we do not attempt to detect self-occlusions that lie completely within the object.
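The uncovered-region detection and re-initialization test described above can be sketched as follows. This is a simplified illustration under assumed inputs: each triangle's motion-compensated prediction error (mean absolute color difference after warping the previous frame by the node motion) and its pixel area; both threshold values are hypothetical parameters.

```python
import numpy as np

def detect_occlusion(prediction_error, triangle_area,
                     error_threshold=20.0, reinit_fraction=0.25):
    """Flag mesh triangles with high motion-compensated prediction error
    (likely occlusion/uncovered regions), and decide whether to
    re-initialize the segmentation.

    prediction_error : per-triangle mean absolute color error after warping
    triangle_area    : per-triangle area in pixels
    Returns (occluded mask, re-initialize flag).
    """
    prediction_error = np.asarray(prediction_error, dtype=float)
    triangle_area = np.asarray(triangle_area, dtype=float)
    occluded = prediction_error > error_threshold
    # Re-initialize when the detected occlusion regions exceed a given
    # percentage of the video object area.
    occluded_share = triangle_area[occluded].sum() / triangle_area.sum()
    return occluded, occluded_share > reinit_fraction
```

Flagged triangles would then be re-meshed and either appended to the foreground object or tracked as new objects, as described in the introduction.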
Experimental results
We demonstrate the proposed segmentation and tracking method on three sequences: frames 20–100 of “Mother and Daughter”, frames 40–75 of “Hall”, and frames 8–19 of “Hamburg Taxi”, with frame increments of 2, 2 and 1, respectively. The 20th, 40th and 8th frames of these sequences are shown in Fig. 3, Fig. 4, Fig. 5, respectively. We consider “Mother and Daughter” as a single large foreground video object (VO). The man in “Hall”, in contrast, is a relatively small VO with articulated motion.
Conclusions
A 2-D mesh-based hierarchical segmentation and tracking method with occlusion detection has been proposed. The results show that the method can successfully discriminate between multiple moving objects and track them in the presence of self-occlusion and object-to-object occlusion. An important consideration for the mesh-based tracking approach is that the mesh topology may need to be updated in the presence of articulated motion (e.g., legs crossing each other or hand movements) as well as object-to-object occlusion.
References (15)
- A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tuncel, T. Sikora, Image sequence analysis for emerging interactive...
- Y. Altunbaşak, P.E. Eren, A.M. Tekalp, Region-based parametric motion segmentation using color information, Graphical...
- R. Castagno, T. Ebrahimi, M. Kunt, Video segmentation based on multiple features for interactive multimedia...
- I. Celasun, A.M. Tekalp, Optimal 2D hierarchical content-based mesh design and update for object-based video, IEEE...
- L. Chiariglione, MPEG and multimedia communications, IEEE Trans. Circuits Systems Video Technol. 7 (1) (February 1997)...
- Y. Fu, A.T. Erdem, A.M. Tekalp, Tracking visible boundary of objects using occlusion-adaptive motion snake, IEEE Trans....
- M. Isard, A. Blake, Condensation – Conditional density propagation for visual tracking, Internat. J. Comput. Vision 29...
☆ This work was supported by TÜBİTAK under contract EEEAG-198E011.