Semi-automatic object-based video segmentation with labeling of color segments

doi:10.1016/S0923-5965(02)00092-9

Signal Processing: Image Communication

Volume 18, Issue 1, January 2003, Pages 51-65

https://doi.org/10.1016/S0923-5965(02)00092-9

Abstract

In this paper we propose a semi-automatic method for general object-based segmentation of image sequences. A label field is initialized by the user in the first frame of the sequence and is then automatically tracked through the remaining frames based on the color and motion properties of the various objects in the scene. We propose a novel statistical model based on the local properties of the objects. The locality introduced in our modeling allows the tracking of complex objects that, globally, might be inhomogeneous in their color and motion characteristics. The labeling criterion is the maximization of the joint probability of the labels and the observed color and motion properties. The proposed method uses an initial color-based segmentation, which is obtained for each frame of the sequence. Both the modeling and the optimization are expressed in terms of the statistical properties of the color segments. Experimental results are presented on real image sequences that include complex human motion in cluttered backgrounds.

Introduction

Boosted by technological advances in communications and computer engineering, the amount of distributed visual information has exploded in the last decade. This fact, together with the applications that are continuously emerging, calls for the development of a wide range of methods for distributing and processing the available visual information. Object-based segmentation of image sequences is one of the issues that often arise in the world of video processing and communications. By partitioning each frame into segments that correspond to meaningful objects, we obtain a representation of the scene that allows coding, delivering and viewing the visual information in terms of its actual contents. The applications that can benefit are numerous. Content-based functionalities such as object-dependent coding quality or object manipulation for video editing can now be developed. Furthermore, in the context of MPEG-7 [7], a semantically rich representation allows the attachment of meaningful descriptors to the data, thus boosting applications that revolve around the retrieval of multimedia material.

Activity in the area of object-based segmentation in the last decade soon revealed that fully automatic, general object-based segmentation of image sequences is a chimera. On the one hand, the most popular assumptions that lead to robust schemes—such as the assumption that the camera is static [15] or the assumption that the objects are rigidly moving [18]—are too strict for general scenes. On the other hand, depending on the application and the user, the goal of the segmentation may vary; the problem of general object-based segmentation is therefore inherently ill-posed.

In order to overcome these problems, semi-automatic methods started to emerge. Typically, in these schemes a label field is initialized by the user in the first frame of the sequence and then automatically tracked in the subsequent frames. The challenge in such schemes lies in robustly tracking multiple complex objects in complex environments. Obviously, the global homogeneity assumptions on motion characteristics that lie behind the robustness of many automatic schemes do not hold true for complex objects. Even the assumption that it is possible to decompose the scene based on the kinematic behavior of the depicted objects might not hold true. Other sources of information, such as color and/or texture, should also be used. For such sources of information, adopting global homogeneity models is in general more problematic than in the case of motion: objects that are semantically meaningful are unlikely to exhibit global homogeneity in such properties, and/or discontinuities might be difficult to detect. Such inhomogeneity in the objects’ properties makes it probable that the detected object borders are attracted by details and structure in neighboring objects. Temporal evidence can support a correct decision, but it should be used with care; even if the label field of the previous frame is perfectly estimated, the motion field that establishes correspondences with the current frame is notoriously known to contain errors.

The semi-automatic method for object-based labeling (i.e. object-based segmentation) that we propose here is based on all of the above considerations. We follow the dominant paradigm in which the label field is initialized by the user and then automatically tracked for the rest of the sequence. An outline of the proposed method is illustrated in Fig. 1; a preliminary version of the paper was presented in [17]. The method operates at three levels. At Level 1 (pixel level) a feature vector is estimated for each pixel in the current frame. At Level 2 (segment level) a color-based segmentation method decomposes the current frame into a number of color segments. Subsequently, we estimate the statistical properties of the color segments. Finally, at Level 3 (object level) a labeling based on a probabilistic classification of the color segments takes place. The labeling imposes a “common fate” on the pixels belonging to the same color segment. The classification is based on a novel local statistical modeling of the color and motion properties of the objects in the scene. The classification criterion is the maximization of the joint probability of the label field and the observations (i.e. feature vectors), with respect to the label field. For the maximization a deterministic iterative local search algorithm is proposed.

The relation of our method with other methods in the literature will be addressed in the next section. Let us note here that the novel local modeling that we propose allows the modeling of multiple objects that do not exhibit global homogeneity in their characteristics. Furthermore, we propose explicit modeling of the color and motion characteristics as well as the use of color segments as primary elements in the modeling and in the classification. In this way, we aim at robustness against spurious edges in the background and against errors in the estimation of the motion field that establishes correspondences with the label field in the previous frame. Finally, our scheme requires the manual setting of only one parameter: the degree of the locality of the proposed modeling.

The remainder of the paper is organized as follows. In Section 3 we briefly describe the color-based segmentation method, the user interaction and the motion estimation algorithm. In Section 4 we describe the local modeling of the objects’ properties and in Section 5 we formulate and solve the labeling as an optimization problem. In Section 6 the projection procedure is described. In Section 7 we present experimental results and in Section 8 we draw conclusions.

Section snippets

Related work

Through the prism of the proposed method we will concisely review semi-automatic methods in which the user interaction is limited to the initialization of the label field at the first frame of the sequence. We will concentrate on three main categories of methods. To the first category belong methods that track a non-parametric contour of the object. To the second category belong methods that statistically model the properties of the regions that correspond to each of the objects in the scene. To

Feature extraction and user interaction

In this section we will discuss the issues related to the first two levels of the proposed method, as well as issues related to the user interaction phase.

At the lowest level (Level 1 in Fig. 1), for each pixel $i$ in the current frame a feature vector $x_{i}$ which characterizes its color and motion properties is defined. The feature vector has five components: one for each of the three dimensions of the chosen color space and two for the horizontal and vertical components of the motion.
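As an illustration only, the five-component feature vector described above can be assembled per pixel as in the following minimal sketch. The function name `build_feature_vectors`, the nested-list layout of the inputs, and the ordering of the components (three color values followed by horizontal and vertical motion) are assumptions made for this example; the excerpt does not specify the color space or the data layout.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]      # three components of the chosen color space
Motion = Tuple[float, float]            # (horizontal, vertical) motion components
Feature = Tuple[float, float, float, float, float]


def build_feature_vectors(
    color: List[List[Color]], motion: List[List[Motion]]
) -> List[List[Feature]]:
    """Concatenate color and motion into one 5-component vector x_i per pixel.

    Hypothetical layout: color[r][c] and motion[r][c] refer to the same pixel.
    """
    if len(color) != len(motion):
        raise ValueError("color and motion must have the same number of rows")
    features: List[List[Feature]] = []
    for color_row, motion_row in zip(color, motion):
        if len(color_row) != len(motion_row):
            raise ValueError("color and motion rows must have the same length")
        features.append(
            [(c1, c2, c3, u, v) for (c1, c2, c3), (u, v) in zip(color_row, motion_row)]
        )
    return features
```

Keeping color and motion in a single vector lets later stages treat both cues uniformly when estimating segment statistics.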

As far as the

Local mixture modeling

At the highest level of the method (Level 3 in Fig. 1) the color segments are labeled according to the local statistical color and motion properties of the objects. Statistical representations of the color and motion characteristics of an object have been used in the literature (e.g. [5], [13]) for segmenting and tracking objects in complex scenes. In both [5] and [13] a mixture of Gaussians is used to model the color and motion features, and in both the labeling is performed per pixel.
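To make the notion of a mixture-of-Gaussians model over feature vectors concrete, the sketch below evaluates the log-density of a feature vector under a generic mixture. It is not the local model proposed in the paper, whose definition is not part of this excerpt. The diagonal covariance, the function names, and the argument layout are simplifying assumptions chosen for the example.

```python
import math
from typing import Sequence


def diag_gaussian_logpdf(
    x: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """Log-density of a Gaussian with diagonal covariance (an assumption here)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def mixture_logpdf(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log of sum_k w_k N(x; mean_k, diag(var_k)), computed with log-sum-exp.

    Weights are assumed strictly positive and summing to one.
    """
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    # Subtracting the largest term avoids underflow for unlikely observations.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain matters in practice because products of small densities over many pixels underflow quickly.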

Maximization of joint probability

The labeling criterion is the maximization of Eq. (1). This is equivalent to the maximization of its logarithm, $L(X,L) = \ln P(X,L) = \sum_{i} \ln p(x_{i} \mid \theta_{sn}(L)) + \sum_{i} \ln \pi_{sn}(L)$. We employ an iterative local search algorithm which generates a sequence of label fields $L^{k}$ (where $k$ denotes the iteration) that increase the logarithm of the joint probability $L(X,L)$. The optimization procedure also involves the initialization of the label field ($L^{0}$). For the first frame this is provided as the result of a watershed
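A deterministic iterative local search of the kind mentioned above can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the callable `log_joint` stands in for the logarithm of the joint probability, each color segment is visited in a fixed order, and a label change is accepted only if it strictly increases the objective. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def local_search(
    initial: Sequence[int],
    candidates: Sequence[int],
    log_joint: Callable[[List[int]], float],
    max_iters: int = 100,
) -> Tuple[List[int], float]:
    """Greedy relabeling of segments; the objective never decreases across sweeps.

    initial    : one label per color segment (the initial label field)
    candidates : the object labels a segment may take
    log_joint  : hypothetical objective, higher is better
    """
    labels = list(initial)
    best = log_joint(labels)
    for _ in range(max_iters):
        changed = False
        for s in range(len(labels)):
            keep = labels[s]
            for lab in candidates:
                if lab == keep:
                    continue
                labels[s] = lab            # tentatively relabel segment s
                score = log_joint(labels)
                if score > best:           # accept only strict improvements
                    best, keep, changed = score, lab, True
            labels[s] = keep               # restore the best label found for s
        if not changed:                    # a full sweep without change: local maximum
            break
    return labels, best
```

Because only strict improvements are accepted and the number of label fields is finite, the search terminates at a local maximum, which is why the quality of the initial label field matters.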

Object projection

One of the most important issues in the segmentation of image sequences is the temporal coherency of the label field, that is, how consistent the label fields are over time. Dealing with this issue involves establishing a link between the label field at the current frame and the label field at the previous frame. This is usually achieved with a motion-based projection of the label field estimated at the previous frame.
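A motion-based projection of a label field can be illustrated with the following minimal sketch. It assumes a forward motion field defined on the previous frame, rounds displacements to the nearest pixel, marks pixels that receive no label with `-1`, and lets later writes overwrite earlier ones when several pixels land on the same target. These are choices made for the example; the excerpt does not describe the projection actually used in the paper.

```python
from typing import List, Tuple


def project_labels(
    prev_labels: List[List[int]],
    motion: List[List[Tuple[float, float]]],
    unlabeled: int = -1,
) -> List[List[int]]:
    """Carry each previous-frame label along its motion vector.

    motion[r][c] = (u, v): assumed horizontal and vertical displacement of
    pixel (r, c) from the previous frame to the current frame.
    """
    rows = len(prev_labels)
    cols = len(prev_labels[0]) if rows else 0
    projected = [[unlabeled] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            u, v = motion[r][c]
            nr, nc = r + int(round(v)), c + int(round(u))
            if 0 <= nr < rows and 0 <= nc < cols:
                projected[nr][nc] = prev_labels[r][c]  # later writes win on collisions
    return projected
```

Pixels left unlabeled, and labels misplaced by motion errors, are exactly the cases in which the projected field cannot be trusted on its own.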

In our approach, the temporal consistency affects the label field at the first

Experimental results

The method has been tested on a number of image sequences. Here we present results for three of them, namely the “mother”, the “jardin” and the “hieue” image sequences. All of them contain complex objects that are inhomogeneous in their color and motion characteristics, clutter in the background, and occlusions. We present comparative results with two methods that are representative of the first and the third category of the methods that were reviewed in Section 2.

The “mother” sequence depicts a

Conclusions

In this paper we presented a semi-automatic method for labeling image sequences based on a local model-based statistical classification algorithm. An initial color segmentation scheme partitions each frame into a number of segments which are subsequently labeled on the basis of their color and motion statistics. The labeling is expressed as an optimization problem, where the criterion is the maximization of the joint probability of the labels and the color and motion distribution within each

References (25)

Y. Altunbasak et al., Region-based parametric motion segmentation using color information, Graphical Models Image Process. (January 1998)
S. Ayer, Sequential and competitive methods for estimation of multiple motions, Ph.D. Thesis, Ecole Polytechnique...
R.G. Brown et al., Introduction to Random Signals and Applied Kalman Filtering (1996)
R. Castagno et al., Video segmentation based on multiple features for interactive multimedia applications, IEEE Trans. Circuits Systems Video Technol. (September 1998)
E. Chalom, Statistical image sequence segmentation using multidimensional attributes, Ph.D. Thesis, Massachusetts...
A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Ser. B (1977)
A. Vetro (Ed.), MPEG-7: Applications document, Technical Report ISO/IEC JTC1/SC29/WG11/N3934, MPEG, January 2001. Pisa....
C. Gu et al., Semiautomatic segmentation and tracking of semantic video objects, IEEE Trans. Circuits Systems Video Technol. (September 1998)
C. Gu, Multivalued morphology and segmentation-based coding, Ph.D. Thesis, Ecole Polytechnique Federale de Lausanne,...
F. Marques, J. Llach, Tracking of generic objects for video object generation, in: Proceedings of the IEEE...
H.T. Nguyen et al., Tracking non-parametrized object contours in video, IEEE Trans. Image Process. (September 2002)
N.E. O'Connor, N. Brady, S. Marlow, Supervised image segmentation using EM-based estimation of mixture density...

