Hyper-rectangle based segmentation and clustering of large video data sets

https://doi.org/10.1016/S0020-0255(01)00195-5

Abstract

Video information processing has been one of the most challenging areas in the database community, since it requires a huge amount of storage space and processing power. In this paper, we investigate the problem of clustering large video data sets, which are collections of video clips, as foundational work for subsequent processing such as video retrieval. A video clip, which is a sequence of video frames, is represented by a multidimensional data sequence. The sequence is partitioned into video segments by considering the temporal relationship among frames, and then similar segments of the clip are grouped into video clusters. We present an effective video segmentation and clustering algorithm that guarantees clustering quality to the extent that predefined conditions are satisfied, and we show its effectiveness via experiments on various video data sets.

Introduction

Recently, video information has become widely used in many application areas such as news broadcasting, video on demand, and video conferencing, as digital storage technology and computing power have advanced significantly over the last decade. These applications involve searching, consuming, or exchanging large volumes of complex video data. To handle such voluminous data sources, it is essential that video data be effectively represented, stored, and retrieved.

A video database may contain a number of video clips that can be represented by multidimensional data sequences (MDSs). In our earlier work [11], we formally defined an MDS S with K points in the n-dimensional space as a sequence of its component vectors, S=〈S[1],S[2],…,S[K]〉, where each vector S[j] (1⩽j⩽K) is composed of n scalar entries, that is, S[j]=(S_1[j],S_2[j],…,S_n[j]). A video clip consists of multiple frames in temporal order, each of which can be represented by a multidimensional vector in a feature space such as the RGB or YCbCr color space. Thus, a video clip is modeled as a sequence of points in a multidimensional space such that each frame of the sequence constitutes a multidimensional point whose components are the feature values of that frame. By modeling a video clip as an MDS, the problem of clustering frames in a video clip is transformed into that of clustering points of an MDS in a multidimensional space. Each sequence is partitioned into video segments (or video shots), and then similar segments are grouped into a video cluster. Fig. 1 shows the hierarchical structure of video data.
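
As a minimal sketch of the MDS definition above, the following Python snippet stores a clip as a list of equal-length feature vectors in temporal order. The names `Point`, `MDS`, and `make_mds`, and the sample feature values, are illustrative assumptions and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]   # one frame's feature vector S[j] with n entries
MDS = List[Point]           # S = <S[1], ..., S[K]>, kept in temporal order


def make_mds(frames: Sequence[Sequence[float]]) -> MDS:
    """Build an MDS, checking that every point has the same dimensionality n."""
    if not frames:
        raise ValueError("an MDS needs at least one point")
    n = len(frames[0])
    if any(len(f) != n for f in frames):
        raise ValueError("all points must have the same number of components")
    return [tuple(float(x) for x in f) for f in frames]


# Hypothetical clip: three frames described by mean (R, G, B) values in [0, 1].
clip = make_mds([(0.10, 0.20, 0.30), (0.12, 0.21, 0.29), (0.80, 0.75, 0.70)])
K, n = len(clip), len(clip[0])   # K points in an n-dimensional space
```

Keeping the points in a list rather than a set preserves the temporal ordering that the later segmentation step relies on.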

Clustering has attracted great interest in many database applications such as customer segmentation, sales analysis, pattern recognition, and similarity search. The task of clustering data points can be defined as follows: given a set of points in a multidimensional space, partition the points into clusters such that points within each cluster have similar characteristics while points in different clusters are dissimilar. A point that is considerably dissimilar to or inconsistent with the remainder of the data is referred to as an outlier or noise.

Various clustering methods have been studied in the database community. However, the clustering of video data should be handled differently from the existing clustering methods in several respects. First, in video clustering, the temporal relationship among frames and among video segments must be taken into account, since the temporal ordering of frames and video segments is an intrinsic feature of video data. Existing methods do not consider it. Second, a target object to be clustered in existing methods is mapped to a single point in a multidimensional space and thus belongs to a single cluster, while a video clip is represented by multiple points that can be partitioned into multiple separate clusters. Third, the shapes of clusters must also be treated differently. The existing methods look at quantitative properties of clusters, independent of how the clusters will be used. They determine a certain number of clusters that optimize given criteria such as the mean square error. Thus, the shapes of clusters are determined arbitrarily, depending on the distribution of points in the data space. In contrast, we consider, in addition to the clustering itself, the subsequent retrieval process, such as "Find video clips that are similar to a given news video." Therefore, the shapes of clusters should be appropriate for this purpose.

In video search, it is usual that one or more key frames are selected for each video segment, and a query is processed on the selected frames [7]. However, a search by key frames does not guarantee correctness, since the key frames cannot summarize all the frames of the segment. In [11], we proposed a similarity search scheme based on the hyper-rectangle that tightly bounds all points (or frames) in the segment, rather than on key frames, in order to prevent false dismissals. We believe that guaranteeing correctness is one of the important features of similarity search. In addition, the shapes of clusters should suit the indexing mechanism. We use a hyper-rectangle as the shape of a cluster, since currently dominant indexing mechanisms such as the R-tree [9] and its variants [3], [4], [14] are based on a minimum bounding rectangle (MBR) as their node shape.
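
The following sketch illustrates why a tightly bounding hyper-rectangle can avoid false dismissals: the distance from a query point to the rectangle never exceeds the distance to any point inside it. This is a generic illustration assuming Euclidean distance; the function names and sample values are hypothetical, and the actual search scheme of [11] may differ.

```python
import math
from typing import List, Sequence, Tuple


def bounding_rectangle(points: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the (low, high) corners of the tightest axis-aligned box around points."""
    low = [min(coords) for coords in zip(*points)]
    high = [max(coords) for coords in zip(*points)]
    return low, high


def min_dist_to_rectangle(q: Sequence[float], low: Sequence[float], high: Sequence[float]) -> float:
    """Euclidean distance from q to the nearest point of the box (0.0 if q is inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, low, high)))


# Hypothetical segment of three 2-dimensional frame vectors and one query point.
segment = [(0.1, 0.2), (0.3, 0.25), (0.2, 0.4)]
query = (0.5, 0.1)

low, high = bounding_rectangle(segment)
bound = min_dist_to_rectangle(query, low, high)
nearest = min(math.dist(query, p) for p in segment)
# bound <= nearest always holds, so pruning a segment only when bound exceeds
# the search radius cannot discard a segment that contains a qualifying frame.
```

A key frame, by contrast, is a single point, so its distance to the query can be larger than that of some other frame in the same segment.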

The representation and the retrieval of video data place various special requirements on clustering techniques, motivating the need for designing a new clustering algorithm. Those requirements are categorized into two classes as follows: the geometric characteristics of clusters, and the temporal and semantic relationship among elements in a cluster.

First, the cluster should be dense with respect to (wrt.) the volume and the edge for efficient retrieval, which is achieved by minimizing the volume and the edge of a cluster per point and maximizing the number of points per cluster. Next, the temporal and semantic relationships among elements in a cluster should be maintained. This means that the information on the temporal ordering of elements in a cluster should be preserved, and elements in a cluster should be semantically similar. In addition to these requirements, the algorithm should be able to deal with outliers appropriately, and the number of its input parameters should be minimized. Considering these requirements, the clustering problem in this paper is formalized as follows:

  • Given: A data set of video clips and the minimum number of points minPts per video segment.

  • Goal: To find the sets of video clusters and outliers that optimize the values of predefined measurement criteria.


The input parameter minPts is needed to determine outliers. In our method, each point in a sequence is initially regarded as a segment with a single point, and then closely related segments are repeatedly merged to form a cluster. If a segment has far fewer points than the average after the segmentation process, all points in it can be considered outliers. For instance, if a segment with 2 or 3 points is located away from other segments, it is very likely to be a set of outliers. What counts as "far fewer" is, of course, determined heuristically depending on the application. Too small a value of minPts causes unimportant segments to be indexed, degrading memory utilization, while too large a value causes meaningful segments to be missed. In this context, if a segment has fewer points than the given minPts value after the segmentation process, all points in the segment are regarded as outliers. Those outliers are not indexed, but are written out to disk for later processing.
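
The minPts rule described above can be sketched as a simple filter over the segments produced by the segmentation step. The function name `split_outliers` and the sample data are assumptions made for illustration; writing the outliers to disk is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def split_outliers(segments: Sequence[Sequence[Point]], min_pts: int):
    """Separate segments to be indexed from points treated as outliers.

    A segment with fewer than min_pts points contributes all of its points
    to the outlier list, following the rule stated in the text.
    """
    kept: List[Sequence[Point]] = []
    outliers: List[Point] = []
    for seg in segments:
        if len(seg) < min_pts:
            outliers.extend(seg)
        else:
            kept.append(seg)
    return kept, outliers


# Hypothetical result of segmentation: one long segment and one 2-point segment.
segments = [
    [(0.10, 0.10), (0.11, 0.12), (0.12, 0.11), (0.13, 0.12)],
    [(0.90, 0.05), (0.91, 0.06)],
]
kept, outliers = split_outliers(segments, min_pts=3)
```

With `min_pts=3`, the two isolated points are set aside and only the four-point segment goes on to clustering and indexing.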

In the first step of our method, video clips are parsed to generate a data set of MDSs. Feature values are extracted from each frame of the video clip by averaging the color values of the pixels of the frame or of segmented blocks of the frame. As an optional step, if the dimensionality of the generated data is high, it is reduced to a lower dimensionality to avoid the curse of dimensionality. High-dimensional data is often impractical to use directly, since it needs a huge amount of storage space and causes severe processing overhead.
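
As a rough sketch of the feature extraction step, the snippet below averages the color values over all pixels of a frame to obtain one 3-dimensional point. It assumes 8-bit RGB pixels and scales the result to [0, 1]; the frame layout, the scaling, and the name `frame_to_point` are assumptions, and the block-wise variant is not shown.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]   # assumed 8-bit (R, G, B) values in 0..255


def frame_to_point(frame: Sequence[Sequence[Pixel]]) -> Tuple[float, float, float]:
    """Average each color channel over the whole frame and scale to [0, 1]."""
    pixels = [px for row in frame for px in row]
    if not pixels:
        raise ValueError("frame has no pixels")
    count = len(pixels)
    r, g, b = (sum(px[c] for px in pixels) / (255.0 * count) for c in range(3))
    return (r, g, b)


# Hypothetical 2x2 frame: top row pure red, bottom row black.
frame = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 0), (0, 0, 0)],
]
point = frame_to_point(frame)   # (0.5, 0.0, 0.0)
```

Applying `frame_to_point` to every frame of a clip, in order, yields the sequence of points that forms the clip's MDS.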

In the next step, the generated MDS is partitioned into video segments such that predefined geometric and semantic criteria are satisfied. Outliers are also identified in this process. Finally, in the clustering process, similar segments of a sequence are grouped into a video cluster to improve the clustering quality. In this way, a given video clip is represented by a small number of video clusters, which are indexed and stored in a database for later processing. In this paper, we focus on the segmentation and clustering processes. The overall structure is shown in Fig. 2.

The segmentation and clustering method proposed in this paper is foundational work for the creation of video databases, and can be used in various application domains such as video digital libraries, video on demand, news on demand, and tele-education systems. The application emphasized in this paper is the segmentation and clustering of video data sets, but we believe that other application areas in which data can be represented in the form of an MDS can also benefit. For example, audio sequences, time series data, and various analog signals can be represented by MDSs, and thus our method can be applied to them.

The rest of the paper is organized as follows: Section 2 provides a survey of related works with a brief discussion on clustering data points and data sequences. Section 3 includes basic definitions, clustering characteristics, and various measurements of clustering quality. The segmentation process is described in Section 4 with an algorithm to produce video segments from an MDS. Section 5 provides the clustering process to generate video clusters by merging video segments. Experimental results are presented in Section 6 and we give conclusions in Section 7.

Section snippets

Related works

Many excellent approaches to clustering data points in a multidimensional space have been proposed, such as CLARANS [13], BIRCH [15], DBSCAN [5], CLIQUE [2], and CURE [8].

CLARANS is a clustering algorithm that is based on randomized search and gets its efficiency by reducing the search space using user-supplied input parameters. The algorithm BIRCH constructs a hierarchical data structure called the CF-tree for multiphase clustering by scanning a database and uses an arbitrary clustering

Preliminaries

In this section, we discuss various characteristics of a hyper-rectangle that is used to define a video segment and a video cluster, and clustering factors to be considered for effective segmentation and clustering. Table 1 summarizes symbols and definitions used in this paper.

Video segmentation

Once multidimensional sequences have been generated from video clips, each sequence is partitioned into video segments. The segmentation is the repeating process of merging a point of the sequence into a hyper-rectangle if predefined criteria are satisfied. Consider a point P to be merged to a hyper-rectangle HR=〈L,H,k〉 in the unit space [0,1]^n. Then, the segmentation is done in such a way that if the merging of P into HR satisfies certain given conditions then it is merged into the current
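
The merging loop can be sketched as below, under loud assumptions: the paper's actual merging conditions are not shown in this excerpt, so the sketch substitutes a hypothetical criterion (the longest edge of the extended hyper-rectangle must not exceed `max_edge`). It also reads L and H as the low and high corners and k as the point count, which is an interpretation of 〈L,H,k〉 rather than a quoted definition.

```python
from typing import List, Sequence, Tuple

# Assumed reading of <L, H, k>: low corner, high corner, number of points.
HR = Tuple[List[float], List[float], int]


def extend(hr: HR, p: Sequence[float]) -> HR:
    """Return the hyper-rectangle obtained by merging point p into hr."""
    low, high, k = hr
    return ([min(l, x) for l, x in zip(low, p)],
            [max(h, x) for h, x in zip(high, p)],
            k + 1)


def segment(sequence: Sequence[Sequence[float]], max_edge: float) -> List[HR]:
    """Scan points in temporal order, merging each into the current segment
    while the HYPOTHETICAL criterion (longest edge <= max_edge) holds."""
    if not sequence:
        return []
    segments: List[HR] = []
    current: HR = (list(sequence[0]), list(sequence[0]), 1)
    for p in sequence[1:]:
        candidate = extend(current, p)
        low, high, _ = candidate
        if max(h - l for l, h in zip(low, high)) <= max_edge:
            current = candidate                    # criterion met: keep growing
        else:
            segments.append(current)               # close the current segment
            current = (list(p), list(p), 1)        # start a new one at p
    segments.append(current)
    return segments


# Hypothetical sequence with an abrupt change after the third frame.
seq = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.13), (0.80, 0.82), (0.81, 0.80)]
segs = segment(seq, max_edge=0.1)   # two segments, with 3 and 2 points
```

Because points are only ever merged into the segment that immediately precedes them, each resulting segment covers a contiguous run of frames, which keeps the temporal ordering intact.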

Video clustering

After video segments are generated from an MDS, those segments that are spatially close need to be merged together to promote the clustering quality defined in Eq. (8). It is important to determine whether two hyper-rectangles of video segments or clusters are to be merged or not. Merging two hyper-rectangles is allowed as long as the predefined condition is satisfied. This process generates larger clusters gradually to optimize given measurement criteria. We formally define the video cluster
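
Merging two hyper-rectangles can be sketched as follows. The merge itself (component-wise minimum and maximum of the corners, summed point counts) follows directly from the bounding-rectangle idea, but the acceptance test is a hypothetical stand-in: Eq. (8) and the paper's predefined condition are not included in this excerpt, so the sketch uses total edge length per point, one reading of the density requirement stated in the introduction.

```python
from typing import List, Tuple

# Assumed reading of <L, H, k>: low corner, high corner, number of points.
HR = Tuple[List[float], List[float], int]


def merge(a: HR, b: HR) -> HR:
    """Smallest hyper-rectangle enclosing both a and b, with summed point count."""
    (la, ha, ka), (lb, hb, kb) = a, b
    return ([min(x, y) for x, y in zip(la, lb)],
            [max(x, y) for x, y in zip(ha, hb)],
            ka + kb)


def edge_per_point(hr: HR) -> float:
    """Total edge length divided by the number of points (lower means denser)."""
    low, high, k = hr
    return sum(h - l for l, h in zip(low, high)) / k


def should_merge(a: HR, b: HR) -> bool:
    """HYPOTHETICAL condition: merge only if the result is at least as dense
    (by edge per point) as the sparser of the two inputs."""
    return edge_per_point(merge(a, b)) <= max(edge_per_point(a), edge_per_point(b))


# Hypothetical segments: a and b overlap, c lies far from both.
a: HR = ([0.10, 0.10], [0.20, 0.20], 4)
b: HR = ([0.15, 0.15], [0.25, 0.25], 4)
c: HR = ([0.80, 0.80], [0.90, 0.90], 4)

near = should_merge(a, b)   # True: the merged box is denser per point
far = should_merge(a, c)    # False: the merged box spans mostly empty space
```

This stand-in condition needs no extra threshold, which is in keeping with the stated aim of minimizing the number of input parameters, but it should not be read as the paper's criterion.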

Experiments

In order to evaluate the effectiveness of our proposed method, we have conducted experiments on data sets of various real-world videos such as TV news, dramas, and animation films. Our experiment focuses on showing the clustering quality of the method wrt. predefined measurements mentioned in Section 3.4. In this section, we describe our preparation for the experiment and give the results with brief analyses.

Conclusions

The retrieval of video data sets is one of the great potential areas in database applications, even though it has not been widely studied. For an efficient retrieval of video data sets, the clustering process is essential as a foundational work for representing, indexing, and storing video data sets. In this paper, we have investigated the segmentation and clustering of large video data sets. To solve the problem, we have first discussed clustering factors considering geometric and semantic

Acknowledgements

We would like to thank anonymous reviewers for their valuable comments and Prof. Timothy Shih, the guest editor, for his help. This research was supported by the Korea Research Foundation Grant (KRF-2000-041-E00262).

References (15)

  • C.C Aggarwal et al.

    Fast algorithms for projected clustering

  • R Agrawal et al.

    Automatic subspace clustering of high dimensional data for data mining applications

  • N Beckmann et al.

    The R*-tree: an efficient and robust access method for points and rectangles

  • S Berchtold et al.

    The X-tree: an index structure for high-dimensional data

  • M Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

  • C Faloutsos et al.

    Fast subsequence matching in time-series databases

  • M Flickner et al.

    Query by image and video content: the QBIC system

    IEEE Computer

    (1995)
There are more references available in the full text version of this article.