1 Introduction

The availability of several mobile devices, such as digital cameras, cell phones, and tablets, has enabled people to generate a large amount of multimedia content. The growth of digital data, particularly video streaming, has contributed to the advance of many knowledge domains, for instance, entertainment, education, telemedicine, robotics, surveillance and security.

In contrast to the computationally expensive and time-consuming task of manual annotation, the development of automatic and scalable strategies for storing, indexing, transmitting and retrieving multimedia data [15, 18] is crucial for managing such massive growth of digital content.

Shot boundary detection plays an important role in temporal video segmentation [6, 9, 10, 12, 19], whose purpose is to partition the video content into meaningful units, known as shots, from which the most representative keyframes are selected. The summary of a video can then be constructed from the set of keyframes that represent the shots.

Two categories of video transitions between shots are commonly defined: gradual and abrupt transitions. A gradual transition represents a smooth change over several frames, whereas an abrupt transition corresponds to a cut between one frame of a shot and its adjacent frame in the next shot.

There are various challenges associated with the video shot boundary detection task, such as illumination variability, camera motion, diversity of video genres, as well as the inherent subjectivity of the segmentation process. Although different video shot boundary detection methods have been proposed in the literature [2, 3, 8, 16], two common steps are generally performed: (i) a similarity or dissimilarity measure is computed for each pair of consecutive frames and (ii) a cut is detected if the measure is higher than a specified threshold.

This work investigates and evaluates a novel shot boundary detection approach based on bit planes extracted from the video frames. An adaptive thresholding scheme is employed to determine if a transition is an abrupt shot boundary. Experiments are conducted on two public video benchmarks to show the effectiveness of the proposed method. Results are compared to other approaches available in the literature.

This paper is organized as follows. The proposed shot boundary detection method is detailed in Sect. 2. Experimental results are presented and discussed in Sect. 3. Finally, concluding remarks and directions for future work are included in Sect. 4.

2 Methodology

The proposed video shot boundary detection method is based on the bit planes that compose a grayscale image. Since a pixel in a grayscale frame typically requires 1 byte of storage, 8 binary images can be produced by taking each bit plane independently. Figure 1 illustrates the bit planes extracted as binary images from a grayscale image (video frame). Each of these images describes different details about the pixel intensities, so a set of these images is used to extract features of the frame.

Fig. 1. Grayscale image and its decomposition into binary bit plane images.
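The decomposition into bit planes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bit_planes` and the representation of a frame as a list of rows of integers in 0..255 are assumptions made for the example, with plane 0 taken as the least significant bit.

```python
def bit_planes(gray, planes=range(8)):
    """Split an 8-bit grayscale image into binary images.

    `gray` is assumed to be a list of rows of ints in 0..255.
    Returns a dict mapping plane index (0 = least significant bit)
    to a binary image with the same shape as `gray`.
    """
    return {
        b: [[(pixel >> b) & 1 for pixel in row] for row in gray]
        for b in planes
    }


# Tiny 2x2 "frame" used only for illustration.
frame = [[0, 255],
         [128, 5]]
planes = bit_planes(frame)
print(planes[7])  # [[0, 1], [1, 0]]
print(planes[0])  # [[0, 1], [0, 1]]
```

Summing `planes[b][r][c] << b` over all eight planes recovers the original pixel value, which is a quick check that the decomposition is lossless.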

Algorithm 1 describes how the features are extracted. For each bit plane image, a horizontal and a vertical projection are calculated. These projections differ from the standard summation of foreground pixels [7], whose purpose is to describe an object present in the image, since here the information about the background is also relevant.

Algorithm 1. Feature extraction from a bit plane image.

The background computation is important in the projection process since there are often cases with no foreground pixels, whose sum would result in zero. To avoid this, a transformation is applied to convert the binary image values from \(\{0, 1\}\) to \(\{-1, 1\}\). Thus, the projections correspond to the difference between the number of foreground and background pixels along each row and column, yielding positive values where the foreground exceeds the background, zero where they are equal, and negative values otherwise.
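The signed projections can be illustrated with a short sketch. The function name `projection_features` is hypothetical, and two choices are assumptions rather than details given in the text: the horizontal projection is taken as one sum per row, and the two projections are joined in row-then-column order.

```python
def projection_features(binary):
    """Signed row and column projections of a binary image.

    `binary` is assumed to be a list of rows of 0/1 values.
    """
    # Map background 0 -> -1 and foreground 1 -> +1.
    signed = [[2 * p - 1 for p in row] for row in binary]
    # Assumption: horizontal projection = one sum per row.
    row_sums = [sum(row) for row in signed]
    # Assumption: vertical projection = one sum per column.
    col_sums = [sum(col) for col in zip(*signed)]
    # Concatenation order (rows first) is an illustrative choice.
    return row_sums + col_sums


plane = [[1, 1, 0],
         [0, 0, 0]]
print(projection_features(plane))  # [1, -3, 0, 0, -2]
```

The second row contains no foreground pixels, yet its projection is -3 rather than 0, which is the effect the transformation is meant to achieve.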

The concatenation of the horizontal and vertical projections constitutes the feature vector for a bit plane image. To compute the frame dissimilarities, a correlation-based distance (Eq. 1) is computed between the feature vectors of equivalent bit planes in frames t and \(t-1\). The total dissimilarity is the average distance over the set of bit planes. Algorithm 2 summarizes this procedure.

$$\begin{aligned} \textit{correlation}(u, v) = 1 - \frac{(u - \overline{u}) \cdot (v - \overline{v})}{\Vert (u - \overline{u})\Vert _2 \, \Vert (v - \overline{v})\Vert _2} \end{aligned}$$
(1)

where u and v are two feature vectors.
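A minimal sketch of Eq. 1 and of the averaging over bit planes follows. The function names and the dict-based interface (plane index mapped to feature vector) are hypothetical. Eq. 1 is undefined when a centered vector has zero norm (a constant feature vector); returning 0.0 in that case is an assumption of this sketch, not something stated in the text.

```python
import math


def correlation_distance(u, v):
    """Correlation distance of Eq. 1 between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(du, dv))
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if norm == 0.0:
        # Assumption (not in the source): treat the undefined case as distance 0.
        return 0.0
    return 1.0 - dot / norm


def frame_dissimilarity(features_prev, features_curr):
    """Average Eq. 1 over bit planes.

    Each argument is assumed to map a plane index to that plane's
    feature vector for frame t-1 and frame t, respectively.
    """
    distances = [
        correlation_distance(features_prev[b], features_curr[b])
        for b in features_curr
    ]
    return sum(distances) / len(distances)


print(correlation_distance([1, 2, 3], [2, 4, 6]))  # close to 0 (perfectly correlated)
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # close to 2 (anti-correlated)
```

The distance ranges from 0 for perfectly correlated vectors to 2 for anti-correlated ones, so the per-frame average lies in the same interval.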

Algorithm 2. Computation of the dissimilarity between consecutive frames.

Finally, the vector of dissimilarities for the entire video is subjected to a locally adaptive thresholding method [14]. The method normalizes the vector into the range [0, 1], slides a moving window over it, computes the median value within the window, and declares a shot boundary if the center of the window is the maximum value within the window and is greater than the median plus a fixed \(\alpha \). Figure 2 shows an example of a dissimilarity vector after applying the thresholding process.

Fig. 2. Illustration of the thresholding technique.
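The thresholding rule can be sketched as follows. The function name `detect_cuts` is hypothetical, and two details are assumptions of this sketch rather than details taken from [14]: normalization is done by min-max scaling, and positions too close to the ends of the vector to hold a full window are skipped. The default window size and \(\alpha\) are the values reported in Sect. 3.3.

```python
from statistics import median


def detect_cuts(dissimilarity, window=7, alpha=0.2):
    """Return indices flagged as cuts by the local adaptive threshold."""
    if not dissimilarity:
        return []
    # Assumption: min-max scaling to [0, 1].
    lo, hi = min(dissimilarity), max(dissimilarity)
    span = hi - lo
    norm = [(d - lo) / span if span > 0 else 0.0 for d in dissimilarity]

    half = window // 2
    cuts = []
    # Assumption: borders without a full window are not evaluated.
    for center in range(half, len(norm) - half):
        win = norm[center - half:center + half + 1]
        is_peak = norm[center] == max(win)
        if is_peak and norm[center] > median(win) + alpha:
            cuts.append(center)
    return cuts


values = [0.1, 0.1, 0.2, 0.1, 0.9, 0.1, 0.1, 0.2, 0.1, 0.1]
print(detect_cuts(values))  # [4]
```

Because the threshold follows the local median, a moderate peak in a quiet region can be detected while the same value inside a noisy region is ignored.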

3 Experimental Results

In this section, we present the experimental setup and analyze the results obtained on two data sets with our shot boundary detection method based on bit planes.

3.1 Data Sets

The TRECVID’2002 data set [17] consists of 18 video sequences with variability in quality, length, production style and noise level. The minimum number of cuts in a video is 18 and the maximum is 163, with an average of 83 cuts per video. The duration of the longest video is 28 min 48 s, whereas the shortest is 6 min 32 s. This data set was used as a benchmark for the shot boundary detection task of the TREC Video Retrieval Evaluation competition. Unlike some recent competition data sets, this benchmark is freely available.

The VIDEOSEG’2004 data set [20] is composed of 10 video sequences of different genres, sizes, digitization quality and production effects. The maximum number of cuts in a video is 87 and the minimum is 0, with an average of 28 cuts per video.

3.2 Evaluation Metrics

The evaluation guidelines available for the TRECVID competition are followed in this work, such that the results are reported in terms of precision, recall and \(F_{score}\), as expressed in Eqs. 2, 3 and 4, respectively. Precision indicates the capacity of the method to detect only real cuts, recall indicates its capacity to find all existing cuts, and the \(F_{score}\) is the harmonic mean of the two.

$$\begin{aligned} \text {Precision}&= \frac{\displaystyle \sum _{f_i \in V}{S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}}}{\sum _{f_i \in V} {S(i) \in \textit{Cut}}} \end{aligned}$$
(2)
$$\begin{aligned} \text {Recall}&= \frac{\displaystyle \sum _{f_i \in V}{S(i) \in \textit{Cut} \wedge i \in \textit{True Cut}}}{\sum _{f_i \in V} {i \in \textit{True Cut}}} \end{aligned}$$
(3)
$$\begin{aligned} F_{\textit{score}}&= 2 \, \frac{\text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$
(4)

where V is a video and S is the detection set.

As established in the TRECVID competition, we set a tolerance of \(\pm 5\) frames for the frame number of a detected transition. This accounts for possible shifts in frame numbering caused by different video encoders.
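The metrics of Eqs. 2 to 4 with the frame tolerance can be sketched as below. The function name `evaluate_cuts` is hypothetical, and the greedy one-to-one matching of detections to ground-truth cuts is an assumption of this sketch; the official TRECVID evaluation tool may match differently.

```python
def evaluate_cuts(detected, ground_truth, tolerance=5):
    """Precision, recall and F-score for cut frame numbers."""
    unmatched = sorted(ground_truth)
    true_positives = 0
    for d in sorted(detected):
        # Assumption: each true cut can validate at most one detection.
        match = next((g for g in unmatched if abs(g - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1

    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative frame numbers only, not data from the experiments.
p, r, f = evaluate_cuts(detected=[10, 52, 300], ground_truth=[12, 50, 120, 200])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

In this toy case two of the three detections fall within 5 frames of a true cut, and two of the four true cuts are found.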

3.3 Parameter Settings

As in other approaches [13, 14], the parameters used in the thresholding stage are kept unchanged: the window size is set to 7 and \(\alpha = 0.2\). The feature extraction method and the distance calculation require only the set of bit planes as a parameter.

An empirical experiment is carried out to determine the most appropriate set of bit planes with respect to \(F_{score}\). A search is executed on the TRECVID’2002 data set by evaluating all combinations of the 8 bit planes. Table 1 shows the best results for each number of planes used. It is important to highlight that the results reported for VIDEOSEG’2004 are based on the same set as chosen for TRECVID’2002.
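The exhaustive search over bit plane sets can be sketched with `itertools.combinations`. The function `best_plane_sets` and its `score` argument are hypothetical: `score` stands in for a routine that runs the detector with a given set of planes and returns its \(F_{score}\) on the data set. The toy scoring function below is a placeholder with no relation to the reported results.

```python
from itertools import combinations


def best_plane_sets(score, n_planes=8):
    """For each subset size k, return the best-scoring set of bit planes.

    `score` is a hypothetical callable mapping a tuple of plane
    indices to a quality value such as the F-score.
    """
    best = {}
    for k in range(1, n_planes + 1):
        best[k] = max(combinations(range(n_planes), k), key=score)
    return best


# Number of non-empty subsets of 8 planes that must be evaluated.
n_subsets = sum(1 for k in range(1, 9) for _ in combinations(range(8), k))
print(n_subsets)  # 255

# Placeholder score: simply favors higher plane indices.
toy = best_plane_sets(score=sum)
print(toy[1], toy[2])  # (7,) (6, 7)
```

With only 255 candidate sets, the search is cheap compared with feature extraction, provided the per-plane distances are computed once and reused across subsets.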

Table 1. Results for different combinations of bit planes.

As expected, the most significant bits occur in the combinations with higher accuracy. Nonetheless, bit plane 6 presented better results than the other most significant planes, such as bit plane 7. This can be explained by the fact that the more significant a bit is, the more susceptible it is to small contrast variations, which also justifies the preference for bit 0 over other more significant bits. When the three most significant bits are present, bit 0 works to regulate the average and avoid false transitions caused by illumination changes.

To balance accuracy across both data sets with a single set of bit planes, the results presented in the next section use the set of 4 bit planes {0, 5, 6, 7}.

3.4 Results

Table 2 shows the results on the TRECVID’2002 data set, including two baselines in which the dissimilarities between frames were calculated through color histograms with 32 bins and through cross-correlation between pixels. Both baseline methods were implemented in our framework with the same adaptive threshold, such that only the features and distances were varied.

Table 3 shows comparative results for the VIDEOSEG’2004 data set. The results for the baseline methods were extracted from the literature.

For the TRECVID’2002 data set, our method surpassed the other methods in terms of precision while maintaining an adequate recall rate. High precision is particularly difficult to achieve on this data set since the video sequences contain several different types of transitions and glitch effects, which may be confused with cuts. For the VIDEOSEG’2004 data set, our method did not obtain the best precision or recall rate individually; however, it achieved a good trade-off between both, resulting in the highest \(F_{\textit{score}}\) measure.

Table 2. Video cut detection results for TRECVID’2002.
Table 3. Comparative results for VIDEOSEG’2004.

4 Conclusions

The decomposition of a grayscale image into bit planes was explored in this work for the purpose of video shot boundary detection. The bit planes are employed as binary images, from which feature vectors are extracted by means of vertical and horizontal projections and used in the computation of the dissimilarity between adjacent video frames.

Experiments were conducted on two data sets to assess the proposed methodology. The results show that features based on bit plane images are relevant for computing the similarity between images. The method outperformed other cut detection approaches from the literature and, under the same settings and conditions, surpassed the color histogram baseline.