
1 Introduction

Every day, an enormous amount of surveillance video is captured around the clock throughout the world for providing security, monitoring, preventing crime, and controlling traffic. In general, a number of surveillance cameras are set up in different places of a building, business area, or congested area. These cameras are connected to a monitoring cell where the footage is stored and investigated. Storing this huge volume of video data requires tremendous memory space. In addition, to find important events in the stored video for investigation or analysis, operators need to browse the stored footage, which is a tedious, lengthy, and expensive process. To solve these problems, a method for generating a shorter version of the original video that contains the important events is highly desirable for memory management and information retrieval.

Video summarization (VS) is the technique of selecting the most informative frames so that the summary contains all the necessary events while rejecting unnecessary content, keeping the summarized video as concise as possible. Therefore, a good video summarization method has several important properties. First, it must be capable of including all significant incidents of the original video. Second, it should generate a much smaller version of the given long video. Third, it should not contain repetitive information. The main purpose of VS is to represent a long original video in a condensed form such that a user can grasp the whole content of the entire video in a constrained amount of time.

In a video, foreground objects generally carry more detailed information [1]. Moreover, humans usually concentrate more on the movements of objects [2]. Consequently, objects and their motion are important features of a video. Motivated by these findings, a video summarization scheme based on objects and their motion is proposed in this paper. To include foreground object information, Gaussian mixture-based parametric background modeling (BGM) [37] has been applied. To acquire complete information about object motion in a video, object motion is extracted not only in the spatial domain but also in the frequency domain. To obtain motion information in the spatial domain, consecutive frame differencing is applied. To obtain object motion in the frequency domain, the phase correlation technique [8, 9] is applied. To the best of our knowledge, phase correlation has not previously been applied in video summarization methods. Therefore, the main contribution of this paper is the application of phase correlation to video summarization. The computational cost of phase correlation is very low, and it provides rich motion information [8].

The structure of the remaining paper is as follows. Section 2 reviews related research. The proposed method is described in Sect. 3. Experimental results and detailed discussion are provided in Sect. 4. Finally, concluding remarks and future directions are given in Sect. 5.

2 Related Research

In the literature, different approaches have been proposed for summarizing various types of videos. For egocentric video summarization, region saliency is predicted in [10] using a regression model and a storyboard is generated based on region importance scores. The method proposed in [11] summarizes story-driven egocentric video by discovering the most influential objects within the video. Gaze tracking information is applied in [12] for summarization. For user-generated video summarization, an adaptive submodular maximization function is applied in [13]. A collaborative sparse coding model is utilized in [14] for generating summaries of the same type of videos. Web images are used in [15] to enhance the process of summarizing user-generated video. To summarize movies, aural, visual, and textual features are merged in [16]. A role community network is applied in [17]. Film comics are generated using eye tracking data in [18].

However, the importance of surveillance video for industrial applications is much higher than that of other types of videos (e.g., egocentric, user generated, movie). To summarize surveillance video, an object-centered technique is applied in [19]. Dynamic VideoBook is proposed in [20] for representing surveillance video in a hierarchical order. A learned distance metric is introduced in [21] for summarizing nursery school surveillance video. In [22], salient motion information is applied. Maximum a posteriori probability (MAP) estimation is used in [23] for synopsis generation. Recently, a method for multi-view surveillance video summarization was proposed in [1]. Firstly, a single-view summary is generated in this approach for each sensor independently. For this purpose, the MPEG-7 color layout descriptor is applied to each video frame and an online Gaussian mixture model (GMM) is used for clustering. The key frames are selected based on the cluster parameters. As the decision of selecting or rejecting a frame is made based on the continuous updates of these clustering parameters, a video segment is extracted instead of key frames. Lastly, the multi-view summary is produced by applying a distributed view selection method to the video segments extracted for each sensor in the previous step.

To the best of our knowledge, the phase correlation technique has not been applied to video summarization. In the proposed approach, phase correlation is applied to incorporate motion information in the frequency domain, which is then fused with the moving foreground object information and the spatial motion information.

3 The Proposed Method

The proposed scheme is based on the area of moving foreground objects and their motion information in the spatial and frequency domains. The main steps of the proposed method are (1) moving foreground object extraction, (2) motion information calculation in the spatial domain, (3) motion estimation in the frequency domain, (4) fusion of the foreground object area with the spatial and frequency motion information, and (5) video summary generation. The flow chart of the proposed method is shown in Fig. 1. The details of each step are explained in the subsequent sub-sections.

Fig. 1. Framework of the proposed method

3.1 Foreground Object Extraction

In the proposed method, Gaussian mixture-based parametric BGM [37] is applied. In this parametric BGM, each pixel is modeled by k Gaussian distributions (k = 3), and each Gaussian model represents either static background or a dynamic foreground object over time. For instance, suppose a pixel intensity \(X_t\) at time t is modeled by the kth Gaussian with recent value \(\gamma _k^t\), mean \(\mu _k^t\), standard deviation \(\sigma _k^t\), and weight \(\omega _k^t\) such that \(\sum \omega _k^t=1\). A learning parameter is used to update the parameter values, such as the mean, standard deviation, and weight. At the beginning, the system contains an empty set of Gaussian models. After observing the first pixel (t = 1), a new Gaussian model (k = 1) is generated with \(\gamma _k^t=\mu _k^t=X_t\), standard deviation \(\sigma _k^t=30\), and weight \(\omega _k^t=0.001\). Then, for each new observation of pixel intensity \(X_t\) at the same location at time t, the method tries to find a matched model among the existing models such that \(|X_t - \mu _k |\le 2.5\sigma _k\).
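
The following Python sketch illustrates the per-pixel matching and model-creation logic described above. It is a simplified, illustrative implementation only; the replacement of the weakest model when k models already exist, and all function and variable names, are our own assumptions and not the full parametric BGM of [37], which also updates means, variances, and weights with a learning rate.

```python
MAX_MODELS = 3        # k = 3 Gaussians per pixel
INIT_STD = 30.0       # initial standard deviation for a new model
INIT_WEIGHT = 0.001   # initial weight for a new model

def update_pixel_models(models, x_t):
    """Match intensity x_t against the existing Gaussians of one pixel.

    `models` is a list of dicts with keys 'mu', 'sigma', 'weight'.
    A model matches if |x_t - mu| <= 2.5 * sigma; otherwise a new model
    is created (replacing the weakest one if k models already exist -- an
    assumption, as the replacement rule is not stated in the text).
    """
    for m in models:
        if abs(x_t - m['mu']) <= 2.5 * m['sigma']:
            return models, m  # matched model found
    new_model = {'mu': float(x_t), 'sigma': INIT_STD, 'weight': INIT_WEIGHT}
    if len(models) < MAX_MODELS:
        models.append(new_model)
    else:
        weakest = min(range(len(models)), key=lambda i: models[i]['weight'])
        models[weakest] = new_model
    return models, new_model
```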

To obtain a gray-scale background frame, the background modeling [37] is applied after converting each color frame into a gray-scale image. Then, a color video frame at time t is converted into a gray image I(t) and subtracted from the corresponding gray background frame B(t) obtained by the background modeling. A pixel is considered a foreground pixel and its value is set to one if the intensity difference between I(t) and B(t) is greater than or equal to a certain threshold (Thr1). If the pixel does not satisfy this condition, it is regarded as a background pixel and set to zero. In this way, a foreground pixel \(G_{i,j}(t)\) is obtained as follows

$$\begin{aligned} G_{i,j}(t) = {\left\{ \begin{array}{ll} 1, &{} \text {if} \quad \vert I_{i,j}(t)-B_{i,j}(t) \vert \ge Thr1 \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where (i, j) is the pixel position. The value of Thr1 is set to 20 to avoid subtle changes between background and foreground. It is common practice to set this threshold to 20 to separate objects from the background, as mentioned in [4]. After that, the total number of non-zero pixels in \(G_{i,j}(t)\) is used as the foreground object area feature G(t), which is obtained by the following equation

$$\begin{aligned} G(t)=\sum _{i=1}^{r}\sum _{j=1}^{c} G_{i,j}(t), \end{aligned}$$
(2)

where r and c represent the number of rows and columns of G respectively. According to psychological theories of human attention, motion information is more significant than static attention cues [2]. Therefore, motion information is included in the proposed method in addition to the foreground object.
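
As a rough illustration, Eqs. (1) and (2) can be implemented as follows; the function name and the assumption that the gray-scale frames are provided as NumPy arrays are ours.

```python
import numpy as np

THR1 = 20  # intensity threshold between background and foreground

def foreground_area(gray_frame, gray_background):
    """Eq. (1)-(2): binary foreground mask G_{i,j}(t) and its area G(t)."""
    diff = np.abs(gray_frame.astype(np.int16) - gray_background.astype(np.int16))
    mask = (diff >= THR1).astype(np.uint8)   # G_{i,j}(t)
    return mask, int(mask.sum())             # (mask, G(t))
```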

3.2 Motion Information Calculation in Spatial Domain

Humans usually concentrate more on the movements of objects in a video [2]. In order to obtain object motion information in the spatial domain, frame-to-frame differencing is applied. Consider two consecutive frames \(I(t-1)\) and I(t) at times \(t-1\) and t in a video. To find the spatial motion information, the color differences in the red, green, and blue channels between these frames are calculated. If the differences at a pixel in all three channels are greater than or equal to a threshold (T2), this pixel is considered a motion pixel and set to one. Otherwise, the pixel is regarded as containing no motion information. Therefore, the spatial motion information \(S_{i,j}(t)\) at pixel (i, j) and time t is obtained by the following equation

$$\begin{aligned} S_{i,j}(t) = {\left\{ \begin{array}{ll} 1, &{} \text {if} \quad \vert I_{i,j}^{r}(t)-I_{i,j}^{r}(t-1) \vert \ge T2 \\ &{} \text {and} \quad \vert I_{i,j}^{g}(t)-I_{i,j}^{g}(t-1) \vert \ge T2 \\ &{} \text {and} \quad \vert I_{i,j}^{b}(t)-I_{i,j}^{b}(t-1) \vert \ge T2 \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

where \(I_{i,j}^{r}\), \(I_{i,j}^{g}\), and \(I_{i,j}^{b}\) represent the red, green, and blue channels at pixel (i, j) respectively. It is common practice to set the threshold to 20 for detecting changes between two consecutive frames [4]. Therefore, the value of T2 is set to 20. The spatial motion information S(t) at time t is obtained by summing all values of \(S_{i,j}(t)\) as follows

$$\begin{aligned} S(t)=\sum _{i=1}^{r}\sum _{j=1}^{c} S_{i,j}(t), \end{aligned}$$
(4)

where r and c represent the number of rows and columns of S respectively. However, motion extracted in the spatial domain does not handle diffuse phenomena well [24]. For example, it does not work well when global illumination changes occur. Additionally, spatial motion estimation is prone to local inaccuracies and small motion discontinuities [25].
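
A minimal sketch of Eqs. (3) and (4) is given below, assuming the two consecutive frames are available as H x W x 3 RGB NumPy arrays; the names are ours.

```python
import numpy as np

T2 = 20  # per-channel threshold for consecutive-frame differences

def spatial_motion(frame_prev, frame_curr):
    """Eq. (3)-(4): per-pixel motion mask S_{i,j}(t) and total S(t).

    A pixel is a motion pixel only if the absolute difference is at least
    T2 in all three color channels.
    """
    diff = np.abs(frame_curr.astype(np.int16) - frame_prev.astype(np.int16))
    mask = np.all(diff >= T2, axis=2).astype(np.uint8)  # S_{i,j}(t)
    return mask, int(mask.sum())                        # (mask, S(t))
```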

3.3 Motion Estimation in Frequency Domain

To overcome the problems of spatial motion calculation, motion information is also calculated in the frequency domain. Motion estimated in the frequency domain has some advantages over the spatial domain [24]: it is robust to global changes of illumination and to motion estimation near object boundaries. To obtain motion information in the frequency domain, each frame is divided into a number of blocks of \(16 \times 16\) pixels. Then, the phase correlation technique [8] is applied between each current block and its co-located reference block. The phase correlation peak \( \beta \), whose magnitude indicates the reliability of the estimated motion, is used as the motion indicator for that block. After applying the fast Fourier transform (FFT) to each block, the phase difference between the current block and its co-located reference block is calculated. The inverse FFT of this phase difference gives the phase correlation surface \( \phi \), from which the two-dimensional (2-D) motion vector (dx, dy) is finally obtained [26]. The surface \( \phi \) contains peaks \( \beta \) where there are shifts between the current and reference blocks.

$$\begin{aligned} \phi =\mathrm {fftshift}\left( \vert \mathrm {ifft}(e^{j(\angle F_{r} - \angle F_{c})} ) \vert \right) , \end{aligned}$$
(5)

where \(F_{r}\) and \(F_{c}\) represent the FFT of the reference and current blocks respectively.

$$\begin{aligned} (dx,dy)=\arg \max (\phi )-b/2-1 \end{aligned}$$
(6)
$$\begin{aligned} \beta = \phi (dx+b/2+1, dy+b/2+1) \end{aligned}$$
(7)

where b is the block size; for example, b = 16 if \(16 \times 16\) blocks are used. If the value of \(\beta \) of a block is greater than a threshold (T3), the block is considered to contain sufficient motion information. In this method, the value of T3 is set to 0.6. All the \(\beta \) values greater than T3 are summed to obtain the motion information F(t) in the frequency domain.

$$\begin{aligned} F(t)=\sum _{n=1}^{N}\sum _{m=1}^{M} F_{n,m}(t), \end{aligned}$$
(8)

where N and M represent the number of block rows (rows/b) and block columns (columns/b) respectively, and \(F_{n,m}(t)\) is the peak value \(\beta \) of block (n, m) if it exceeds T3 and zero otherwise. The motion information obtained in the frequency domain by applying the phase correlation technique to different blocks of frame no. 3869 of the bl-14 video is shown in Fig. 2. No motion is represented in block (4,4), which has only a single highest peak (Fig. 2(b)). Single motion (block (10,14)) and complex motion (block (5,8)) are shown in Fig. 2(c) and (d) respectively. However, frequency-based motion estimation suffers from a localization problem [24]. Therefore, motions obtained in both the spatial and frequency domains are combined with the foreground objects for video summarization.
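
A simplified sketch of the block-wise phase correlation described by Eqs. (5)-(8) is given below, using NumPy FFT routines. The frame tiling, the zero handling of blocks whose peak does not exceed T3, and the function names follow our reading of the text and are assumptions rather than the authors' implementation.

```python
import numpy as np

BLOCK = 16   # block size b
T3 = 0.6     # peak threshold

def block_peak(cur_block, ref_block):
    """Eqs. (5)-(7): correlation surface, shift (dx, dy), and peak beta."""
    Fc = np.fft.fft2(cur_block)
    Fr = np.fft.fft2(ref_block)
    phase_diff = np.exp(1j * (np.angle(Fr) - np.angle(Fc)))
    phi = np.fft.fftshift(np.abs(np.fft.ifft2(phase_diff)))
    dy, dx = np.unravel_index(np.argmax(phi), phi.shape)
    beta = phi[dy, dx]                       # peak magnitude
    return dx - BLOCK // 2, dy - BLOCK // 2, beta

def frequency_motion(cur_frame, ref_frame):
    """Eq. (8): sum of peaks beta over blocks whose beta exceeds T3.

    Both frames are 2-D gray-scale arrays of the same size.
    """
    h, w = cur_frame.shape
    total = 0.0
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            _, _, beta = block_peak(cur_frame[y:y+BLOCK, x:x+BLOCK],
                                    ref_frame[y:y+BLOCK, x:x+BLOCK])
            if beta > T3:
                total += beta
    return total  # F(t)
```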

Fig. 2. An example of the motion generated in each block of frame no. 3869 of the bl-14 video; (a) frame difference between frames 3868 and 3869 (multiplied by 6 for better visualization); phase correlation peaks with no motion, complex motion, and single motion are represented in (b), (c), and (d) respectively.

3.4 Fusion of Foreground and Motion Information

In order to select more representative frame sequences, both the area of foreground objects and the motion information are combined. In this approach, a weighted linear fusion is applied to combine the features and rank each frame according to its representativeness in the video. Before applying the fusion, each feature is normalized using z-score normalization as follows

$$\begin{aligned} Z(t)=(X(t)-\mu _{f})/ \sigma _{f}, \end{aligned}$$
(9)

where X(t) is a feature value at time t, \(\mu _{f}\) is the mean, and \(\sigma _{f}\) is the standard deviation of the feature values. The z-score Z(t) is the normalized form of X(t). In this scheme, z-score normalization is preferred because it produces meaningful information about each data point and behaves better in the presence of outliers than min-max normalization [27]. The weighted linear fusion is obtained as follows

$$\begin{aligned} A(t)= W_{1} \times Z_{G}(t)+ W_{2} \times Z_{S}(t) + W_{3} \times Z_{F}(t), \end{aligned}$$
(10)

where A(t) is the fusion value; \(Z_{G}(t)\), \(Z_{S}(t)\), and \(Z_{F}(t)\) are the z-score normalized foreground feature (G(t)), spatial motion feature (S(t)), and frequency domain motion feature (F(t)) respectively at time t. It is found empirically that setting the weights \(W_{1}\), \(W_{2}\), and \(W_{3}\) to 15, 60, and 25 respectively provides better results for all videos in the BL-7F dataset. The rationale for giving a higher weight to the motion features than to the foreground area is that, according to psychological theories of human attention, motion information is more significant than static attention cues [2]. After that, A(1), A(2), A(3), ..., A(T) is sorted in descending order, where T is the total number of frames in a video.
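
A minimal sketch of the z-score normalization, weighted fusion, and ranking of Eqs. (9)-(10) follows; the per-frame feature arrays are assumed to be precomputed, and the weights are taken from the values stated above.

```python
import numpy as np

W1, W2, W3 = 15, 60, 25  # weights for foreground, spatial, and frequency features

def zscore(x):
    """Eq. (9): z-score normalization of a 1-D feature sequence."""
    return (x - x.mean()) / x.std()

def fuse_and_rank(G, S, F):
    """Eq. (10): weighted fusion A(t), plus frame indices ranked by
    descending fusion value. G, S, F are length-T sequences (one value per frame)."""
    A = W1 * zscore(np.asarray(G, dtype=float)) \
        + W2 * zscore(np.asarray(S, dtype=float)) \
        + W3 * zscore(np.asarray(F, dtype=float))
    return A, np.argsort(-A)   # fusion values and ranked frame indices
```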

3.5 Video Summary Generation

The proposed method summarizes a video based on a threshold (\(Thr_{kf}\)) provided by the user. From the sorted list A(1), A(2), A(3), ..., A(T), the top \(Thr_{kf}\) frames are selected. A video summary is generated from the selected frames by preserving their chronological order.
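
Continuing the ranking sketch above, the selection step can be expressed as follows; writing out the summary video from the selected frame indices is omitted.

```python
def select_key_frames(ranked_indices, thr_kf):
    """Pick the top thr_kf ranked frames and restore chronological order."""
    selected = ranked_indices[:thr_kf]
    return sorted(int(i) for i in selected)

# Example (names from the previous sketch are assumed):
# _, ranked = fuse_and_rank(G, S, F)
# summary_frames = select_key_frames(ranked, thr_kf=100)
```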

4 Results and Discussion

The proposed method is applied to the publicly available BL-7F dataset [1], which contains 19 surveillance videos. A complete list of ground truth key frames for each video is also provided with the BL-7F dataset. The foreground object, spatial motion, and frequency domain motion information extracted by the proposed method are shown in Fig. 3(d), (e), and (f) respectively.

Fig. 3. An example of the foreground and motion information extracted by the proposed method; (a) and (b) are frames 740 and 741 of the bl-0 video, (c) is the background image of (b), (d) is the foreground image of (b), (e) is the object motion between the two frames, and (f) is the motion obtained by the phase correlation technique on frame 741.

In Fig. 4, a number of ground truth frames of the bl-11 video of the BL-7F dataset [1] and the results obtained by the GMM based method and the proposed method are shown. The GMM based method fails to select frames 9963 and 12523 even though they contain significant content. In contrast, the proposed method selects these frames successfully. The main reason for this success is that the proposed method combines the area of foreground objects with frequency and spatial motion information.

Fig. 4. Evaluation of key frame extraction for the bl-11 video of the BL-7F dataset; the first, second, third, and fourth columns indicate the frame number, ground truth, results obtained by the GMM based method, and the proposed method respectively.

In this experiment, the quality of the summarized video is evaluated using precision, recall, and F1-measure, which are obtained by the following equations

$$\begin{aligned} Precision=T_{p}/(T_{p}+F_{p}), \end{aligned}$$
(11)
$$\begin{aligned} Recall=T_{p}/(T_{p}+F_{n}), \end{aligned}$$
(12)
$$\begin{aligned} F_{1}-measure =2 \times \frac{Recall \times Precision}{Recall + Precision}, \end{aligned}$$
(13)

where \(T_{p}\), \(F_{p}\), and \(F_{n}\) indicate true positives, false positives, and false negatives respectively.
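
For reference, Eqs. (11)-(13) amount to the following small helper; counting true and false positives against the ground truth key frames is assumed to be done elsewhere.

```python
def summary_metrics(tp, fp, fn):
    """Eqs. (11)-(13): precision, recall, and F1-measure from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1
```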

The proposed approach is compared with the single-view video summarization results provided by the GMM based method [1]. The GMM based method is the most relevant state-of-the-art method to compare with because both techniques use a GMM for training. However, there are two key differences between the GMM based method [1] and the Gaussian mixture-based parametric BGM [4]. Firstly, the GMM based method works at the frame level whereas the Gaussian mixture-based BGM works at the pixel level. Secondly, the GMM based method utilizes a color descriptor as the feature while the Gaussian mixture-based parametric BGM uses pixel intensities. Another reason to compare with the GMM based method is that both techniques use the standard BL-7F dataset to verify performance.

Table 1. Precision, recall, and F1-measure of GMM [1] and the proposed method

In the proposed method, the summarization threshold \(Thr_{kf}\) is set to the total number of key frames suggested by the ground truth for each video. It is found that the introduced scheme generates more accurate results if \(Thr_{kf} \times 1.35\) frames are selected from the ranked list A(1), A(2), A(3), ..., A(T), where T is the total number of frames in a video. The precision, recall, and F1-measure of the GMM based method and the proposed method are shown in Table 1. It is observed from Table 1 that the mean precision, recall, and F1-measure of the GMM based method are 59.1, 82.2, and 0.63 respectively. In contrast, the proposed method shows enhanced performance, with a mean precision of 83.6, recall of 94.2, and F1-measure of 0.88. In addition, the standard deviations (STDs) of precision, recall, and F1-measure of the proposed method are much lower than those of the GMM based method (see Table 1). This indicates that the proposed method not only achieves higher accuracy but also performs more consistently across the different videos compared to the GMM based method [1].

Fig. 5. F1-measure of the proposed and GMM based approaches.

The F1-measures of the GMM based method (F-GMM), the proposed method using only foreground objects (f-foreground), only spatial motion (f-spatial), only motion in the frequency domain (f-frequency), and the combination of all these features (f-proposed) are plotted in Fig. 5. From this graph, it is evident that the proposed method using only foreground objects performs worse than the GMM based method on the bl-2, bl-4, and bl-12 videos. If only spatial motion is considered, it fails to outperform the GMM based method on the bl-12 and bl-17 videos. Similarly, the proposed method applying only the phase correlation technique shows poor results on the bl-2, bl-12, and bl-17 videos. The proposed approach therefore combines all these features and outperforms the GMM based method on 18 videos. However, the proposed technique fails to perform better than the GMM based method [1] on the bl-12 video. After inspecting the key frames extracted by the proposed method for the bl-12 video, the reasons for this failure were identified. In the bl-12 video, there are some frames with significant objects and motion that are not selected as ground truth frames by [1], while some frames without any foreground object and/or motion are considered as ground truth. For example, frames 4083, 4120, and 4563 contain a sufficient amount of object and motion content, as shown in the first row of Fig. 6. In these frames, it is clearly visible that a person is working near the door. However, these frames are not selected as ground truth (key frames). On the other hand, no object and no significant motion exist in frames 12615, 12675, and 12750 (see the second row of Fig. 6). Nonetheless, they are selected as key frames (ground truth). No explanation for this is given in [1]. After evaluating the proposed method quantitatively and qualitatively, it is revealed that the proposed method, based on foreground objects and motion in the spatial and frequency domains, performs better than the state-of-the-art GMM based method [1].

Fig. 6. Sample frames of the bl-12 video that are not selected as ground truth (first row) and that are considered as key frames (second row).

5 Conclusion

In this paper, a novel framework is proposed to summarize surveillance video by combining foreground objects with motion information in the spatial and frequency domains. According to [1], foreground objects usually contain detailed information about the video content. Moreover, human beings naturally give more attention to object motion in a video [2]. Therefore, these two important properties of a video are included in this approach. To include motion information in the frequency domain, the phase correlation technique [8] is applied. To the best of our knowledge, the phase correlation technique is applied here for the first time for video summarization. Extensive experimental results reveal that the proposed method outperforms the state-of-the-art method.