Abstract
Surveillance video cameras capture a large amount of continuous video every day. Manually identifying significant events in this huge volume of video data is laborious and tedious. To solve this problem, this paper proposes a video summarization technique that combines foreground objects with motion information in the spatial and frequency domains. Foreground objects are extracted using background modeling. Motion information in the spatial domain is obtained from frame-to-frame differences, and motion information in the frequency domain is acquired with the phase correlation (PC) technique. The foreground objects and the motion in both domains are then fused, and key frames are extracted. Experimental results reveal that the proposed method performs better than the state-of-the-art method.
1 Introduction
Every day an enormous amount of surveillance video is captured around the clock throughout the world for providing security, monitoring, preventing crime, and controlling traffic. In general, a number of surveillance cameras are set up in different places of a building, business area, or congested area. These cameras are connected to a monitoring cell for storage and investigation. Storing this huge volume of video data requires tremendous memory space. In addition, to find important events in the stored video for investigation or analysis, operators need to go through the stored videos. This process is tedious, lengthy, and expensive. To solve these problems, a method for generating a shorter version of the original video that contains the important events is highly desirable for memory management and information retrieval.
Video summarization (VS) is the technique of selecting the most informative frames so that the summary contains all the necessary events while rejecting unnecessary content, making the summarized video as concise as possible. A good video summarization method therefore has several important properties. First, it must be able to include all significant incidents of the original video. Second, it should generate a much shorter version of the provided long video. Third, it should not contain repetitive information. The main purpose of VS is to represent a long original video in a condensed form so that a user can grasp the whole idea of the entire video in a limited amount of time.
In a video, foreground objects generally contain more detailed information [1]. Moreover, humans usually concentrate more on the movements of objects [2]. Consequently, objects and their motion are important features of a video. Motivated by these findings, this paper proposes a video summarization scheme based on the objects in a video and their motion. To include foreground object information, Gaussian mixture-based parametric background modeling (BGM) [3–7] is applied. To acquire complete information about object motion, the motion is extracted not only in the spatial domain but also in the frequency domain. Motion information in the spatial domain is obtained from the difference between consecutive frames. Object motion in the frequency domain is obtained with the phase correlation technique [8, 9]. To the best of our knowledge, phase correlation has not previously been applied to video summarization. Therefore, the main contribution of this paper is the application of phase correlation to video summarization. The computational time of phase correlation is very low, and it provides rich motion information [8].
The remainder of the paper is structured as follows. Section 2 reviews related research. The proposed method is described in Sect. 3. Experimental results and a detailed discussion are provided in Sect. 4. Finally, concluding remarks and future directions are given in Sect. 5.
2 Related Research
In the literature, different approaches have been proposed for summarizing various types of videos. For egocentric video summarization, region saliency is predicted in [10] using a regression model, and storyboards are generated based on region importance scores. In the method proposed in [11], story-driven egocentric video is summarized by discovering the most influential objects within a video. Gaze tracking information is applied in [12] for summarization. For user generated video summarization, an adaptive submodular maximization function is applied in [13]. A collaborative sparse coding model is utilized in [14] for generating summaries of the same type of videos. Web images are used in [15] to enhance the process of summarizing user generated video. To summarize movies, aural, visual, and textual attention cues are merged in [16]. A role community network is applied in [17]. A film comic is generated using eye tracking data in [18].
However, surveillance video is much more important for industrial applications than other types of video (e.g., egocentric, user generated, movie). To summarize surveillance video, an object-centered technique is applied in [19]. The Dynamic VideoBook is proposed in [20] for representing surveillance video in a hierarchical order. A learned distance metric is introduced in [21] for summarizing nursery school surveillance video. In [22], salient motion information is applied. Maximum a posteriori probability (MAP) estimation is used in [23] for synopsis generation. Recently, a method was proposed in [1] for multi-view surveillance video summarization. In this approach, a single-view summarization is first generated for each sensor independently. For this purpose, the MPEG-7 color layout descriptor is applied to each video frame, and an online Gaussian mixture model (GMM) is used for clustering. The key frames are selected based on the cluster parameters. As the decision to select or neglect a frame is based on the continuous updates of these clustering parameters, a video segment is extracted instead of key frames. Lastly, the multi-view summarization is produced by applying a distributed view selection method to the video segments extracted for each sensor in the previous step.
To the best of our knowledge, the phase correlation technique has not been applied to video summarization. In the proposed approach, the phase correlation technique is applied to incorporate motion information in the frequency domain, which is fused with the moving foreground object and spatial motion information.
3 The Proposed Method
The proposed scheme is based on the area of moving foreground objects and their motion information in the spatial and frequency domains. The main steps of the proposed method are (1) moving foreground object extraction, (2) motion information calculation in the spatial domain, (3) motion estimation in the frequency domain, (4) fusion of the foreground object area with the spatial and frequency motion information, and (5) video summary generation. The flow chart of the proposed method is shown in Fig. 1. The details of each step are explained in the following subsections.
3.1 Foreground Object Extraction
In the proposed method, Gaussian mixture-based parametric BGM [3–7] is applied. In this parametric BGM, each pixel is modeled by k Gaussian distributions (k=3), and each Gaussian model represents either the static background or a dynamic foreground object over time. For instance, suppose a pixel intensity \(X_t\) at time t is modeled by the kth Gaussian with recent value \(\gamma _k^t\), mean \(\mu _k^t\), standard deviation \(\sigma _k^t\), and weight \(\omega _k^t\) such that \(\sum \omega _k^t=1\). A learning parameter is used to update the parameter values, such as the mean, standard deviation, and weight. At the beginning, the system contains an empty set of Gaussian models. After observing the first pixel (t=1), a new Gaussian model (k=1) is generated with \(\gamma _k^t=\mu _k^t=X_t\), standard deviation \(\sigma _k^t=30\), and weight \(\omega _k^t=0.001\). Then, for each new observation of pixel intensity \(X_t\) at the same location at time t, the method tries to find a matched model among the existing models such that \(|X_t - \mu _k |\le 2.5\sigma _k\).
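The matching rule described above can be illustrated for a single pixel with a minimal sketch. The initial values (standard deviation 30, weight 0.001, match factor 2.5, at most three models) come from the text; the dictionary layout, the function name `match_or_create`, and the policy of replacing the lowest-weight model when all three slots are full are assumptions made only for illustration. The parameter update with the learning rate is omitted because the text does not give its formula.

```python
INIT_SIGMA = 30.0     # initial standard deviation (from the text)
INIT_WEIGHT = 0.001   # initial weight (from the text)
MATCH_FACTOR = 2.5    # match if |X_t - mu_k| <= 2.5 * sigma_k (from the text)
MAX_MODELS = 3        # k = 3 Gaussians per pixel (from the text)


def match_or_create(models, x):
    """Return the index of the Gaussian that matches intensity x.

    `models` is a list of dicts for one pixel (hypothetical layout).
    If no model matches, a new one is created with the initial values.
    """
    for k, model in enumerate(models):
        if abs(x - model["mu"]) <= MATCH_FACTOR * model["sigma"]:
            model["gamma"] = x  # remember the most recent matched value
            return k

    new_model = {"gamma": x, "mu": x, "sigma": INIT_SIGMA, "weight": INIT_WEIGHT}
    if len(models) < MAX_MODELS:
        models.append(new_model)
        return len(models) - 1

    # Assumption: when all slots are used, replace the lowest-weight model.
    weakest = min(range(len(models)), key=lambda k: models[k]["weight"])
    models[weakest] = new_model
    return weakest


pixel_models = []
first = match_or_create(pixel_models, 100)   # empty set, so a model is created
second = match_or_create(pixel_models, 170)  # |170 - 100| = 70 <= 75, matches
third = match_or_create(pixel_models, 180)   # |180 - 100| = 80 > 75, new model
```

In this toy run the first two observations fall into the same Gaussian, while the third opens a second one.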
To obtain a gray scale background frame, background modeling [3–7] is applied after converting each color frame into a gray scale image. Then, the color video frame at time t is converted into a gray image I(t), and the difference between I(t) and the corresponding gray background frame B(t) obtained by the background modeling is computed. A pixel is considered a foreground pixel and set to one if the absolute intensity difference between I(t) and B(t) at that pixel is greater than or equal to a certain threshold (Thr1). Otherwise, it is regarded as a background pixel and set to zero. In this way, a foreground pixel \(G_{i,j}(t)\) is obtained as follows
\[ G_{i,j}(t) = \begin{cases} 1, & \text{if } |I_{i,j}(t) - B_{i,j}(t)| \ge Thr1 \\ 0, & \text{otherwise} \end{cases} \]
where (i, j) is the pixel position. The value of Thr1 is set to 20 to ignore subtle changes between background and foreground; setting this threshold to 20 to separate objects from the background is common practice, as mentioned in [4]. After that, the total number of non-zero pixels in \(G_{i,j}(t)\) is used as the area of foreground object feature G(t), which is obtained by the following equation
\[ G(t) = \sum_{i=1}^{r} \sum_{j=1}^{c} G_{i,j}(t) \]
where r and c represent the number of rows and columns of the frame respectively. According to the psychological theories of human attention, motion information is more significant than static attention clues [2]. Therefore, motion information is included in the proposed method in addition to the foreground object.
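The thresholding and pixel counting described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the background frame is taken as given (the background modeling itself is not reproduced), and the function names are hypothetical.

```python
import numpy as np

THR1 = 20  # foreground threshold stated in the text


def foreground_mask(gray_frame, gray_background, thr=THR1):
    """Binary mask: 1 where |I - B| >= thr, otherwise 0."""
    # Cast to a signed type so that uint8 subtraction cannot wrap around.
    diff = np.abs(gray_frame.astype(np.int32) - gray_background.astype(np.int32))
    return (diff >= thr).astype(np.uint8)


def foreground_area(mask):
    """Area feature: the number of non-zero pixels in the mask."""
    return int(np.count_nonzero(mask))


# Toy 3x3 example with a flat background of intensity 100.
background = np.full((3, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[0, 0] = 130  # differs by 30: foreground
frame[1, 1] = 119  # differs by 19: background
frame[2, 2] = 80   # differs by 20: foreground (threshold is inclusive)

mask = foreground_mask(frame, background)
area = foreground_area(mask)
```

The signed cast matters in practice: subtracting two `uint8` images directly would wrap negative differences around to large positive values.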
3.2 Motion Information Calculation in Spatial Domain
Humans usually concentrate more on the movements of objects in a video [2]. To obtain object motion information in the spatial domain, the frame-to-frame difference is applied. Consider two consecutive frames \(I(t-1)\) and I(t) at times \(t-1\) and t in a video. To find the spatial motion information, the color difference between these frames is calculated in the red, green, and blue channels. If the difference at a pixel in each of the three channels is greater than or equal to a threshold (T2), the pixel is considered a motion pixel and set to one. Otherwise, the pixel is assumed to contain no motion information and is set to zero. Therefore, the spatial motion information \(S_{i,j}(t)\) at pixel (i, j) at time t can be obtained by the following equation
\[ S_{i,j}(t) = \begin{cases} 1, & \text{if } |I_{i,j}^{r}(t) - I_{i,j}^{r}(t-1)| \ge T2,\ |I_{i,j}^{g}(t) - I_{i,j}^{g}(t-1)| \ge T2, \text{ and } |I_{i,j}^{b}(t) - I_{i,j}^{b}(t-1)| \ge T2 \\ 0, & \text{otherwise} \end{cases} \]
where \(I_{i,j}^{r}\), \(I_{i,j}^{g}\), and \(I_{i,j}^{b}\) represent the red, green, and blue channel values at pixel (i, j) respectively. It is common practice to use 20 as the threshold for detecting changes between two consecutive frames [4]; therefore, the value of T2 is set to 20. The spatial motion information S(t) at time t is obtained by summing all values of \(S_{i,j}(t)\) as follows
\[ S(t) = \sum_{i=1}^{r} \sum_{j=1}^{c} S_{i,j}(t) \]
where r and c represent the number of rows and columns of the frame respectively. However, motion extracted in the spatial domain is not robust to diffuse phenomena [24]. For example, it does not work well when global illumination changes occur. Additionally, spatial motion estimation is prone to local inaccuracies and small motion discontinuities [25].
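A minimal NumPy sketch of the spatial motion feature follows. It assumes that a pixel counts as a motion pixel only when all three color channels change by at least T2, which is one reading of the description above; the function name and the toy frames are hypothetical.

```python
import numpy as np

T2 = 20  # spatial motion threshold stated in the text


def spatial_motion(prev_rgb, curr_rgb, thr=T2):
    """Return the binary motion mask and its sum S(t) for two RGB frames."""
    # Signed cast avoids uint8 wrap-around in the subtraction.
    diff = np.abs(curr_rgb.astype(np.int32) - prev_rgb.astype(np.int32))
    # Assumption: all three channels must change by at least thr.
    motion = np.all(diff >= thr, axis=-1)
    return motion.astype(np.uint8), int(motion.sum())


# Toy 2x2 RGB frames: one pixel changes in every channel, one in only two.
prev_frame = np.zeros((2, 2, 3), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[0, 0] = (25, 30, 40)  # every channel changes by >= 20: motion
curr_frame[1, 1] = (25, 5, 40)   # green changes by only 5: no motion

motion_mask, s_t = spatial_motion(prev_frame, curr_frame)
```

If the intended rule were instead that a change in any single channel suffices, `np.all` would be replaced by `np.any`.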
3.3 Motion Estimation in Frequency Domain
To overcome these problems of spatial motion calculation, motion information is also calculated in the frequency domain. Motion estimated in the frequency domain has some advantages over the spatial domain [24]: it is robust to global changes of illumination and to motion estimation near object boundaries. To obtain motion information in the frequency domain, each frame is divided into blocks of \(16 \times 16\) pixels. Then, the phase correlation technique [8] is applied between the current block and its co-located reference block. The phase correlation peak \(\beta\), which indicates the magnitude of the motion accuracy, is used as the motion indicator for that block. The phase difference \(\phi\) is calculated between the current block and its co-located reference block after applying the fast Fourier transform (FFT) to each block. The inverse FFT is then performed on the calculated phase difference, and finally the two-dimensional (2-D) motion vector (dx, dy) is obtained [26]. The resulting 2-D phase correlation surface contains peaks \(\beta\) where there are shifts between the current and reference blocks.
where \(F_{r}\) and \(F_{c}\) represent the FFT of the current and reference block respectively.
where b is the block size; for example, b is 16 if \(16 \times 16\) blocks are used. If the value of \(\beta\) of a block is greater than a threshold (T3), the block is considered to contain sufficient motion information. In this method, the value of T3 is set to 0.6. Denoting by \(\beta_{n,m}(t)\) the peak of block (n, m) at time t, all the values greater than T3 are summed to obtain the motion information F(t) in the frequency domain.
\[ F(t) = \sum_{n=1}^{N} \sum_{m=1}^{M} \beta_{n,m}(t) \quad \text{for } \beta_{n,m}(t) > T3 \]
where N and M represent the number of block rows (rows/b) and block columns (columns/b) of the frame respectively. The motion information obtained in the frequency domain by applying the phase correlation technique to different blocks of frame no 3869 of the bl-14 video is shown in Fig. 2. Block (4,4), which contains no motion, shows only a single highest peak (Fig. 2(b)). Single motion (block (10,14)) and complex motion (block (5,8)) are shown in Fig. 2(c) and (d) respectively. However, the frequency-based motion estimation approach suffers from a localization problem [24]. Therefore, the motions obtained in both the spatial and frequency domains are combined with the foreground objects for video summarization.
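The block-wise procedure can be sketched with the textbook normalized phase correlation, shown below with NumPy. This is an illustrative formulation, not necessarily the exact equations of [8]: the normalization, the small constant `eps`, and the function names are assumptions. Under this formulation a block that does not move at all also yields a peak close to 1 at zero shift; the text does not state how such blocks are treated, so the sketch applies the stated rule (sum every peak above T3) literally.

```python
import numpy as np

BLOCK = 16  # block size b stated in the text
T3 = 0.6    # peak threshold stated in the text


def phase_correlation_peak(cur_block, ref_block, eps=1e-12):
    """Return (peak value, (dy, dx) peak position) for two equal-size blocks."""
    f_cur = np.fft.fft2(cur_block.astype(np.float64))
    f_ref = np.fft.fft2(ref_block.astype(np.float64))
    cross = f_cur * np.conj(f_ref)
    # Keep only the phase difference by normalizing the magnitude to one.
    cross = cross / np.maximum(np.abs(cross), eps)
    surface = np.real(np.fft.ifft2(cross))
    idx = np.unravel_index(np.argmax(surface), surface.shape)
    peak_pos = tuple(int(i) for i in idx)
    return float(surface[idx]), peak_pos


def frequency_motion(cur_frame, ref_frame, block=BLOCK, thr=T3):
    """Sum the peaks above thr over all co-located block pairs."""
    rows, cols = cur_frame.shape
    total = 0.0
    for y in range(0, rows - block + 1, block):
        for x in range(0, cols - block + 1, block):
            beta, _ = phase_correlation_peak(
                cur_frame[y:y + block, x:x + block],
                ref_frame[y:y + block, x:x + block],
            )
            if beta > thr:
                total += beta
    return total


# A random block circularly shifted by (2, 3) gives one sharp peak at (2, 3).
rng = np.random.default_rng(0)
ref = rng.random((BLOCK, BLOCK))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))
beta, shift = phase_correlation_peak(cur, ref)
```

A pure circular shift is the ideal case; for real video blocks, content entering and leaving the block spreads energy over the surface and lowers the peak.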
3.4 Fusion of Foreground and Motion Information
In order to select more accurate frame sequences, both the area of foreground objects and the motion information are combined. In this approach, a weighted linear fusion is applied to combine the features for ranking each frame according to its representativeness in a video. Before applying the fusion, each feature is z-score normalized using the following equation
\[ Z(t) = \frac{X(t) - \mu_{f}}{\sigma_{f}} \]
where X(t) is a feature value at time t, \(\mu_{f}\) is the mean, and \(\sigma_{f}\) is the standard deviation of the feature values. The z-score Z(t) is the normalized form of X(t). In this scheme, z-score normalization is preferred because it produces meaningful information about each data point and provides better results in the presence of outliers than min-max normalization [27]. The weighted linear fusion is obtained as follows
\[ A(t) = W_{1} Z_{G}(t) + W_{2} Z_{S}(t) + W_{3} Z_{F}(t) \]
where A(t) is the fusion value, and \(Z_{G}(t)\), \(Z_{S}(t)\), and \(Z_{F}(t)\) are the z-score normalized foreground feature (G(t)), spatial motion feature (S(t)), and frequency domain motion feature (F(t)) respectively at time t. Empirically, setting the weights \(W_{1}\), \(W_{2}\), and \(W_{3}\) to 15, 60, and 25 respectively provides better results for all videos in the BL-7F dataset. The rationale for giving higher weight to the motion features than to the foreground area is that, according to the psychological theories of human attention, motion information is more significant than static attention clues [2]. After that, A(1), A(2), A(3), ..., A(T) are sorted in descending order, where T is the total number of frames in a video.
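The normalization and fusion step can be sketched as follows. The weights 15, 60, and 25 are those reported in the text; the use of the population standard deviation, the handling of a constant feature, and the toy feature values are assumptions made for illustration.

```python
import numpy as np

W1, W2, W3 = 15, 60, 25  # weights for foreground, spatial, frequency features


def z_score(values):
    """Z-score normalize a 1-D sequence of per-frame feature values."""
    values = np.asarray(values, dtype=np.float64)
    std = values.std()  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant feature carries no ranking information.
        return np.zeros_like(values)
    return (values - values.mean()) / std


def fuse(g, s, f, weights=(W1, W2, W3)):
    """Weighted linear fusion A(t) of the three normalized features."""
    w1, w2, w3 = weights
    return w1 * z_score(g) + w2 * z_score(s) + w3 * z_score(f)


# Hypothetical per-frame features for a four-frame clip.
G = [0, 10, 20, 10]        # foreground area
S = [0, 0, 50, 10]         # spatial motion
F = [0.0, 0.7, 2.1, 0.0]   # frequency domain motion

A = fuse(G, S, F)
ranking = np.argsort(-A)   # frame indices in descending order of A(t)
```

Because only the ordering of A(t) is used afterwards, rescaling the three weights by a common factor would not change the ranking.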
3.5 Video Summary Generation
The proposed method summarizes a video based on a threshold (\(Thr_{kf}\)) provided by the user. From the sorted list A(1), A(2), A(3), ..., A(T), the top \(Thr_{kf}\) frames are selected. A video summary is generated from the selected frames by restoring their chronological order.
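The selection step amounts to taking the top-ranked frames and putting them back in time order. A minimal sketch in plain Python follows; the function name and the tie-breaking rule (the earlier frame wins) are assumptions.

```python
def select_key_frames(scores, thr_kf):
    """Pick the thr_kf highest-scoring frames and return them in time order."""
    # Rank frame indices by descending score; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
    # Restore chronological order for the final summary.
    return sorted(ranked[:thr_kf])


# Hypothetical fusion values A(t) for a five-frame clip.
fusion_scores = [0.2, 1.5, -0.3, 2.4, 0.9]
summary = select_key_frames(fusion_scores, 3)
```

Here frames 3, 1, and 4 have the highest scores, so the summary lists them chronologically as 1, 3, 4.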
4 Results and Discussion
The proposed method is applied to the publicly available BL-7F dataset [1], which contains 19 surveillance videos. A complete list of ground truth key frames for each video is also provided in the BL-7F dataset. The foreground object, spatial motion, and frequency domain motion information extracted by the proposed method are shown in Fig. 3(d), (e), and (f) respectively.
In Fig. 4, a number of ground truth frames of the bl-11 video of the BL-7F dataset [1] are shown together with the results obtained by the GMM based method and by the proposed method. The GMM based method fails to select frames 9963 and 12523 even though they contain significant content. In contrast, the proposed method selects these frames successfully. The main reason for this success is that the proposed method combines the area of foreground objects with frequency and spatial motion information.
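In this experiment, the quality of the summarized video is estimated by precision, recall, and F1-measure, obtained using the following equations
\[ Precision = \frac{T_{p}}{T_{p} + F_{p}}, \qquad Recall = \frac{T_{p}}{T_{p} + F_{n}}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \]
where \(T_{p}\), \(F_{p}\), and \(F_{n}\) indicate the numbers of true positives, false positives, and false negatives respectively.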
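The three measures can be computed from a set of selected frames and a set of ground truth key frames as sketched below. Matching frames by exact index is an assumption made for illustration; the text does not state how a selected frame is matched to a ground truth frame.

```python
def precision_recall_f1(selected, ground_truth):
    """Return (precision, recall, F1) for two collections of frame indices."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)   # selected and in the ground truth
    fp = len(selected - ground_truth)   # selected but not in the ground truth
    fn = len(ground_truth - selected)   # ground truth frames that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 selected frames, 3 ground truth frames, 2 in common.
p, r, f1 = precision_recall_f1([1, 3, 4, 7], [1, 3, 5])
```

In this example the precision is 2/4, the recall is 2/3, and the F1-measure is 4/7.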
The proposed approach is compared with the single-view video summarization results provided by the GMM based method [1]. The GMM based method is the most relevant state-of-the-art method for comparison because both techniques use Gaussian mixture models for training. However, there are two key differences between the GMM based method [1] and the Gaussian mixture-based parametric BGM [4]. Firstly, the GMM based method works at the frame level, whereas the Gaussian mixture-based BGM works at the pixel level. Secondly, the GMM based method uses a color descriptor as its feature, while the Gaussian mixture-based parametric BGM uses pixel intensity. Another reason for comparing with the GMM based method is that both techniques use the standard BL-7F dataset to verify performance.
In the proposed method, the summarization threshold \(Thr_{kf}\) is set to the total number of key frames suggested by the ground truth for each video. Empirically, the scheme generates more accurate results if \(Thr_{kf} \times 1.35\) frames are selected from the sorted list A(1), A(2), A(3), ..., A(T), where T is the total number of frames in a video. The precision, recall, and F1-measure of the GMM based method and the proposed method are shown in Table 1. Table 1 shows that the mean values of the precision, recall, and F1-measure of the GMM based method are 59.1, 82.2, and 0.63 respectively. The proposed method, on the other hand, shows enhanced performance compared to the GMM based method, with mean precision (83.6), recall (94.2), and F1-measure (0.88). In addition, the standard deviations (STDs) of the precision, recall, and F1-measure of the proposed method are much lower than those of the GMM based method (see Table 1). This indicates that the proposed method not only achieves higher accuracy but also performs more consistently across the different videos than the GMM based method [1].
The F1-measures of the GMM based method (F-GMM) and of the proposed method using only foreground objects (f-foreground), only spatial motion (f-spatial), only motion in the frequency domain (f-frequency), and the combination of all these features (f-proposed) are plotted in Fig. 5. The graph shows that the proposed method using only foreground objects performs worse than the GMM based method on the bl-2, bl-4, and bl-12 videos. If only spatial motion is considered, it fails to outperform the GMM based method on the bl-12 and bl-17 videos. Likewise, the proposed method applying only the phase correlation technique shows poor results on the bl-2, bl-12, and bl-17 videos. Therefore, the proposed approach combines all these features and outperforms the GMM based method on 18 videos. However, the proposed technique fails to perform better than the GMM based method [1] on the bl-12 video. The reasons for this failure were explored by examining the key frames extracted by the proposed method for the bl-12 video. In this video, some frames with significant objects and motion are not selected as ground truth frames by [1], while some frames without foreground objects and/or motion are considered ground truth. For example, frames 4083, 4120, and 4563 contain a sufficient amount of object and motion, as shown in the first row of Fig. 6. In these frames, it is clearly visible that a person is working near the door. However, these frames are not selected as ground truth (key frames). On the other hand, there is no object or significant motion in frames 12615, 12675, and 12750 (see the second row of Fig. 6). Nonetheless, they are selected as key frames (ground truth). No explanation of this is given in [1].
After evaluating the proposed method quantitatively and qualitatively, it is revealed that the proposed method, based on foreground objects and motion in the spatial and frequency domains, performs better than the state-of-the-art GMM based method [1].
5 Conclusion
In this paper, a novel framework is proposed to summarize surveillance video by combining foreground objects with motion information in the spatial and frequency domains. According to [1], foreground objects usually contain detailed information about the video content. Moreover, human beings naturally give more attention to object motion in a video [2]. Therefore, these two important properties of a video are included in this approach. To include motion information in the frequency domain, the phase correlation technique [8] is applied. To the best of our knowledge, the phase correlation technique is applied here for the first time to video summarization. Extensive experimental results reveal that the proposed method outperforms the state-of-the-art method.
References
1. Ou, S., Lee, C., Somayazulu, V., Chen, Y., Chien, S.: On-line multi-view video summarization for wireless video sensor network. IEEE J. Sel. Top. Sig. Process. 9, 165–179 (2015)
2. Gao, D., Mahadevan, V., Vasconcelos, N.: On the plausibility of the discriminant center-surround hypothesis for visual saliency. J. Vis. 8, 118 (2008)
3. Paul, M., Lin, W., Lau, C., Lee, B.: Explore and model better I-frames for video coding. IEEE Trans. Circuits Syst. Video Technol. 21, 1242–1254 (2011)
4. Haque, M., Murshed, M., Paul, M.: A hybrid object detection technique from dynamic background using Gaussian mixture models. In: IEEE 10th Workshop on Multimedia Signal Processing, pp. 915–920 (2008)
5. Chakraborty, S., Paul, M.: A novel video coding scheme using a scene adaptive non-parametric background model. In: IEEE 16th International Workshop on Multimedia Signal Processing, pp. 1–6 (2014)
6. Paul, M., Evans, C.J., Murshed, M.: Disparity-adjusted 3D multi-view video coding with dynamic background modelling. In: IEEE International Conference on Image Processing, pp. 1719–1723 (2013)
7. Paul, M.: Efficient video coding using optimal compression plane and background modelling. IET Image Process. 6, 1311–1318 (2012)
8. Paul, M., Lin, W., Lau, C.T., Lee, B.-S.: Direct intermode selection for H.264 video coding using phase correlation. IEEE Trans. Image Process. 20, 461–473 (2011)
9. Paul, M., Sorwar, G.: An efficient video coding using phase-matched error from phase correlation information. In: IEEE 10th Workshop on Multimedia Signal Processing, pp. 378–382 (2008)
10. Lee, Y.J., Ghosh, J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1346–1353 (2012)
11. Lu, Z., Grauman, K.: Story-driven summarization for egocentric video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2714–2721 (2013)
12. Xu, J., Mukherjee, L., Li, Y., Warner, J., Rehg, J.M., Singh, V.: Gaze-enabled egocentric video summarization via constrained submodular maximization. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2235–2244 (2015)
13. Gygli, M., Grabner, H., Van Gool, L.: Video summarization by learning submodular mixtures of objectives. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3090–3098 (2015)
14. Liu, Y., Liu, H., Sun, F.: Outlier-attenuating summarization for user-generated-video. In: IEEE International Conference on Multimedia and Expo, pp. 1–6 (2014)
15. Khosla, A., Hamid, R.: Large-scale video summarization using web-image priors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2698–2705 (2013)
16. Evangelopoulos, G.: Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans. Multimed. 15, 1553–1568 (2013)
17. Tsai, C., Kang, L.: Scene-based movie summarization via role-community networks. IEEE Trans. Circuits Syst. Video Technol. 23, 1927–1940 (2013)
18. Sawada, T., Toyoura, M., Mao, X.: Film comic generation with eye tracking. In: Wang, M., Mei, T., Sebe, N., Yan, S., Hong, R., Gurrin, C., Li, S., Saddik, A. (eds.) MMM 2013, Part I. LNCS, vol. 7732, pp. 467–478. Springer, Heidelberg (2013)
19. Fu, W., Wang, J., Zhao, C., Lu, H., Ma, S.: Object-centered narratives for video surveillance. In: IEEE International Conference on Image Processing, pp. 29–32 (2012)
20. Sun, L., Ai, H., Lao, S.: The dynamic VideoBook: A hierarchical summarization for surveillance video. In: IEEE International Conference on Image Processing, pp. 3963–3966 (2013)
21. Wang, Y., Kato, J.: A distance metric learning based summarization system for nursery school surveillance video. In: IEEE International Conference on Image Processing, pp. 37–40 (2012)
22. Mehmood, I., Sajjad, M., Ejaz, W., Wook, S.: Saliency-directed prioritization of visual data in wireless surveillance networks. Inf. Fusion 24, 16–30 (2015)
23. Huang, C., Chung, P.J.: Maximum a posteriori probability estimation for online surveillance video synopsis. IEEE Trans. Circuits Syst. Video Technol. 24, 1417–1429 (2014)
24. Ahuja, N., Briassouli, A.: Joint spatial and frequency domain motion analysis. In: International Conference on Automatic Face and Gesture Recognition, pp. 203–208 (2006)
25. Briassouli, A., Ahuja, N.: Integration of frequency and space for multiple motion estimation and shape-independent object segmentation. IEEE Trans. Circuits Syst. Video Technol. 18(5), 657–669 (2008)
26. Paul, M., Frater, M.R., Arnold, J.F.: An efficient mode selection prior to the actual encoding for H.264/AVC Encoder. IEEE Trans. Multimed. 11, 581–588 (2009)
27. Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)
© 2016 Springer International Publishing Switzerland
Cite this paper
Salehin, M.M., Paul, M. (2016). Fusion of Foreground Object, Spatial and Frequency Domain Motion Information for Video Summarization. In: Huang, F., Sugimoto, A. (eds) Image and Video Technology – PSIVT 2015 Workshops. PSIVT 2015. Lecture Notes in Computer Science(), vol 9555. Springer, Cham. https://doi.org/10.1007/978-3-319-30285-0_26
DOI: https://doi.org/10.1007/978-3-319-30285-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30284-3
Online ISBN: 978-3-319-30285-0