Elsevier

Neurocomputing

Volume 204, 5 September 2016, Pages 87-96

Deformable object tracking with spatiotemporal segmentation in big vision surveillance

https://doi.org/10.1016/j.neucom.2015.07.149

Abstract

The rapid development of worldwide networks has raised the traditional challenges in vision surveillance to a big-data level. Accordingly, video processing technologies must also focus on new big-vision problems such as efficient content understanding. As a fundamental and indispensable pre-step for high-level video analysis, e.g. behavior recognition for social security, accurate and robust object tracking plays an essential role because of its capability of extracting salient information from the captured video dataset. Due to the complexity of realistic application environments, accurate and robust tracking is not easy because the object appearance may change continually while the object moves; for deformable objects in particular, it is difficult for the designed appearance model to adapt to heavy shape variations such as rotation or distortion. In this paper, a novel object tracking method based on spatiotemporal segmentation is proposed to handle the drastic appearance changes of deformable objects. By using the motion information between consecutive frames, the irregular areas of the deformable object can be segmented more accurately through energy function optimization with boundary convergence. The segmented areas are then used as learning samples for a structural SVM to achieve more effective online tracking. Evaluated on a standard benchmark database containing the challenges of heavy intrinsic variations and occlusions, the proposed tracker demonstrates a significant improvement in accuracy and robustness compared with other state-of-the-art tracking approaches.

Introduction

In the past decades, different types of video surveillance systems have demonstrated their effectiveness for public security all around the world. With the rapid development of HDTV and mobile networks, video data volume and resolution have been growing at an incredible speed, which forces most current surveillance systems [1], [2] to face challenges similar to those introduced by big data analytics in other areas, such as data storage and information retrieval for the obtained big videos [3]. In recent years, the content-based learning-retrieval mechanism [4] has been verified as a promising solution for 'big' video data analysis, in which the learning samples cannot be extracted offline as is usually done on normal video datasets. Thus, employing automatic object tracking to obtain salient features for learning sample generation is a common step in many learning strategies. Specifically, the more accurate the tracking used for learning sample generation, the more satisfactory the retrieval result. Therefore, visual tracking remains an essential research topic for handling the challenging problems of the coming big data age.

Tracking a rigid-shape object in a simple surveillance environment, such as a car running on a highway, has been resolved in different satisfactory ways [5], [6], [7], [8]. However, tracking deformable objects in realistic scenarios is still hard because the target appearance may change constantly during motion [9]; for an irregularly shaped object in particular, the challenges mainly come from intrinsic variations such as distortion, rotation and scaling [10]. In order to adapt effectively to such target appearance changes, the most popular way of conducting online tracking is to update the appearance model and keep it suitable for distinguishing the object from the background on-the-fly. In general, because the learner has deficient knowledge before learning starts, noisy constraints in the training data lead to performance degeneration when updating classifiers. Hence, in order to introduce more rational relations among training data for classifier updates, Kalal et al. proposed a P–N learning scheme and applied it to the problem of online tracking-by-learning [11]. P–N learning establishes a novel structure over the training samples by exploiting positive and negative constraints, which restrict the data labeling operation [3]. This framework also helps to guide the design of more sophisticated structural constraints that can fulfill the requirements of learning stability. However, its limitation lies in the use of inaccurate positive samples drawn from background areas inside the target bounding box. In addition, tracking failure is still hard to avoid in many cases because of inaccurate motion estimation between unreliable consecutive frames.
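To make the role of these constraints concrete, a minimal conceptual sketch is given below. It is our own illustration rather than Kalal's implementation: the Box type, the overlap() measure and the thresholds are assumed for exposition only, and the P-expert/N-expert corrections are reduced to a simple overlap test against the validated trajectory.

// Conceptual sketch of P-N constraints (not Kalal's original code).
// Assumption: detections and the tracked trajectory are axis-aligned boxes;
// the thresholds 0.7 and 0.2 are illustrative only.
#include <algorithm>
#include <vector>

struct Box { float x, y, w, h; };
struct Sample { Box box; int label; };   // +1 positive, -1 negative

// Hypothetical overlap measure (intersection over union) between two boxes.
float overlap(const Box& a, const Box& b) {
    float x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
    float x2 = std::min(a.x + a.w, b.x + b.w), y2 = std::min(a.y + a.h, b.y + b.h);
    float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Detections that overlap the validated trajectory are labeled positive
// (P-constraint); detections far from it are labeled negative (N-constraint).
// The corrected set is then fed back to the online classifier.
std::vector<Sample> applyPNConstraints(const std::vector<Box>& detections,
                                       const Box& trajectory) {
    std::vector<Sample> corrected;
    for (const Box& d : detections) {
        float o = overlap(d, trajectory);
        if (o > 0.7f)      corrected.push_back({d, +1});   // P-constraint
        else if (o < 0.2f) corrected.push_back({d, -1});   // N-constraint
        // samples in between are left unlabeled (ambiguous)
    }
    return corrected;
}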

Similar to Kalal's tracker, many other online tracking strategies utilize regular geometric shapes such as bounding boxes or ellipses to represent the appearance of the target. These regular-shape based methods can robustly track targets of fixed shape, such as human heads or cars, but tracking failure often happens when handling irregularly shaped targets with good accuracy, especially when the targets undergo heavy partial occlusion or intrinsic variations. To overcome this limitation, Kwon et al. proposed an approach that uses a pre-defined collection of bounding boxes to represent different parts of the target [12] for articulated object tracking. With the help of an adaptive Basin Hopping Monte Carlo (BHMC) sampling method, Kwon's approach can automatically update the target's dynamic appearance changes and geometric relations over time. Likewise, Yao et al. utilized a global object box and a set of part boxes as an appearance model to approximate the irregularly shaped object [13]. With an online two-stage training mechanism to learn the parameters of the part-based model, Yao's strategy is able to overcome the complexity problem caused by model overfitting. Different from the approaches in [12] and [13], which use regular part-based representations for a single target, Zhang et al. introduced a model-free tracker that simultaneously tracks multiple objects by combining multiple single-object trackers via constraints on the spatial structure of the objects. This structure-preserving tracking approach shows an obvious improvement in multi-object tracking by using an online structured SVM algorithm similar to that of [13].

In many realistic situations, unfortunately, such a pre-defined part representation limits the extendability and generic applicability of those methods [14], especially when the target is composed of several objects, e.g. the motor rider shown in Fig. 5, which is difficult to represent effectively with discrete rectangles. Thus, a more accurate representation, such as a continuous contour that smoothly approximates the target's shape, would be a possible way to represent the target. Sun et al. proposed a supervised level set model for tracking [15] in order to obtain more precise convergence to the target during tracking. With specific knowledge of the target region and edge cues, the contour curve converges to the candidate area with maximum likelihood in a Bayesian manner. Recently, Hough-transform based approaches have attracted attention for overcoming the limitations of fixed-shape sets for irregular representation [16]. Barinova et al. proposed a probabilistic framework for detecting multiple object instances in the Hough transform domain [17], [18], and the main idea of this research also inspired the subsequent work by Godec et al. [10]. In Godec's tracker, GMM-based segmentation [19] is incorporated into a Hough forest learning framework for irregularly shaped target tracking [16]. With back-projection supporting the online tracking process, Godec's tracker outperforms much state-of-the-art work based on fixed-shape appearance representation. However, the heavy reliance on the discrimination of color Gaussian kernels makes the segmentation results unpredictable, especially when intrinsic changes happen against a uniformly colored background containing obvious edges.

Aiming at the problem of learning data generation, in this paper we propose a novel motion-appearance model to achieve accurate spatiotemporal segmentation for online tracking of deformable objects. Compared with spatially guided segmentation [10], which depends only on the texture/pixel information within an individual frame, the proposed model is able to segment the target areas more precisely with the help of motion information between consecutive frames, especially when the textures of the background and target are similar. The segmented areas provide more precise samples for online model updating. To effectively describe the appearance of a deformable object, we utilize the structural SVM [20] to construct an online tracking framework, which is more accurate than employing only a fixed rectangle for target appearance modeling in the spatial domain. The proposed tracker shows more robustness in many challenging scenes, including rotation, intrinsic compression/stretching and aspect ratio changes, for vision-based surveillance applications.

The rest of the paper is organized as follows: Section 2 introduces the proposed spatiotemporal segmentation with the motion-appearance model, and Section 3 briefly introduces the online learning tracking framework. Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.

Section snippets

Learning samples generation by spatiotemporal segmentation

Learning sample generation is an important issue for most online tracking-by-learning strategies [11]. For fixed-shape based approaches, the main challenge is that the correctness of the samples generated for online learning is hard to guarantee due to the noise in the coarse data within the annotated bounding boxes. The purpose of the proposed segmentation model is to separate the foreground from the background with a fine, smooth contour. The accurately separated foreground areas can provide higher
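The following sketch illustrates one plausible realization of this idea with standard OpenCV primitives. It is our own illustration under stated assumptions, not the authors' released code: dense optical flow between consecutive frames supplies the motion cue, and an iterated graph-cut energy minimization (cv::grabCut, cf. Rother et al.) refines the boundary inside the tracked box; the flow threshold and the roi box are assumed inputs.

// Illustrative sketch of motion-seeded segmentation (our reading of the idea).
// Assumptions: frameBGR is an 8-bit color frame, prevGray/gray are consecutive
// grayscale frames, roi is the current tracked bounding box.
#include <opencv2/opencv.hpp>

cv::Mat segmentTarget(const cv::Mat& prevGray, const cv::Mat& gray,
                      const cv::Mat& frameBGR, const cv::Rect& roi,
                      float flowThresh = 1.0f) {
    // 1. Motion cue: dense optical flow magnitude between consecutive frames.
    cv::Mat flow;
    cv::calcOpticalFlowFarneback(prevGray, gray, flow, 0.5, 3, 15, 3, 5, 1.2, 0);
    std::vector<cv::Mat> fxy;
    cv::split(flow, fxy);
    cv::Mat mag;
    cv::magnitude(fxy[0], fxy[1], mag);

    // 2. Initialize the label mask: everything outside the box is background;
    //    inside the box, moving pixels are marked as probable foreground.
    cv::Mat mask(frameBGR.size(), CV_8UC1, cv::Scalar(cv::GC_BGD));
    mask(roi).setTo(cv::GC_PR_BGD);
    cv::Mat moving = mag > flowThresh;                    // 8-bit motion mask
    mask(roi).setTo(cv::GC_PR_FGD, moving(roi));

    // 3. Energy minimization with boundary convergence via iterated graph cuts.
    cv::Mat bgdModel, fgdModel;
    cv::grabCut(frameBGR, mask, roi, bgdModel, fgdModel, 3, cv::GC_INIT_WITH_MASK);

    // 4. Binary foreground region, usable as positive learning samples.
    return (mask == cv::GC_FGD) | (mask == cv::GC_PR_FGD);
}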

Structural tracking-learning strategy

The proposed tracking framework is shown in Fig. 1. In the initialization step, we generate training samples around the target region and obtain a robust SVM classifier as the target detector. In the target prediction step, we evaluate each candidate target with the trained detector and take the sample with the local maximum prediction value as the initial tracking result. Then we compute the target displacement between two consecutive frames as the motion estimation, and the final
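A minimal sketch of this prediction step is given below, in which a plain linear SVM response stands in for the structural SVM of [20]; the feature extractor, search radius and step are placeholders chosen for illustration only, not the paper's actual parameters.

// Illustrative sketch of the prediction step (our illustration, assuming a
// linear SVM with weight vector w and bias b already trained on patch features).
#include <limits>
#include <opencv2/opencv.hpp>

// Hypothetical feature: a flattened, normalized gray patch; a placeholder for
// the structural features actually used in the paper.
cv::Mat extractFeature(const cv::Mat& patch) {
    cv::Mat gray, f;
    cv::cvtColor(patch, gray, cv::COLOR_BGR2GRAY);
    cv::resize(gray, gray, cv::Size(32, 32));
    gray.convertTo(f, CV_32F, 1.0 / 255.0);
    return f.reshape(1, 1);                               // 1 x 1024 row vector
}

// Scan candidate boxes around the previous target position, keep the one with
// the maximum classifier response, and report its displacement as the motion
// estimate between the two consecutive frames.
cv::Rect predictTarget(const cv::Mat& frame, const cv::Rect& prevBox,
                       const cv::Mat& w, float b,
                       int radius, int step, cv::Point2f& displacement) {
    cv::Rect best = prevBox;
    float bestScore = -std::numeric_limits<float>::max();
    cv::Rect imageRect(0, 0, frame.cols, frame.rows);
    for (int dy = -radius; dy <= radius; dy += step) {
        for (int dx = -radius; dx <= radius; dx += step) {
            cv::Rect cand = prevBox + cv::Point(dx, dy);
            if ((cand & imageRect) != cand) continue;     // skip boxes leaving the frame
            float score = static_cast<float>(w.dot(extractFeature(frame(cand)))) + b;
            if (score > bestScore) { bestScore = score; best = cand; }
        }
    }
    displacement = cv::Point2f(static_cast<float>(best.x - prevBox.x),
                               static_cast<float>(best.y - prevBox.y));
    return best;
}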

Experiments

The proposed work is implemented in C++ with the Intel OpenCV library, and tested on a workstation with an Intel Core i7 3.4 GHz processor and 4.0 GB RAM. We evaluate our tracking approach on video sequences from the object tracking benchmark in [8], which contains scenarios including surveillance environments (Couple, Crossing and Woman), human faces (Boy, David, Faceocc2, Trellis and Shaking) and other types of cases (Mountainbike and Singer2). These videos are utilized to
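The benchmark in [8] is commonly evaluated with an overlap-based success criterion; the sketch below shows that criterion only as an illustration of the protocol (the 0.5 threshold is the conventional choice) and is not taken from the paper itself.

// Illustrative sketch of the overlap (intersection-over-union) success rate
// used with benchmarks such as [8]; sequence loading and plotting omitted.
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

// Fraction of frames whose tracked box overlaps the ground-truth box by more
// than a given threshold.
double successRate(const std::vector<cv::Rect>& tracked,
                   const std::vector<cv::Rect>& groundTruth,
                   double threshold = 0.5) {
    const size_t n = std::min(tracked.size(), groundTruth.size());
    int ok = 0;
    for (size_t i = 0; i < n; ++i) {
        double inter = (tracked[i] & groundTruth[i]).area();
        double uni = tracked[i].area() + groundTruth[i].area() - inter;
        if (uni > 0.0 && inter / uni > threshold) ++ok;
    }
    return n > 0 ? static_cast<double>(ok) / n : 0.0;
}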

Conclusion

Accurate object tracking can provide considerable help in salient information extraction for big vision surveillance. In this paper, we proposed a novel online tracking method for deformable objects based on accurate spatiotemporal segmentation with a motion-appearance model. By employing a structural SVM to carry out the online learning process with more precise samples from the segmented areas, the proposed tracker has shown very satisfactory accuracy in different realistic scenarios.

Acknowledgments

This research is supported by the Research Fund for the Doctoral Program of Higher Education of China (20126102120055), the National Natural Science Foundation of China (61301194, 61571362 and 61175018), and Foundation Grant 3102014JSJ0014 from NWPU.


References (45)

  • M.J. Black et al.

    The robust estimation of multiple motions: parametric and piecewise-smooth flow fields

    Comput. Vis. Image Underst. (CVIU)

    (1996)
  • C. Ding et al.

    Multi-task pose-invariant face recognition

    IEEE Trans. Image Process. (T-IP)

    (2015)
  • D. Tao et al.

    General tensor discriminant analysis and Gabor features for gait recognition

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2007)
  • C. Xu et al.

    Large-margin multi-view information bottleneck

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2014)
  • D. Tao et al.

    Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2006)
  • A. Yilmaz et al.

    Multiview Hessian regularization for image annotation

    ACM Comput. Surv.

    (2013)
  • H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: a review, Neurocomputing...
  • S. Salti et al.

    Adaptive appearance modeling for video tracking: survey and evaluation

    IEEE Trans. Image Process. (T-IP)

    (2012)
  • Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE Conference on Computer Vision and Pattern...
  • Z. Zhang et al.

    Slow feature analysis for human action recognition

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2012)
  • M. Godec, P. Roth, H. Bischof, Hough-based tracking of non-rigid objects, in: IEEE International Conference on Computer...
  • Z. Kalal et al.

    Tracking-learning-detection

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2010)
  • J. Kwon, K. Lee, Tracking of a non-rigid object via patch based dynamic appearance modeling and adaptive Basin hopping...
  • R. Yao, Q. Shi, C. Shen, Y. Zhang, A. Hengel, Part-based visual tracking with online latent structural learning, in:...
  • T. Zhou et al.

    Double shrinking sparse dimension reduction

    IEEE Trans. Image Process. (T-IP)

    (2013)
  • X. Sun, H. Yao, S. Zhang, A novel supervised level set method for non-rigid object tracking, in: IEEE Conference on...
  • B. Geng et al.

    Ensemble manifold regularization

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2012)
  • O. Barinova, V. Lempitsky, P. Kohli, On detection of multiple object instances using Hough transforms, in: IEEE...
  • D. Tao et al.

    Geometric mean for subspace selection

    IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)

    (2009)
  • C. Rother, V. Kolmogorov, A. Blake, "GrabCut": interactive foreground extraction using iterated graph cuts, in: ACM...
  • S. Branson, P. Perona, S. Belongie, Strong supervision from weak annotation: interactive training of deformable part models,...
  • A. Bruhn et al.

    Lucas/Kanade meets Horn/Schunck: combining local and global optical flow methods

    Int. J. Comput. Vis. (IJCV)

    (2005)

    Peng Zhang received the B.E. degree from the Xian Jiaotong University, China in 2001. He received his Ph.D. from Nanyang Technological University, Singapore in 2011. He is now an associate professor in School of Computer Science, Northwestern Polytechnical University, China. His current research interests include signal processing, multimedia security and pattern recognition. He is a member of IEEE.

    Tao Zhuo received the B.S. degree in Computer Science and Technology from the Xian Shiyou University, Xian, China, in 2009. In 2012 and 2016, he received the masters degree and Ph.D. degree respectively in Computer Science and Technology from Northwestern Polytechnical University, Xian, China. Currently, he is a research fellow in School of Computing, National University of Singapore. His current research interests include visual object tracking, machine learning and computer vision.

    Lei Xie received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xian, China, in 2004. He is currently a professor with School of Computer Science, Northwestern Polytechnical University, Xian, China. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a visiting scientist. From 2004 to 2006, he was a senior research associate in the Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong. From 2006 to 2007, he was a postdoctoral fellow in the Human-Computer Communications Laboratory (HCCL), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He has published more than 100 papers in major journals and proceedings, such as the IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, INFORMATION SCIENCES, PATTERN RECOGNITION, ACM/Springer Multimedia Systems, Springer Multimedia Tools and Applications, ACL, Interspeech, ICPR, and ICASSP. He has served as program chair, organizing chair, program and organizing committee members in major conferences. He is a senior member of IEEE, a member of ISCA, a member of ACM, a member of APSIPA and a senior member of China Computer Federation (CCF). He is a board-of-governor of the Chinese Information Processing Society of China (CIPSC), a board member of the APSIPA Speech, Language and Audio (SLA) technical committee, a board member of the multimedia technical committee of CCF, a board member of the multimedia technical committee of China Society of Image and Graphics (CSIG). His current research interests include speech and language processing, multimedia and human computer interaction.

    Yanning Zhang is currently a professor in the School of Computer Science, Northwestern Polytechnical University, China. She received her Ph.D. from the Northwestern Polytechnical University, China in 1996. Her current research interests are in signal processing, multimedia and computer vision. Zhang has been an active member of the technical program committee of several international conferences and a reviewer of several reputed journals and conference, such as reviewer of IEEE Transactions on Systems, Man and Cybernetics (T-SMC), Pattern Recognition Letter. She has also been the organization chair of the Ninth Asian Conference on Computer Vision (ACCV09). She is currently a member of IEEE.
