Deformable object tracking with spatiotemporal segmentation in big vision surveillance
Introduction
Over the past decades, many types of video surveillance systems have demonstrated their effectiveness for public security around the world. With the rapid development of HDTV and mobile networks, video data volume and resolution have grown at an incredible speed, so most current surveillance systems [1], [2] face challenges similar to those that big data analytics raises in other areas, such as data storage and information retrieval for the acquired big videos [3]. In recent years, the content-based learning-retrieval mechanism [4] has proven to be a promising solution for ‘big’ video data analysis, where learning samples cannot be extracted offline as is usually done for normal video datasets. Thus, employing automatic object tracking to obtain salient features for learning sample generation is common in many learning strategies. Specifically, the more accurate the tracking used for learning sample generation, the more satisfactory the retrieval result. Therefore, visual tracking remains an essential research topic for handling the challenging problems of the coming big data age.
Tracking a rigid-shaped object in a simple surveillance environment, such as a car running on a highway, has been solved satisfactorily in different ways [5], [6], [7], [8]. However, tracking a deformable object in realistic scenarios is still hard because the target appearance may change constantly during motion [9]. Especially for an irregular-shaped object, the challenges mainly come from intrinsic variations such as distortion, rotation and scaling [10]. To adapt effectively to such appearance changes, the most popular approach to online tracking is to update the appearance model on-the-fly so that it remains suitable for distinguishing the object from the background. In general, because the learner has insufficient knowledge before learning starts, noisy constraints in the training data lead to performance degradation when classifiers are updated. Hence, to introduce a more rational structure among the training data for classifier updates, Kalal et al. proposed a P–N learning scheme and applied it to the problem of online tracking-by-learning [11]. P–N learning imposes a novel structure on the training samples by exploiting positive and negative constraints, which restrict the data labeling operation [3]. This framework also helps to guide the design of more sophisticated structural constraints that fulfill the requirements of learning stability. However, its limitation lies in the use of inaccurate positive samples, which are drawn from background areas inside the target bounding box. In addition, tracking failure is still hard to avoid in many cases because of inaccurate motion estimation between unreliable consecutive frames.
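The bounding-box limitation mentioned above can be made concrete with a small sketch. It is not part of Kalal et al.'s method; it only measures how much of a tight bounding box around an irregular target is actually background, which is the source of the inaccurate positive samples. The toy mask and the helper name `background_fraction_in_box` are assumptions made for illustration.

```python
def background_fraction_in_box(mask):
    """Return the fraction of pixels inside the tight bounding box of a
    binary mask (1 = target, 0 = background) that are background."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no target pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    box_area = (bottom - top + 1) * (right - left + 1)
    target_area = sum(sum(row) for row in mask)
    return (box_area - target_area) / box_area


# Hypothetical L-shaped target: a single rectangle fits it poorly.
l_shaped_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
fraction = background_fraction_in_box(l_shaped_mask)
```

For this toy mask, more than half of the box is background, so samples drawn uniformly from the box would mostly be mislabeled as positive. A rectangular target, by contrast, gives a fraction of zero.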
Similar to Kalal's tracker, many other online tracking strategies also use regular geometric shapes, such as a bounding box or an ellipse, to represent the appearance of the target. These regular-shape based methods can robustly track targets of fixed shape, such as human heads or cars, but they often fail to track irregular-shaped targets accurately, especially when the targets undergo heavy partial occlusion or intrinsic variations. To overcome this limitation, Kwon et al. proposed an approach for articulated object tracking that uses a pre-defined collection of bounding boxes to represent different parts of the target [12]. With the help of an adaptive Basin Hopping Monte-Carlo Sampling (BHMC) method, Kwon's approach can automatically update the target's dynamic appearance changes and geometric relations over time. Likewise, Yao et al. used a global object box and a set of part boxes as an appearance model to approximate the irregular-shaped object [13]. With an online two-stage training mechanism to learn the parameters of the part-based model, Yao's strategy is able to overcome the complexity problem caused by model overfitting. Unlike the approaches in [12] and [13], which use regular part-based representations for a single target, Zhang et al. introduced a model-free tracker that simultaneously tracks multiple objects by combining multiple single-object trackers via constraints on the spatial structure of the objects. This structure-preserving tracking approach shows an obvious improvement in multi-object tracking by using an online structured SVM algorithm, similar to [13].
Unfortunately, in many realistic situations such pre-defined part representations limit the extensibility and general applicability of these methods [14], especially when the target is composed of several objects, e.g. the motorcycle rider shown in Fig. 5, which is difficult to represent effectively with discrete rectangles. Thus, a more accurate representation, such as a continuous contour that smoothly estimates the target's shape, is a possible way to represent the target. Sun et al. proposed a supervised level set model for tracking [15] in order to obtain more precise convergence to the target. With specific knowledge of the target region and edge cues, the contour curve can converge to the candidate area with maximum likelihood in a Bayesian manner. Recently, Hough-transform based approaches have received attention for overcoming the limitations of using a fixed-shape set for irregular representation [16]. Barinova et al. proposed a probabilistic framework for detecting multiple object instances in the Hough transform domain [17], [18]. The main idea of this research also inspired the subsequent work of Godec et al. [10]. In Godec's tracker, GMM-based segmentation [19] is incorporated into the Hough forest learning framework for irregular-shaped target tracking [16]. With back-projection supporting the online tracking process, Godec's tracker outperforms many state-of-the-art methods based on fixed-shape appearance representation. However, relying heavily on the discrimination of color Gaussian kernels makes the segmentation result unreliable, especially when intrinsic changes happen against a uniformly colored background containing obvious edges.
To address the problem of learning data generation, in this paper we propose a novel motion-appearance model that achieves accurate spatiotemporal segmentation for online tracking of deformable objects. Compared with spatially guided segmentation [10], which depends only on the texture/pixel information within an individual frame, the proposed model segments the target areas more precisely with the help of motion information between consecutive frames, especially when the textures of the background and the target are similar. The segmented areas provide more precise samples for online model updating. To describe the appearance of the deformable object effectively, we use the structural SVM [20] to construct an online tracking framework, which is more accurate than modeling the target appearance with only a fixed rectangle in the spatial domain. The proposed tracker is more robust in many challenging scenes, including rotation, intrinsic compression/stretching and aspect ratio changes, for vision-based surveillance applications.
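As a rough illustration of why a motion cue helps when target and background textures are similar, the sketch below fuses a per-pixel appearance probability with a per-pixel motion probability. This is not the motion-appearance model proposed in the paper, whose formulation is not given in this section. It is a generic product-rule fusion that assumes the two cues are conditionally independent and that the foreground prior is 0.5. The function names and the toy probability maps are hypothetical.

```python
def fuse_cues(appearance_prob, motion_prob):
    """Fuse two foreground probabilities for one pixel, assuming the cues are
    conditionally independent and the foreground prior is 0.5."""
    fg = appearance_prob * motion_prob
    bg = (1.0 - appearance_prob) * (1.0 - motion_prob)
    if fg + bg == 0.0:
        return 0.5  # the cues contradict each other completely
    return fg / (fg + bg)


def segment(appearance_map, motion_map, threshold=0.5):
    """Label a pixel as foreground (1) when the fused probability exceeds threshold."""
    return [
        [1 if fuse_cues(a, m) > threshold else 0 for a, m in zip(a_row, m_row)]
        for a_row, m_row in zip(appearance_map, motion_map)
    ]


# Hypothetical 2x3 frame: appearance is ambiguous (0.5) everywhere because
# target and background look alike; only the motion cue separates them.
appearance = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
motion = [[0.9, 0.9, 0.1], [0.9, 0.1, 0.1]]
labels = segment(appearance, motion)
```

With an uninformative appearance cue, the fused result simply follows the motion cue, so the moving pixels are still labeled as foreground. A purely spatial segmentation would have nothing to separate them with in this toy case.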
The rest of the paper is organized as follows: Section 2 introduces the proposed spatiotemporal segmentation with the motion-appearance model, and Section 3 briefly introduces the online learning tracking framework. Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.
Section snippets
Learning samples generation by spatiotemporal segmentation
Learning sample generation is an important issue for most online tracking-by-learning strategies [11]. For fixed-shape based approaches, the main challenge is that the correctness of the samples generated for online learning is hard to guarantee because of the noise in the coarse data given by the annotated bounding boxes. The purpose of the proposed segmentation model is to separate the foreground from the background with a fine, smooth contour. The accurately separated foreground areas can provide higher
Structural tracking-learning strategy
The proposed tracking framework is shown in Fig. 1. In the initialization step of tracking, we generate training samples around the target region and obtain a robust SVM classifier as the target detector. In the target prediction step, we evaluate each candidate target with the trained detector and take the sample with the locally maximal prediction value as the initial tracking result. Then we compute the target displacement between two consecutive frames as the motion estimate, and the final
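The prediction and motion-estimation steps described in this snippet can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain linear scorer stands in for the trained structural SVM detector, the best candidate is taken over a small set assumed to be sampled around the previous position, and all names, weights and feature vectors are hypothetical.

```python
def score(weights, bias, features):
    """Linear classifier response w.x + b for one candidate."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_target(candidates, weights, bias):
    """Return the (position, features) candidate with the highest score."""
    return max(candidates, key=lambda c: score(weights, bias, c[1]))


def displacement(prev_pos, new_pos):
    """Target displacement between two consecutive frames."""
    return (new_pos[0] - prev_pos[0], new_pos[1] - prev_pos[1])


# Hypothetical detector parameters and candidates near the previous position.
weights, bias = [1.0, -0.5], 0.1
prev_position = (40, 60)
candidates = [
    ((38, 60), [0.2, 0.9]),
    ((43, 62), [0.8, 0.1]),
    ((41, 59), [0.5, 0.5]),
]
best_position, _ = predict_target(candidates, weights, bias)
motion = displacement(prev_position, best_position)
```

Here the second candidate receives the highest response, so it becomes the initial tracking result and the displacement from the previous position serves as the motion estimate.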
Experiments
The proposed work is implemented in C++ with the Intel OpenCV library and tested on a workstation with an Intel Core i7 3.4 GHz processor and 4.0 GB of RAM. We evaluate our tracking approach on video sequences from the object tracking benchmark in [8], which covers surveillance environments (Couple, Crossing and Woman), human faces (Boy, David, Faceocc2, Trellis and Shaking) and other types of cases (Mountainbike and Singer2). These videos are utilized to
Conclusion
Accurate object tracking helps salient information extraction for big vision surveillance. In this paper, we proposed a novel online method for deformable object tracking based on accurate spatiotemporal segmentation with a motion-appearance model. By employing the structural SVM to carry out online learning with more precise samples from the segmented areas, the proposed tracker has shown very satisfactory accuracy in different realistic scenarios.
Acknowledgments
This research is supported by the Research Fund for the Doctoral Program of Higher Education of China (20126102120055), the National Natural Science Foundation of China (61301194, 61571362 and 61175018) and the Foundation Grant from NWPU (3102014JSJ0014).
Peng Zhang received the B.E. degree from Xian Jiaotong University, China in 2001. He received his Ph.D. from Nanyang Technological University, Singapore in 2011. He is now an associate professor in the School of Computer Science, Northwestern Polytechnical University, China. His current research interests include signal processing, multimedia security and pattern recognition. He is a member of IEEE.
References (45)
- et al., The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Comput. Vis. Image Underst. (CVIU) (1996)
- et al., Multi-task pose-invariant face recognition, IEEE Trans. Image Process. (T-IP) (2015)
- et al., General tensor discriminant analysis and Gabor features for gait recognition, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2007)
- et al., Large-margin multi-view information bottleneck, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2014)
- et al., Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2006)
- et al., Multiview Hessian regularization for image annotation, ACM Comput. Surv. (2013)
- H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: a review, Neurocomputing...
- et al., Adaptive appearance modeling for video tracking: survey and evaluation, IEEE Trans. Image Process. (T-IP) (2012)
- Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: IEEE Conference on Computer Vision and Pattern...
- et al., Slow feature analysis for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) (2012)
- Tracking learning detection, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)
- Double shrinking sparse dimension reduction, IEEE Trans. Image Process. (T-IP)
- Ensemble manifold regularization, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)
- Geometric mean for subspace selection, IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI)
- Lucas/Kanade meets Horn/Schunck: combining local and global optical flow methods, Int. J. Comput. Vis. (IJCV)
Tao Zhuo received the B.S. degree in Computer Science and Technology from Xian Shiyou University, Xian, China, in 2009. In 2012 and 2016, he received the master's degree and the Ph.D. degree, respectively, in Computer Science and Technology from Northwestern Polytechnical University, Xian, China. Currently, he is a research fellow in the School of Computing, National University of Singapore. His current research interests include visual object tracking, machine learning and computer vision.
Lei Xie received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xian, China, in 2004. He is currently a professor with the School of Computer Science, Northwestern Polytechnical University, Xian, China. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a visiting scientist. From 2004 to 2006, he was a senior research associate in the Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong. From 2006 to 2007, he was a postdoctoral fellow in the Human-Computer Communications Laboratory (HCCL), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He has published more than 100 papers in major journals and proceedings, such as the IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, INFORMATION SCIENCES, PATTERN RECOGNITION, ACM/Springer Multimedia Systems, Springer Multimedia Tools and Applications, ACL, Interspeech, ICPR, and ICASSP. He has served as program chair, organizing chair, and program and organizing committee member at major conferences. He is a senior member of IEEE, a member of ISCA, a member of ACM, a member of APSIPA and a senior member of China Computer Federation (CCF). He is a member of the board of governors of the Chinese Information Processing Society of China (CIPSC), a board member of the APSIPA Speech, Language and Audio (SLA) technical committee, a board member of the multimedia technical committee of CCF, and a board member of the multimedia technical committee of China Society of Image and Graphics (CSIG). His current research interests include speech and language processing, multimedia and human-computer interaction.
Yanning Zhang is currently a professor in the School of Computer Science, Northwestern Polytechnical University, China. She received her Ph.D. from Northwestern Polytechnical University, China in 1996. Her current research interests are in signal processing, multimedia and computer vision. Zhang has been an active member of the technical program committees of several international conferences and a reviewer for several reputed journals and conferences, such as IEEE Transactions on Systems, Man and Cybernetics (T-SMC) and Pattern Recognition Letters. She has also been the organization chair of the Ninth Asian Conference on Computer Vision (ACCV09). She is currently a member of IEEE.