Journal of Visual Communication and Image Representation
Model-based approach to spatial–temporal sampling of video clips for video object detection by classification☆
Introduction
Delineating the spatial–temporal boundaries of video objects in a video clip is one of the most important problems in computer vision due to its potential for many vision-based applications, such as video surveillance, man–machine interfaces, video indexing and retrieval, posture recognition, analysis of sports events, and authoring of video games [1]. Recently, semantic-based video analysis has tended to model a video clip as a graph whose nodes are high-level video objects, each performing a specific action [2]. Graph-matching techniques are then applied to annotate the event type of the input video clip [3]. Detecting and classifying video objects in video clips thus helps to bridge the semantic gap between low-level and high-level features.
Conventional video object detection (VOD) algorithms, which characterize spatially cohesive objects with locally smooth trajectories, use tracking or body-pose-estimation techniques to extract spatial–temporal tubes from the input video clip [4], [5], [6]. However, when applied to real-world videos, traditional tracking and body-pose-estimation techniques are generally unreliable due to object occlusion, distortion, and changes in lighting. Instead, we formulate the tracking process for VOD as a classification problem, because objects across consecutive frames are, in general, spatially and temporally cohesive. Also, by assuming relatively slow camera motion, the shape and location of objects vary slowly from frame to frame. Exploiting this coherence significantly reduces the size of the search space needed to track an object across many frames. By treating each parameter set in the feasible search space as a class, object tracking for VOD is cast into a classification framework [7].
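The idea of casting tracking as classification over a reduced search space can be sketched as follows: each candidate offset inside a small window around the previous object position is treated as a "class", every class is scored, and the best-scoring class gives the object's new position. The zero-mean normalized-correlation scorer and the window size below are illustrative assumptions standing in for the trained classifier of [7], not the paper's actual method:

```python
import numpy as np

def track_by_classification(frame, template, prev_xy, search_radius=8):
    """Score every candidate offset in a small window around the previous
    object position and return the best-matching location.

    Each candidate offset plays the role of a 'class'; the 'classifier'
    here is a simple zero-mean normalized-correlation score (a stand-in
    for a trained classifier)."""
    h, w = template.shape
    px, py = prev_xy
    best_score, best_xy = -np.inf, prev_xy
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            x, y = px + dx, py + dy
            if x < 0 or y < 0:
                continue  # candidate falls outside the frame
            patch = frame[y:y + h, x:x + w]
            if patch.shape != template.shape:
                continue  # candidate falls outside the frame
            a = patch - patch.mean()
            b = template - template.mean()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            score = (a * b).sum() / denom if denom > 0 else -np.inf
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Because the window is small, only (2r + 1)² candidates are scored per frame instead of the whole image, which is the search-space reduction the coherence assumption buys.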
The image of an object often consists of several parts arranged in a deformable configuration [8]. The use of visual patterns of local patches in shape modeling is related to several ideas, including the approach of local appearance codebooks [9] and the generalized Hough transform (GHT) [10] for object detection. At training time, these methods learn a model of the spatial occurrence distributions of local patches with respect to object centers. At testing time, based on the trained object models, the appearances of points of interest in images or videos are matched against visual codebooks to detect a specific object using the Hough voting framework. The effectiveness of visual pattern grouping by Hough voting is heavily dependent on the quality of the learnt visual model, and thus the ability to precisely locate the target objects and extract typical features from training samples is very important.
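A minimal sketch of the Hough voting step may help. Each matched local patch casts votes for possible object centers using the center offsets learned for its visual word at training time; peaks in the accumulator indicate detections. The `(patch position, offsets, weight)` tuple layout is a hypothetical encoding of what a trained codebook would store, used here only for illustration:

```python
import numpy as np

def hough_vote_centers(matches, vote_shape):
    """Accumulate object-center votes from matched local patches.

    `matches` is a list of ((px, py), offsets, weight) tuples: the image
    position of a matched patch, the center offsets learned for its
    visual word, and the match weight. Each patch spreads its weight
    uniformly over its learned offsets."""
    acc = np.zeros(vote_shape)
    for (px, py), offsets, weight in matches:
        for (ox, oy) in offsets:
            cx, cy = px + ox, py + oy
            if 0 <= cy < vote_shape[0] and 0 <= cx < vote_shape[1]:
                acc[cy, cx] += weight / len(offsets)
    return acc
```

The detection hypothesis is then the peak (or the set of local maxima) of the accumulator, e.g. `np.unravel_index(acc.argmax(), acc.shape)`.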
In this paper, we formulate a video object as a sequence of key-objects, each of which is modeled by an effective visual dictionary [11]. A key-object is defined as the image object inside a key-frame when key-frames are used to represent a video clip [12], [13], [53]. Given an activity class, we model the video object that performs the activity as a sequence of key-object visual dictionaries. At testing time, the system applies a well-known dynamic programming approach to detect the target video objects by optimally aligning the frames of the input video clip with the key-object dictionary sequence. More specifically, the visual dictionary sequence of a class plays the role of an object template, and template matching with dynamic programming is used to locate the video objects in the video clips. The semantics of the detected video objects is further verified by a trained class-specific classifier.
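The alignment step can be illustrated with a small dynamic program. Given a T×K cost matrix scoring the mismatch between each of T frames and each of K time-ordered codebooks, the program assigns every frame to one codebook, starts at the first codebook, ends at the last, never moves backwards in time, and minimizes the total cost. The cost matrix here is an arbitrary input; in the paper it would come from matching frames against the key-object dictionaries:

```python
import numpy as np

def align_frames_to_codebooks(cost):
    """Optimal monotonic assignment of T frames to K time-ordered
    key-object codebooks by dynamic programming.

    cost[t, k] is the mismatch between frame t and codebook k. Returns
    the per-frame codebook index path and the total alignment cost."""
    T, K = cost.shape
    D = np.full((T, K), np.inf)
    D[0, 0] = cost[0, 0]
    for t in range(1, T):
        for k in range(K):
            prev = D[t - 1, k]                    # stay on the same codebook
            if k > 0:
                prev = min(prev, D[t - 1, k - 1])  # or advance by one
            D[t, k] = cost[t, k] + prev
    # backtrack from the last codebook
    k, path = K - 1, [K - 1]
    for t in range(T - 1, 0, -1):
        if k > 0 and D[t - 1, k - 1] <= D[t - 1, k]:
            k -= 1
        path.append(k)
    path.reverse()
    return path, D[T - 1, K - 1]
```

The table fill is O(TK) here; richer transition models (skips, per-segment costs) grow the complexity accordingly, which is the computational concern discussed next.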
The first challenge of the approach is its worst-case computational complexity of O(n³) for video objects of n frames, which makes the learning algorithm expensive. It is therefore crucial to find a more efficient key-object representation for video objects. Another challenge in the key-object representation of a video object is the computation of the temporal boundaries between key-objects. To tackle these problems, this paper makes the following contributions to video object detection and classification. Firstly, a learning algorithm that generates a codebook sequence is proposed to detect and classify video objects from a video clip simultaneously; the computational cost of this learning approach is also discussed. Secondly, the use of dynamic programming together with the learnt key-object codebooks allows the optimal alignment between the frames of a video sequence and the codebook sequence of a specific object class. Thirdly, based on the alignment result, every key-object codebook locates potential objects in the corresponding frames using Hough voting. A simple object-correspondence step follows to track the detected objects across frames and locate multiple video objects in the input video. Next, a class-specific SVM classifier filters out detected video objects that do not belong to the activity class. The model-based object tracking approach thus provides the ability to detect multiple video objects in the input video clip. Finally, the user interaction required to manually delineate the target video object for learning a class-specific model sequence is reduced to a minimum. Experimental results show that the proposed method performs well on several publicly available datasets in terms of detection accuracy and recognition rate.
The remainder of this paper is organized as follows. Section 2 presents related work on semantic video object detection and recognition. Section 3 defines the problem of video object detection by classification and reviews the related background knowledge. Section 4 deals with the computational issues of creating the video object model for an activity class in the learning phase and detecting video objects with the learnt codebooks in the testing phase. In this section, we also present the proposed codebook matching algorithm, which uses dynamic programming to optimally select codebooks for interpreting the target objects of frames in a test video sequence. Section 5 describes the experimental tests that illustrate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 6.
Related work
A common element of all previous VOD methods in the literature is that they estimate the object boundaries to delineate all objects in video frames by tracking [4], [11]. A primary motivation for the work presented here is to question the benefits of tracking object boundaries across frames for video-based applications, such as activity analysis. In practice, the accuracy of any boundary estimate is limited by a number of systemic factors such as image resolution, noise, motion skew and the …
Background knowledge
The essential module in implementing the proposed semantic video object detection is the construction of a visual codebook for detecting objects in a 2D image. This problem has been extensively studied [8], [9]. A common method is to extract feature vectors from local appearances, i.e., patches, in all static training objects. A codebook of local appearances is then built by applying a clustering algorithm to the feature vectors of the patches, to learn the variation in the appearance of the target object. The …
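As a sketch of that pipeline, the fragment below clusters patch descriptors into a toy codebook with a minimal k-means and then quantizes a set of descriptors into a normalized bag-of-words histogram. The descriptor dimensionality, cluster count, and k-means details are illustrative assumptions, not the cited methods' exact choices:

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Toy k-means over patch descriptors; returns k codeword centers.
    (A stand-in for the clustering step used in codebook learning.)"""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every descriptor to its nearest center, then recompute centers
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Quantize descriptors against the codebook and return a normalized
    bag-of-words histogram of codeword occurrences."""
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(centers))
    return counts / counts.sum()
```

The histogram is the per-frame representation that the class-specific classifier would consume; real systems typically use hundreds of codewords and richer local descriptors.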
The implementation
Fig. 3 shows the block diagram of the proposed semantic video object detection by classification. The approach is divided into two phases: (1) in the training phase, for each class, the algorithm returns the optimal time-ordered moving key-object codebooks Γ and the activity classifier Φ; (2) in the testing phase, the system detects video objects with specific semantics using Γ and Φ. The learnt key-object codebooks are time-ordered if the order information in the training video clips …
The datasets
A series of experiments was conducted on a PC with a 3.0 GHz Intel processor using four video datasets. The KTH dataset [39], the Weizmann dataset [40], the UCF Sports dataset [41] and the 50-class UCF50 dataset [50] were used to evaluate the performance of the human action detection and recognition system. The KTH video sequences have been used in many human action recognition studies; the dataset contains six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. Each action is …
Conclusions
In this paper, we have presented a video object detection by classification method based on the fusion of time-series posture codebooks and dynamic programming. The proposed time-series posture codebooks encode every class-specific video object as a sequence of BoW histograms. For each class, a set of training video objects was also used to train a classifier, which verifies the correctness of the candidate detected video objects at testing time. The dynamic programming framework optimally …
References (53)
- Generalizing the Hough transform to detect arbitrary shapes, IEEE Comput. Vision Pattern Recogn. (1981).
- A survey on vision-based human action recognition, Image Vision Comput. (2010).
- Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol. (2008).
- Sports video analysis: semantics extraction, editorial content creation and adaptation, J. Multimed. (2009).
- A generic framework for event detection in various video domains, Proc. ACM Multimed. (2010).
- W. Brendel, S. Todorovic, Video object segmentation by tracking regions, in: Proc. Int'l Conf. Computer Vision (ICCV), …
- Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
- T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in: Proc. European Conference on …
- Model-based tracking by classification in a tiny discrete pose space, IEEE Trans. Pattern Anal. Mach. Intell. (2007).
- P. Felzenszwalb, R. Girshick, D. McAllester, Cascade object detection with deformable part models, in: Proc. IEEE Conf. …
- Robust object detection with interleaved categorization and segmentation, Int. J. Comput. Vision.
- Actions as space–time shapes, IEEE Trans. Pattern Anal. Mach. Intell.
- Salient stills, ACM Trans. Multimed. Comput. Commun. Appl.
- Computational approaches to temporal sampling of video sequences, ACM Trans. Multimed. Comput. Commun. Appl. (TOMCCAP).
- A voting-based computational framework for visual motion analysis and interpretation, IEEE Trans. Pattern Anal. Mach. Intell.
- Unsupervised learning of human action categories using spatial–temporal words, Proc. Int. Conf. Comput. Vision.
- Understanding bag-of-words model: a statistical framework, Int. J. Mach. Learn. Cybern.
- Event detection and recognition for semantic annotation of video, Multimed. Tools Appl.
- Graph edit distance from spectral seriation, IEEE Trans. Pattern Anal. Mach. Intell.
- Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video, IEEE Trans. Syst. Man Cybern.
☆ This work was supported in part by National Science Council Taiwan under Grant Nos. NSC 100-2221-E-019-054-MY3 and 101-2918-1-019-003.