Model-based approach to spatial–temporal sampling of video clips for video object detection by classification

https://doi.org/10.1016/j.jvcir.2014.02.014

Highlights

  • A computational approach to build a class-specific optimal key-object model sequence is proposed.

  • The approach is robust in detecting video objects in video clips with cluttered backgrounds.

  • We propose an automatic training procedure based on multiple alignment with dynamic programming.

  • Techniques for detecting video objects by classification are implemented.

Abstract

For a variety of applications such as video surveillance and event annotation, the spatial–temporal boundaries between video objects are required for annotating visual content with high-level semantics. In this paper, we define spatial–temporal sampling as a unified process of extracting video objects and computing their spatial–temporal boundaries using a learnt video object model. We first provide a computational approach for learning an optimal key-object codebook sequence from a set of training video clips to characterize the semantics of the detected video objects. Then, dynamic programming with the learnt codebook sequence is used to locate the video objects with spatial–temporal boundaries in a test video clip. To verify the performance of the proposed method, a human action detection and recognition system is constructed. Experimental results show that the proposed method gives good performance on several publicly available datasets in terms of detection accuracy and recognition rate.

Introduction

Delineating the spatial–temporal boundaries of video objects in a video clip is one of the most important problems in computer vision because of its potential for many vision-based applications, such as video surveillance, man–machine interfaces, video indexing and retrieval, posture recognition, analysis of sports events, and authoring of video games [1]. Recently, semantic-based video analysis has tended to model a video clip as a graph whose nodes are high-level video objects, each performing a specific action [2]. Graph-matching techniques are then applied to annotate the event type of the input video clip [3]. Detecting and classifying video objects in video clips helps bridge the semantic gap between low-level features and high-level semantics.

Conventional video object detection (VOD) algorithms, which characterize spatially cohesive objects with locally smooth trajectories, use tracking or body-pose estimation techniques to extract spatial–temporal tubes from the input video clip [4], [5], [6]. However, when applied to real-world videos, traditional tracking and pose-estimation techniques are generally unreliable under object occlusion, distortion and lighting changes. Instead, we formulate the tracking process for VOD as a classification problem, because objects across consecutive frames are, in general, spatially and temporally cohesive. Moreover, under relatively slow camera motion, the shape and location of objects vary slowly from frame to frame, so the search space for tracking an object across many frames is reduced significantly by exploiting this coherence. By treating each parameter set in the feasible search space as a class, object tracking for VOD is cast into a classification framework [7].
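To make the classification view of tracking concrete, the following sketch scores a small neighborhood of candidate object states around the previous state and keeps the best-scoring one. It is a minimal illustration rather than the paper's implementation: score_fn stands in for a trained per-class appearance classifier, and the state parameterization (position and scale) and the search ranges are our assumptions.

    import numpy as np

    def track_by_classification(prev_state, frame, score_fn,
                                max_shift=8, max_scale=0.1):
        """One tracking step cast as classification: each candidate state in
        a small window around the previous state is treated as a class and
        scored by an appearance classifier (score_fn is a placeholder)."""
        x, y, s = prev_state
        best_state, best_score = prev_state, -np.inf
        for dx in range(-max_shift, max_shift + 1, 2):      # coarse spatial grid
            for dy in range(-max_shift, max_shift + 1, 2):
                for ds in (-max_scale, 0.0, max_scale):     # three scale hypotheses
                    cand = (x + dx, y + dy, s * (1.0 + ds))
                    score = score_fn(frame, cand)           # classifier confidence
                    if score > best_score:
                        best_state, best_score = cand, score
        return best_state, best_score

Because motion is assumed slow, the candidate set stays small, which is exactly the search-space reduction exploited above.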

The image of an object often consists of several parts arranged in a deformable configuration [8]. The use of visual patterns of local patches in shape modeling is related to several ideas, including the approach of local appearance codebooks [9] and the generalized Hough transform (GHT) [10] for object detection. At training time, these methods learn a model of the spatial occurrence distributions of local patches with respect to object centers. At testing time, based on the trained object models, the appearances of points of interest in images or videos are matched against visual codebooks to detect a specific object using the Hough voting framework. The effectiveness of visual pattern grouping by Hough voting is heavily dependent on the quality of the learnt visual model, and thus the ability to precisely locate the target objects and extract typical features from training samples is very important.
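The voting scheme can be sketched as follows: each local patch is matched to its nearest codebook entry, and the entry's stored center offsets cast votes into an accumulator whose peak gives an object-center hypothesis. The data layout (a (K, D) codebook matrix and per-entry offset lists) and the distance threshold are illustrative assumptions, not the cited methods' exact design.

    import numpy as np

    def hough_vote(keypoints, descriptors, codebook, offsets,
                   img_shape, match_thresh=0.7):
        """GHT-style voting sketch: codebook is a (K, D) array of codeword
        descriptors; offsets[k] lists the (dx, dy) displacements to the
        object center observed for codeword k at training time."""
        acc = np.zeros(img_shape[:2], dtype=np.float32)     # voting accumulator
        for (x, y), d in zip(keypoints, descriptors):
            dists = np.linalg.norm(codebook - d, axis=1)
            k = int(np.argmin(dists))                       # nearest codeword
            if dists[k] > match_thresh:                     # reject weak matches
                continue
            for dx, dy in offsets[k]:                       # vote for centers
                cx, cy = int(x + dx), int(y + dy)
                if 0 <= cy < acc.shape[0] and 0 <= cx < acc.shape[1]:
                    acc[cy, cx] += 1.0 / len(offsets[k])    # per-entry normalization
        cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
        return (cx, cy), acc                                # strongest hypothesis

In practice the accumulator is usually smoothed before peak picking, which is one reason the quality of the learnt codebook dominates detection performance.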

In this paper, we formulate a video object as a sequence of key-objects, each of which is modeled by an effective visual dictionary [11]. A key-object is defined as the image object inside a key-frame when key-frames are used to represent a video clip [12], [13], [53]. Given an activity class, we model the video object performing that activity as a sequence of key-object visual dictionaries. At testing time, the system applies a well-known dynamic programming approach to detect the target video objects by optimally aligning the frames of the input video clip with the key-object dictionary sequence. More specifically, the visual dictionary sequence of a class serves as an object template, and template matching with dynamic programming is used to locate the video objects in the video clips. The semantics of each detected video object is further verified by a trained class-specific classifier.
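The alignment admits a standard dynamic-programming recurrence. The sketch below assigns each of T frames to one of K codebooks so that the assignment advances monotonically in time, given a precomputed cost matrix; how cost[t, k] is obtained (e.g., from codebook matching scores) is model-specific, and this is a generic monotone alignment, not necessarily the paper's exact formulation.

    import numpy as np

    def align_frames_to_codebooks(cost):
        """Monotone DP alignment: cost[t, k] is the cost of explaining frame
        t with codebook k. Assumes T >= K; returns the per-frame codebook
        indices of the minimum-cost alignment and its total cost."""
        T, K = cost.shape
        D = np.full((T, K), np.inf)
        back = np.zeros((T, K), dtype=int)
        D[0, 0] = cost[0, 0]                    # must start at the first codebook
        for t in range(1, T):
            for k in range(K):
                stay = D[t - 1, k]              # keep using codebook k
                advance = D[t - 1, k - 1] if k > 0 else np.inf
                if advance < stay:              # move on to the next codebook
                    D[t, k], back[t, k] = advance + cost[t, k], k - 1
                else:
                    D[t, k], back[t, k] = stay + cost[t, k], k
        path = [K - 1]                          # must end at the last codebook
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return path[::-1], float(D[T - 1, K - 1])

This alignment runs in O(TK) time for a T-frame clip and a K-codebook model, which is part of why a compact key-object representation is attractive.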

The first challenge of the approach is its worst-case computational complexity of O(n³) for video objects of n frames, which makes the learning algorithm expensive; it is therefore crucial to find a more efficient key-object representation for video objects. Another challenge in the key-object representation of a video object is the computation of the temporal boundaries between key-objects. To tackle these problems, this paper makes the following contributions to video object detection and classification. First, a learning algorithm that generates a codebook sequence is proposed to detect and classify video objects in a video clip simultaneously; the computational cost of the learning approach is also discussed. Second, dynamic programming together with the learnt key-object codebooks yields the optimal alignment between the frames of a video sequence and the codebook sequence of a specific object class. Third, based on the alignment result, every key-object codebook locates potential objects in the corresponding frames using Hough voting; a simple object-correspondence step then tracks the detected objects across frames and locates multiple video objects in the input video. Next, a class-specific SVM classifier filters out detected video objects that do not belong to the activity class. Thus, the model-based object tracking approach can detect multiple video objects in the input video clip. Finally, the user interaction needed to manually delineate the target video object for learning a class-specific model sequence is reduced to a minimum. Experimental results show that the proposed method gives good performance on several publicly available datasets in terms of detection accuracy and recognition rate.
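Of these steps, the object-correspondence stage is the simplest to sketch. Under the slow-motion assumption, per-frame detections can be linked greedily by nearest centers; the gating distance and the greedy (rather than globally optimal) matching are illustrative choices on our part, not the paper's exact procedure.

    import numpy as np

    def link_detections(per_frame_centers, max_dist=40.0):
        """Greedy track linking: per_frame_centers[t] is the list of (x, y)
        object centers detected in frame t; detections closer than max_dist
        to a track's last center extend that track, others start new ones."""
        tracks = [[c] for c in per_frame_centers[0]]
        for centers in per_frame_centers[1:]:
            unmatched = list(centers)
            for tr in tracks:
                if not unmatched:
                    break
                last = np.asarray(tr[-1], dtype=float)
                dists = [np.linalg.norm(last - np.asarray(c, dtype=float))
                         for c in unmatched]
                j = int(np.argmin(dists))
                if dists[j] <= max_dist:        # extend track with nearest detection
                    tr.append(unmatched.pop(j))
            for c in unmatched:                 # leftovers start new tracks
                tracks.append([c])
        return tracks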

The remainder of this paper is organized as follows. Section 2 presents related work on semantic video object detection and recognition. Section 3 defines the problem of video object detection by classification and reviews the relevant background. Section 4 addresses the computational issues of creating the video object model for an activity class in the learning phase and of detecting video objects with the learnt codebooks in the testing phase; it also presents the proposed codebook matching algorithm, which uses dynamic programming to optimally select the proper codebooks for interpreting the target objects in the frames of a test video sequence. Section 5 describes the experimental tests that illustrate the effectiveness of the proposed method. Finally, conclusions are drawn in Section 6.

Section snippets

Related work

A common element of all previous VOD methods in the literature is that they estimate the object boundaries to delineate all objects in video frames by tracking [4], [11]. A primary motivation for the work presented here is to question the benefits of tracking object boundaries across frames for video-based applications, such as activity analysis. In practice, the accuracy of any boundary estimate is limited by a number of systemic factors such as image resolution, noise, motion skew and the

Background knowledge

The essential module in implementing the proposed semantic video object detection is the construction of a visual codebook to detect objects in a 2D image. This problem has been extensively studied [8], [9]. A common method is to extract feature vectors from local appearances, i.e., patches, in all static training objects. A codebook of local appearances is then built by applying a clustering algorithm to the feature vectors of the patches, so as to learn the variation in the appearance of the target object. The
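A minimal version of this clustering step, assuming SIFT-like patch descriptors and plain k-means with illustrative parameter choices (codebook size k, iteration count), could look as follows; real systems typically use optimized k-means implementations.

    import numpy as np

    def build_codebook(patch_descriptors, k=256, iters=20, seed=0):
        """Cluster local patch descriptors with plain k-means; the centroids
        become the codewords that summarize the appearance variation of the
        target object. Assumes more descriptors than codewords."""
        rng = np.random.default_rng(seed)
        X = np.asarray(patch_descriptors, dtype=float)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)            # assign to nearest centroid
            for j in range(k):
                pts = X[labels == j]
                if len(pts):                     # leave empty clusters unchanged
                    centers[j] = pts.mean(axis=0)
        return centers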

The implementation

Fig. 3 shows the block diagram of the proposed semantic video object detection by classification. The approach is divided into two phases: (1) in the training phase, for each class, the algorithm returns the optimal time-ordered moving key-object codebooks Γ = {ϕ_1, …, ϕ_k} and the activity classifier Φ; (2) in the testing phase, the system detects video objects with specific semantics using Γ and Φ. The learnt key-object codebooks are time-ordered if the order information in the training video clips
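The testing phase of this block diagram can be summarized as the composition below. Since the paper's concrete modules are not reproduced here, the stages are passed in as callables; only the data flow (align, localize, link, verify) reflects the pipeline described above.

    def detect_video_objects(clip, gamma, phi, align, localize, link):
        """Testing-phase skeleton: gamma is the learnt key-object codebook
        sequence and phi the class-specific verifier; align, localize and
        link are placeholders for the DP alignment, Hough localization and
        object-correspondence stages."""
        path = align(clip, gamma)                 # frame-to-codebook alignment
        candidates = localize(clip, gamma, path)  # per-frame object hypotheses
        tracks = link(candidates)                 # video objects across frames
        return [t for t in tracks if phi(t)]      # keep class-consistent objects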

The datasets

A series of experiments was conducted on a PC with a 3.0 GHz Intel processor using four video datasets: the KTH dataset [39], the Weizmann dataset [40], the UCF sports dataset [41] and the 50-class UCF50 dataset [50] were used to evaluate the performance of the human action detection and recognition system. The KTH dataset, whose video sequences have been used in many human action recognition studies, contains six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping. Each action is

Conclusions

In this paper, we have presented a video object detection by classification method based on the fusion of time-series posture codebooks and dynamic programming. The proposed time-series posture codebooks encode every class-specific video object as a sequence of BoW histograms. For each class, a set of training video objects was also used to train a classifier, which verifies the correctness of the candidate detected video objects at testing time. The dynamic programming framework optimally
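The BoW encoding used throughout can be sketched in a few lines: each local descriptor is quantized to its nearest codeword, and the normalized codeword histogram represents one key-object. The (K, D) codebook layout and the L1 normalization are assumptions for illustration.

    import numpy as np

    def bow_histogram(descriptors, codebook):
        """Quantize descriptors against a (K, D) codebook and return the
        L1-normalized bag-of-words histogram of codeword occurrences."""
        X = np.asarray(descriptors, dtype=float)
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        words = d.argmin(axis=1)                  # nearest-codeword index
        hist = np.bincount(words, minlength=len(codebook)).astype(float)
        return hist / max(hist.sum(), 1.0)        # normalized BoW histogram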

References (53)

  • D. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recogn. (1981).
  • R. Poppe, A survey on vision-based human action recognition, Image Vision Comput. (2010).
  • P. Turaga et al., Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol. (2008).
  • C. Xu et al., Sports video analysis: semantics extraction, editorial content creation and adaptation, J. Multimed. (2009).
  • T. Zhang et al., A generic framework for event detection in various video domains, Proc. ACM Multimed. (2010).
  • W. Brendel, S. Todorovic, Video object segmentation by tracking regions, in: Proc. Int'l Conf. Computer Vision (ICCV), ...
  • T. Brox et al., Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in: Proc. European Conference on ...
  • L. Shang et al., Model-based tracking by classification in a tiny discrete pose space, IEEE Trans. Pattern Anal. Mach. Intell. (2007).
  • P. Felzenszwalb, R. Girshick, D. McAllester, Cascade object detection with deformable part models, in: Proc. IEEE Conf. ...
  • B. Leibe et al., Robust object detection with interleaved categorization and segmentation, Int. J. Comput. Vision (2008).
  • L. Gorelick et al., Actions as space–time shapes, IEEE Trans. Pattern Anal. Mach. Intell. (2007).
  • L. Teodosio et al., Salient stills, ACM Trans. Multimed. Comput. Commun. Appl. (2005).
  • T. Liu et al., Computational approaches to temporal sampling of video sequences, ACM Trans. Multimed. Comput. Commun. Appl. (TOMCCAP) (2007).
  • M. Nicolescu et al., A voting-based computational framework for visual motion analysis and interpretation, IEEE Trans. Pattern Anal. Mach. Intell. (2005).
  • A. Yao, J. Gall, L. Van Gool, A Hough transform-based voting framework for action recognition, in: Proc. IEEE Conf. ...
  • A. Oikonomopoulos, I. Patras, M. Pantic, An implicit spatiotemporal shape model for human activity localization and ...
  • P. Scovanner, A. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proc. ACM ...
  • J. Niebles et al., Unsupervised learning of human action categories using spatial–temporal words, Int. J. Comput. Vision (2008).
  • Y. Zhang et al., Understanding bag-of-words model: a statistical framework, Int. J. Mach. Learn. Cybern. (2010).
  • L. Ballan et al., Event detection and recognition for semantic annotation of video, Multimed. Tools Appl. (2011).
  • A. Robles-Kelly et al., Graph edit distance from spectral seriation, IEEE Trans. Pattern Anal. Mach. Intell. (2005).
  • G. Lavee et al., Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video, IEEE Trans. Syst. Man Cybern. (2009).
  • A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple kernels for object detection, in: Proc. Int'l Conf. Computer ...
  • S. Vijayanarasimhan, K. Grauman, Large-scale live active learning: Training object detectors with crawled data and ...
  • M. Blaschko, A. Vedaldi, A. Zisserman, Simultaneous object detection and ranking with weak supervision, in: Proc. Int'l ...

    This work was supported in part by the National Science Council, Taiwan, under Grant Nos. NSC 100-2221-E-019-054-MY3 and 101-2918-1-019-003.
