Evaluating visual query methods for articulated motion video search

https://doi.org/10.1016/j.ijhcs.2014.12.009

Highlights

  • We develop three interfaces for video search of articulated objects.

  • Qualitative and quantitative evaluation of each interface.

  • Constrained interfaces outperform the freehand sketch-based interface.

  • Users indicated strong preferences for search interfaces containing pre-defined models.

Abstract

We develop and evaluate three interfaces for video search of articulated objects, specifically humans performing common actions. The three interfaces, (1) a freehand interface with motion cues (e.g., arrows), (2) an articulated human stick figure with motion cues, and (3) a keyframe interface, were designed to allow users to quickly generate motion-based queries. We performed both quantitative and qualitative analyses of the interfaces through a formal user study, measuring the accuracy and speed of user input and asking the users to complete a free-response questionnaire. Our results indicate that the constrained interfaces outperform the freehand sketch-based interface in terms of both search accuracy and query completion time. Additionally, the users expressed strong preferences for the search interfaces containing pre-defined models, and the queries generated with those interfaces were rated higher in terms of semantic match to the query concept.

Introduction

With the proliferation of video capture devices and inexpensive, large-scale storage, video data is increasingly being aggregated for both entertainment and analytic (e.g., athletics, surveillance, medical) purposes. Developing efficient and robust methods for searching large video repositories is an ongoing challenge. Commercially available solutions (e.g., Google Video) generally match text queries to video metadata (e.g., keywords, title). Searching for data in these repositories typically requires a large investment of manual effort in either annotation or real-time observation, and the possibility of incomplete or incorrect metadata is a well-known limitation (Carson and Ogle, 1996). Even with extensive, accurate annotation, it is still difficult to capture all of the semantic information contained in even short video clips. A number of approaches (e.g., Suma et al., 2008, Chang et al., 1997) focus on non-textual input, or visual queries, for searching video. These approaches hold the promise of avoiding the database annotation step required for text-based matching, but they also introduce new challenges that cut across multiple areas of computing, including video processing, data representation, and interface design.

Many of the domains in which repositories of data are stored, such as athletics or surveillance, contain video of human activity and would benefit from new methods for video search that accelerate the process of locating relevant videos, potentially aiding in physical therapy training or identifying specific security footage of interest. Therefore, in this paper, we focus on the problem of searching for video clips of humans performing common actions. We design and evaluate three different interfaces for generating visual queries. The first interface follows the sketch-based input paradigm, where the user can draw a stick figure with action arrows to indicate motion. The second interface extends the first by providing a pre-defined template of an articulated human figure (stick figure) for the user to pose, again using action arrows as motion cues. The third interface re-uses the pre-defined template but avoids motion cues; instead, the user defines a sequence of poses to represent the visual video query. Fig. 1 shows examples of each interface. These three interfaces span a range of approaches that are applicable to the typical keyboard–video–mouse setup, and can also be applied to the touch interfaces found on smartphones and tablets. In order to evaluate the effectiveness of the different visual query interfaces, we conducted a formal user study where we measured the query generation time and the accuracy of the resulting search in terms of the number of highly ranked results matching the search concept. Additionally, the users provided feedback on the strengths and weaknesses of each interface through a post-experiment survey.
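As a rough illustration only (not the representation used by our system), the following Python sketch shows one plausible data structure for each of the three query styles; all class and field names are hypothetical.

    # Illustrative only: one plausible data representation per query style.
    # All class and field names are assumptions, not the system's internals.
    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    Point = Tuple[float, float]            # (x, y) in canvas coordinates

    @dataclass
    class MotionArrow:                     # motion cue attached near a limb
        start: Point                       # arrow tail
        end: Point                         # arrow head (direction of motion)

    @dataclass
    class FreehandQuery:                   # interface 1: free sketch + arrows
        strokes: List[List[Point]]         # raw pen or finger strokes
        arrows: List[MotionArrow]

    @dataclass
    class TemplateQuery:                   # interface 2: posed stick figure + arrows
        joints: Dict[str, Point]           # e.g. {"right_elbow": (120.0, 80.0)}
        arrows: List[MotionArrow]

    @dataclass
    class KeyposeQuery:                    # interface 3: ordered sequence of poses
        poses: List[Dict[str, Point]]      # one joint map per keyframe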

Related work

The literature on automated methods for content-based visual information retrieval (CBVIR) is extensive; see Lew et al. (2006) and Marchand-Maillet (2000) for surveys. This body of work includes methods for both image and video search, and various paradigms, such as text-based searching (e.g., Naphade and Huang, 2000, Zha et al., 2009) or search by example (e.g., Taskiran et al., 2004). With text-based approaches, which match the query to metadata associated with images or video, the quality of the results depends heavily on the completeness and accuracy of that metadata.

Input interfaces and motion inference

To ground the evaluation, we developed three interfaces for generating visual queries for human actions in video. While the specific interface components (e.g., the feature transform and matching algorithm described in Section 4.2) are not the focus of this work and could be replaced with other methods, they serve as a means of comparing the three interface styles for searching for articulated motions. Fig. 1 shows an example of each interface: (1) Freehand (Fig. 1(a)) allows the user to draw a free-form stick figure and annotate it with action arrows indicating the intended motion.
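For illustration, the minimal Python sketch below shows one possible motion-inference step: each action arrow linearly displaces its attached joint over a fixed number of synthetic frames. This is an assumed simplification for exposition, not the feature transform of Section 4.2.

    # Hypothetical motion inference: each arrow drags its attached joint
    # from its posed position toward the arrow head over T synthetic steps.
    def infer_motion(joints, arrows, attachment, T=10):
        """joints: {name: (x, y)}; arrows: list of ((x0, y0), (x1, y1)) pairs;
        attachment: {arrow index: joint name} linking each arrow to a joint."""
        frames = []
        for t in range(T + 1):
            a = t / float(T)
            frame = dict(joints)
            for i, (tail, head) in enumerate(arrows):
                name = attachment[i]
                x, y = joints[name]
                frame[name] = (x + a * (head[0] - tail[0]),
                               y + a * (head[1] - tail[1]))
            frames.append(frame)
        return frames

    # Example: the right hand moves 40 px to the right over 10 frames.
    # infer_motion({"right_hand": (100.0, 50.0)},
    #              [((100.0, 50.0), (140.0, 50.0))], {0: "right_hand"})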

Video matching

For this problem, it is important to model the motion characteristics of the video rather than the appearance, since sketches do not share appearance characteristics with real video. Many feature descriptors have been specifically designed for human motion detection. They employ a variety of approaches such as using keypoints on the body to map movement to a model (Kovar et al., 2002, Arikan and Forsyth, 2002, Fujiyoshi et al., 2004), contour analysis (Wren et al., 1997), optical flow combined
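As a concrete, simplified example of a motion-only descriptor, the sketch below pools dense optical-flow directions over a clip using OpenCV's Farneback flow. It is an assumption made for illustration; the actual descriptor and matching algorithm used in our pipeline may differ.

    # Minimal sketch of an appearance-free motion descriptor: a magnitude-
    # weighted histogram of dense optical-flow directions pooled over a clip.
    import cv2
    import numpy as np

    def motion_descriptor(video_path, bins=8):
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        hist = np.zeros(bins)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            idx = (ang * bins / (2 * np.pi)).astype(int) % bins
            np.add.at(hist, idx.ravel(), mag.ravel())   # accumulate per direction bin
            prev_gray = gray
        cap.release()
        return hist / (hist.sum() + 1e-9)   # normalize so clips of different length compare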

User study

To compare the differences among the three interfaces, we conducted a study where participants were asked to search for common human actions from a publicly-available human action recognition database.
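One natural way to score each query, consistent with counting the highly ranked results that match the search concept, is precision at a fixed cutoff k over the returned ranking. The sketch below is illustrative; the cutoff and scoring protocol are assumptions, not the exact measure used in the study.

    # Hypothetical accuracy measure: fraction of the top-k returned clips
    # whose action label matches the searched concept ("precision at k").
    def precision_at_k(ranked_labels, target_action, k=10):
        top = ranked_labels[:k]
        return sum(1 for label in top if label == target_action) / float(k)

    # Example: 7 of the top 10 results carry the searched action label.
    # precision_at_k(["walk"] * 7 + ["run"] * 3, "walk")  -> 0.7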

Results

Unless otherwise noted, all tests cited in this paper used a significance level of α = .05.
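For illustration only, a comparison of per-interface query completion times at this significance level could be run as follows with SciPy; the specific tests used in the study are not reproduced here.

    # Illustrative sketch: paired t-test on completion times from the same
    # participants under two interfaces, evaluated at alpha = .05.
    from scipy import stats

    def compare_interfaces(times_a, times_b, alpha=0.05):
        t, p = stats.ttest_rel(times_a, times_b)
        return p < alpha, t, p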

Discussion

While the results measured the differences between three specific implementations of user interfaces, some of the findings may be more broadly applicable. In general, for the specific task of finding video clips of humans performing common actions, the interfaces most closely tailored to the task (Template and Keypose) outperformed the free-form sketch-based interface in terms of search accuracy, query generation time, and user satisfaction.

Conclusions

Free-form sketching has been attempted on many platforms, and its popularity has ebbed and flowed over the years. Currently, sketch-based input is experiencing a resurgence with the growing number of touch-based devices. We presented an evaluation of three interfaces for composing visual queries for video search. Starting with the sketch-based paradigm, we designed additional interfaces that trade flexibility for a closer fit to the particular search task. Our results suggest that constraining the query interface to the search task improves search accuracy, query completion time, and user satisfaction.

References (41)

  • Egenhofer, M.J., 1997. Query processing in spatial-query-by-sketch. J. Vis. Lang. Comput.
  • Olsen, L., et al., 2009. Technical section: sketch-based modeling: a survey. Comput. Graph.
  • Weinland, D., et al., 2006. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst.
  • Arikan, O., Forsyth, D.A., 2002. Interactive motion generation from examples. In: ACM Transactions on Graphics (TOG)....
  • Bernhardt, A., et al. Matisse: painting 2D regions for modeling free-form shapes.
  • Bobick, A.F., et al., 2001. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell.
  • Cao, Y., Wang, C., Zhang, L., Zhang, L., 2011. Edgel index for large-scale sketch-based image search. In: IEEE...
  • Carson, C., Ogle, V.E., 1996. Storage and Retrieval of Feature Data for a Very Large Online Image Collection. Bulletin...
  • Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., Zhong, D., 1997. VideoQ: an automated content based video search...
  • Collomosse, J., McNeill, G., Qian, Y., 2009. Storyboard sketches for content based video retrieval. In: Proceedings of...
  • Collomosse, J.P., McNeill, G., Watts, L., 2008. Free-hand sketch grouping for video retrieval. In: International...
  • Farhadi, A., et al., 2008. Learning to recognize activities from the wrong view point. Computer Vision—ECCV 2008.
  • Fonseca, M., James, S., Collomosse, J., 2012. Skeletons from sketches of dancing poses. In: 2012 IEEE Symposium on...
  • Fujiyoshi, H., et al., 2004. Real-time human motion analysis by image skeletonization. IEICE Trans. Inf. Syst.
  • Gersho, A., et al., 1991. Vector Quantization and Signal Compression.
  • Hu, R., James, S., Collomosse, J., 2012. Annotated free-hand sketches for video retrieval using object semantics and...
  • Huang, K., et al., 2009. Human behavior analysis based on a new motion descriptor. IEEE Trans. Circuits Syst. Video Technol.
  • Igarashi, T., Hughes, J.F., 2003. Smooth meshes for sketch-based freeform modeling. In: I3D '03: Proceedings of the...
  • Jacobs, C.E., et al., 1995. Fast multiresolution image querying. Comput. Graph.
  • Kovar, L., et al., 2002. Motion graphs. ACM Trans. Graph.
This paper has been recommended for acceptance by Andrew Howes.