Evaluating visual query methods for articulated motion video search☆
Introduction
With the proliferation of video capture devices and inexpensive, large-scale storage, video data is increasingly being aggregated for both entertainment and analytic (e.g., athletics, surveillance, medical) purposes. Developing efficient and robust methods for searching large video repositories is an ongoing challenge. Commercially available solutions (e.g., Google Video) generally match text queries to video metadata (e.g., keywords, title). Searching for data in these repositories typically requires a large investment of manual effort in either annotation or real-time observation, and the possibility of incomplete or incorrect metadata is a well-known limitation (Carson and Ogle, 1996). Even with extensive, accurate annotation, it is still difficult to capture all of the semantic information contained in even short video clips. A number of approaches (e.g., Suma et al., 2008, Chang et al., 1997) focus on non-textual input, or visual queries, for searching video. These approaches not only hold the promise of avoiding the database annotation step required for text-based matching, but also introduce new challenges that cut across multiple areas of computing, including video processing, data representation, and interface design.
Many of the domains in which repositories of video data are stored, such as athletics or surveillance, contain video of human activity and would benefit from new search methods that accelerate the process of locating relevant clips, for example aiding physical therapy training or identifying specific security footage of interest. Therefore, in this paper, we focus on the problem of searching for video clips of humans performing common actions. We design and evaluate three different interfaces for generating visual queries. The first interface follows the sketch-based input paradigm: the user draws a stick figure with action arrows to indicate motion. The second interface extends the first by providing a pre-defined template of an articulated human figure (a stick figure) for the user to pose, while retaining action arrows as motion cues. The third interface re-uses the pre-defined template but avoids motion cues; instead, the user defines a sequence of poses to represent the visual video query. Fig. 1 shows examples of each interface. These three interfaces span a range of approaches applicable to the typical keyboard–video–mouse setup, and can also be applied to the touch interfaces found on smartphones and tablets. To evaluate the effectiveness of the different visual query interfaces, we conducted a formal user study in which we measured query generation time and the accuracy of the resulting search, the latter in terms of the number of highly ranked results matching the search concept. Additionally, users provided feedback on the positives and negatives of each interface through a post-experiment survey.
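To make the three query styles concrete, the sketch below shows one possible data model for the queries each interface produces. This is an illustrative assumption, not the paper's implementation; the class names, fields, and joint names are invented for exposition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # normalized (x, y) canvas coordinates


@dataclass
class Stroke:
    """A freehand stroke; motion arrows are flagged so a matcher could
    treat them as motion cues rather than body parts."""
    points: List[Point]
    is_motion_arrow: bool = False


@dataclass
class Pose:
    """A posed stick-figure template, stored as joint angles in degrees."""
    joint_angles: Dict[str, float]


@dataclass
class VisualQuery:
    """One query in any of the three styles:
    Freehand -> strokes, some flagged as motion arrows;
    Template -> a single pose plus arrow strokes;
    Keypose  -> an ordered sequence of poses, no arrows."""
    strokes: List[Stroke] = field(default_factory=list)
    poses: List[Pose] = field(default_factory=list)


# Example: a Keypose-style query for a waving action, as three keyframes.
wave = VisualQuery(poses=[
    Pose({"r_shoulder": 170.0, "r_elbow": 30.0}),
    Pose({"r_shoulder": 170.0, "r_elbow": -30.0}),
    Pose({"r_shoulder": 170.0, "r_elbow": 30.0}),
])
```

Modeling all three styles in one structure highlights the trade-off the paper studies: Freehand carries unconstrained strokes, while Template and Keypose constrain input to the articulated figure.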
Section snippets
Related work
The literature on automated methods for content-based visual information retrieval (CBVIR) is extensive; see Lew et al. (2006) and Marchand-Maillet (2000) for surveys. This body of work includes both methods for image and video search, and various paradigms, such as text-based searching (e.g., Naphade and Huang, 2000, Zha et al., 2009) or search by example (e.g., Taskiran et al., 2004). With text-based approaches, which match the query to metadata associated with images or video, the quality of
Input interfaces and motion inference
To ground the evaluation, we developed three interfaces for generating visual queries for human actions in video. While the specific interface components (e.g., feature transform and matching algorithm described in Section 4.2) are not the focus of this work and could be replaced with other methods, they serve as means to allow a comparison of the three interface styles for searching for articulated motions. Fig. 1 shows an example of each interface: (1) Freehand (Fig. 1(a)) allows the user to
Video matching
For this problem, it is important to model the motion characteristics of the video rather than the appearance, since sketches do not share appearance characteristics with real video. Many feature descriptors have been specifically designed for human motion detection. They employ a variety of approaches such as using keypoints on the body to map movement to a model (Kovar et al., 2002, Arikan and Forsyth, 2002, Fujiyoshi et al., 2004), contour analysis (Wren et al., 1997), optical flow combined
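As a simplified stand-in for the flow-based descriptors cited above (not the system's actual feature transform), the sketch below builds a magnitude-weighted orientation histogram of dense optical flow per frame, averages it over a clip, and compares clips by cosine similarity. All function names and parameters are illustrative assumptions.

```python
import numpy as np


def flow_histogram(flow, bins=8):
    """Quantize one dense flow field (H x W x 2) into an orientation
    histogram, weighting each vector by its motion magnitude."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)  # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist


def clip_descriptor(flows, bins=8):
    """Average per-frame histograms into a single clip-level descriptor."""
    return np.mean([flow_histogram(f, bins) for f in flows], axis=0)


def cosine_similarity(a, b):
    """Compare two descriptors; 1.0 means identical motion statistics."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Because such a descriptor encodes only motion direction and magnitude, a rendered query animation and a real video clip can be compared in the same space even though they share no appearance characteristics, which is the property this section emphasizes.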
User study
To compare the differences among the three interfaces, we conducted a study where participants were asked to search for common human actions from a publicly-available human action recognition database.
Results
Unless otherwise noted, all statistical tests cited in this paper used the same significance level α.
Discussion
While the results measured the differences between three specific implementations of user interfaces, some of the results may be more broadly applicable. In general, for the specific task of finding video clips of humans performing common actions, the interfaces most closely related to the task (Template and Keypose) outperformed the free-form sketch-based interface, in terms of search accuracy, query generation, and user satisfaction.
Conclusions
Free-form sketching has been attempted on many platforms and its popularity has ebbed and flowed over the years. Currently, sketch-based input is making a resurgence with the growing number of touch-based devices. We presented an evaluation of three interfaces for composing visual queries for video search. Starting with the sketch-based paradigm, we designed additional interfaces that made a trade-off between flexibility and being suited to the particular search task. Our results suggest that
References (41)
- Query processing in spatial-query-by-sketch. J. Vis. Lang. Comput. (1997)
- Technical section: sketch-based modeling: a survey. Comput. Graph. (2009)
- Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. (2006)
- Arikan, O., Forsyth, D.A., 2002. Interactive motion generation from examples. In: ACM Transactions on Graphics (TOG)...
- Matisse: painting 2D regions for modeling free-form shapes
- The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. (2001)
- Cao, Y., Wang, C., Zhang, L., Zhang, L., 2011. Edgel index for large-scale sketch-based image search. In: IEEE...
- Carson, C., Ogle, V.E., 1996. Storage and retrieval of feature data for a very large online image collection. Bulletin...
- Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., Zhong, D., 1997. VideoQ: an automated content based video search...
- Collomosse, J., McNeill, G., Qian, Y., 2009. Storyboard sketches for content based video retrieval. In: Proceedings of...
- Learning to recognize activities from the wrong view point. Computer Vision—ECCV 2008
- Real-time human motion analysis by image skeletonization. IEICE Trans. Inf. Syst.
- Vector Quantization and Signal Compression
- Human behavior analysis based on a new motion descriptor. IEEE Trans. Circuits Syst. Video Technol.
- Fast multiresolution image querying. Comput. Graph.
- Motion graphs. ACM Trans. Graph.
☆ This paper has been recommended for acceptance by Andrew Howes.