Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera

International Journal of Computer Vision

Abstract

Category-level object recognition, segmentation, and tracking in videos become highly challenging when applied to sequences from a hand-held camera with extensive motion and zooming. An additional challenge is to develop a fully automatic video analysis system that works without manual initialization of a tracker or other human intervention, both during training and during recognition, despite background clutter and other distracting objects. Moreover, our working hypothesis is that category-level recognition is possible based only on an erratic, flickering pattern of interest point locations, without extracting additional features. Compositions of these points are then tracked individually by estimating a parametric motion model. Groups of compositions segment a video frame into the various objects that are present and into background clutter. Objects can then be recognized and tracked based on the motion of their compositions and on the shape they form. Finally, the combination of this flow-based representation with an appearance-based one is investigated. Besides evaluating the approach on a challenging video categorization database with significant camera motion and clutter, we also demonstrate that it generalizes to action recognition in a natural way.
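To make the idea of tracking a composition of interest points through a parametric motion model more concrete, the following is a minimal sketch. It assumes an affine model fitted by ordinary least squares to point correspondences between two frames; the abstract does not state which parametric model or estimator the authors use, so the model choice and the names `fit_affine`, `apply_affine`, and `solve3` are illustrative assumptions, not the paper's method.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve3(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(row) + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration (e.g. collinear points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / a[r][r]
    return x


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Least-squares affine motion between two point sets (assumed model).

    Returns [[a11, a12, tx], [a21, a22, ty]] such that
    x' = a11*x + a12*y + tx and y' = a21*x + a22*y + ty.
    """
    if len(src) != len(dst) or len(src) < 3:
        raise ValueError("need at least three point correspondences")
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params


def apply_affine(params: Sequence[Sequence[float]], p: Point) -> Point:
    """Move a point with the fitted affine motion."""
    x, y = p
    return (
        params[0][0] * x + params[0][1] * y + params[0][2],
        params[1][0] * x + params[1][1] * y + params[1][2],
    )


if __name__ == "__main__":
    # Hypothetical interest points of one composition in two consecutive frames.
    frame_t = [(10.0, 10.0), (20.0, 10.0), (10.0, 25.0), (22.0, 27.0)]
    frame_t1 = [(12.0, 11.0), (22.5, 11.5), (11.5, 26.0), (24.0, 28.5)]
    motion = fit_affine(frame_t, frame_t1)
    print(apply_affine(motion, (15.0, 15.0)))
```

In such a scheme the fitted parameters would predict where the composition lands in the next frame, and compositions with similar parameters could be grouped into one object; a robust estimator would likely be needed in practice because interest points flicker in and out between frames.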


Author information

Corresponding author

Correspondence to Björn Ommer.

Additional information

This work was supported in part by the Swiss national science foundation under contract no. 200021-107636.

About this article

Cite this article

Ommer, B., Mader, T. & Buhmann, J.M. Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera. Int J Comput Vis 83, 57–71 (2009). https://doi.org/10.1007/s11263-009-0211-7

  • DOI: https://doi.org/10.1007/s11263-009-0211-7
