Abstract
Category-level object recognition, segmentation, and tracking in videos become highly challenging when applied to sequences from a hand-held camera that features extensive motion and zooming. An additional challenge is then to develop a fully automatic video analysis system that works without manual initialization of a tracker or other human intervention, both during training and during recognition, despite background clutter and other distracting objects. Moreover, our working hypothesis states that category-level recognition is possible based only on an erratic, flickering pattern of interest point locations without extracting additional features. Compositions of these points are then tracked individually by estimating a parametric motion model. Groups of compositions segment a video frame into the various objects that are present and into background clutter. Objects can then be recognized and tracked based on the motion of their compositions and on the shape they form. Finally, the combination of this flow-based representation with an appearance-based one is investigated. Besides evaluating the approach on a challenging video categorization database with significant camera motion and clutter, we also demonstrate that it generalizes to action recognition in a natural way.
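To make the idea of tracking a group of points with a parametric motion model concrete, the following is a minimal sketch rather than the method of the article. It assumes an affine model and an ordinary least-squares fit to point correspondences between two frames; the abstract does not state which parametric model or estimator is used. The names `fit_affine`, `apply_affine`, and `solve_3x3` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_3x3(m: List[List[float]], b: List[float]) -> List[float]:
    """Solve the 3x3 system m @ x = b by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]  # augmented matrix
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration (e.g. collinear points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / a[r][r]
    return x


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Least-squares affine motion [[a11, a12, tx], [a21, a22, ty]] mapping src to dst.

    Assumption: an affine model stands in for the unspecified parametric model.
    """
    if len(src) != len(dst) or len(src) < 3:
        raise ValueError("need at least three point correspondences")
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (A^T A) p = A^T b, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    model = []
    for k in range(2):
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        model.append(solve_3x3(ata, atb))
    return model


def apply_affine(model: List[List[float]], p: Point) -> Point:
    """Move a point from the first frame to the second with the fitted model."""
    (a11, a12, tx), (a21, a22, ty) = model
    return (a11 * p[0] + a12 * p[1] + tx, a21 * p[0] + a22 * p[1] + ty)


if __name__ == "__main__":
    # Synthetic correspondences: zoom by a factor of 2 plus a shift of (3, -1).
    src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    dst = [(3.0, -1.0), (5.0, -1.0), (3.0, 1.0), (5.0, 1.0)]
    print(fit_affine(src, dst))
```

In this sketch, a zoom shows up in the diagonal entries of the fitted matrix and a camera pan in the translation column, which is one reason a low-dimensional parametric model can be a convenient summary of how a group of points moves between frames.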
Additional information
This work was supported in part by the Swiss National Science Foundation under contract no. 200021-107636.
Cite this article
Ommer, B., Mader, T. & Buhmann, J.M. Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera. Int J Comput Vis 83, 57–71 (2009). https://doi.org/10.1007/s11263-009-0211-7
DOI: https://doi.org/10.1007/s11263-009-0211-7