
Multiple Granularity Modeling: A Coarse-to-Fine Framework for Fine-grained Action Analysis

International Journal of Computer Vision

Abstract

Detecting fine-grained human actions in video is challenging. In this work, we decompose this difficult analysis problem into two sequential tasks of increasing granularity. First, we infer the coarse interaction status, i.e., which object is being manipulated and where the interaction occurs. To cope with the frequent mutual occlusions that arise during manipulation, we propose an interaction tracking framework in which hand (object) positions and interaction status are jointly tracked by explicitly modeling the occlusion context. Second, for a given query sequence, the inferred interaction status is used to efficiently identify a small set of candidate matching sequences from the annotated training set. Frame-level action labels are then transferred to the query sequence by establishing matches between the query and the candidate sequences. Comprehensive experiments on two challenging fine-grained activity datasets show that: (1) the proposed interaction tracking approach achieves high tracking accuracy for multiple mutually occluded objects (hands) during manipulation actions; and (2) the proposed multiple granularity analysis framework outperforms state-of-the-art action detection methods.
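To make the second, fine-grained stage concrete, the sketch below illustrates frame-level label transfer under strong simplifying assumptions: each frame is summarized by a feature vector, the candidate sequences have already been retrieved using the coarse interaction status, and matching is plain per-frame nearest neighbor rather than the sequence matching described in the paper. All names (`transfer_labels`, the toy data) are illustrative, not the authors' implementation.

```python
# A minimal, self-contained sketch of frame-level label transfer.
# Assumptions (not from the paper): per-frame feature vectors, pre-selected
# candidate sequences, and Euclidean nearest-neighbor matching.
import numpy as np

def transfer_labels(query_feats, candidates):
    """query_feats: (T, D) array of per-frame features for the query.
    candidates: list of (feats, labels) pairs; feats is (T_i, D),
    labels is (T_i,). Returns one action label per query frame."""
    # Pool all annotated candidate frames so each query frame can
    # borrow the label of its closest annotated frame.
    pool_feats = np.vstack([f for f, _ in candidates])
    pool_labels = np.concatenate([l for _, l in candidates])

    out = []
    for q in query_feats:
        # Euclidean nearest neighbor over the pooled candidate frames.
        j = np.argmin(np.linalg.norm(pool_feats - q, axis=1))
        out.append(pool_labels[j])
    return out

# Toy usage: 5 query frames, two annotated candidate sequences.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))
cands = [(rng.normal(size=(8, 16)), np.array(["cut"] * 8)),
         (rng.normal(size=(6, 16)), np.array(["stir"] * 6))]
print(transfer_labels(query, cands))
```

In the paper's setting, the coarse interaction status keeps this search tractable by shrinking the candidate pool before any frame-level matching is attempted.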


Notes

  1. We will make our toolbox, annotated datasets, and code implementations available at https://sites.google.com/site/bingbingni1983/ after publication.

  2. For the OAB tracker, we use the implementation provided at http://www.vision.ee.ethz.ch/boostingTrackers/download.htm

  3. For the TLD tracker, we use the implementation provided at http://personal.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282.

  • Berclaz, J., Fleuret, F., Turetken, E., & Fua, P. (2011). Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1806–1819.

  • Blackman, S. (2004). Multiple hypothesis tracking for multiple target tracking. IEEE Aerospace and Electronic Systems Magazine, 19(1), 5–18.

  • Brand, M., Oliver, N., & Pentland, A. (1997). Coupled hidden Markov models for complex action recognition. In IEEE conference on computer vision and pattern recognition (pp. 994–999).

  • Chang, C., Ansari, R., & Khokhar, A. (2005). Multiple object tracking with kernel particle filter. IEEE Conference on Computer Vision and Pattern Recognition, 1, 566–573.

  • Chen, H. T., Lin, H. H., & Liu, T. L. (2001). Multi-object tracking using dynamical graph matching. IEEE Conference on Computer Vision and Pattern Recognition, 2, 210–217.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (pp. 886–893).

  • de La Gorce, M., Paragios, N., & Fleet, D. (2008). Model-based hand tracking with texture, shading and self-occlusions. In IEEE conference on computer vision and pattern recognition.

  • Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance.

  • Escorcia, V., & Niebles, J. (2013). Spatio-temporal human-object interactions for action recognition in videos. In IEEE international conference on computer vision workshops (ICCVW) (pp. 508–514).

  • Fathi, A., Farhadi, A., & Rehg, J. M. (2011). Understanding egocentric activities. In International conference on computer vision (pp. 407–414).

  • Grabner, H., Grabner, M., & Bischof, H. (2006). Real-time tracking via online boosting. In British machine vision conference (pp. 47–56).

  • Gupta, A., & Davis, L. (2007). Objects in action: An approach for combining action understanding and object perception. In IEEE conference on computer vision and pattern recognition.

  • Han, M., Xu, W., Tao, H., & Gong, Y. (2004). An algorithm for multiple object trajectory tracking. IEEE Conference on Computer Vision and Pattern Recognition, 1, 864–871.

  • Iwase, S., & Saito, H. (2004). Parallel tracking of all soccer players by integrating detected positions in multiple view images. In International conference on pattern recognition (pp. 751–754).

  • Joachims, T., Finley, T., & Yu, C. N. J. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77(1), 27–59.

  • Kalal, Z., Matas, J., & Mikolajczyk, K. (2010). P-N learning: Bootstrapping binary classifiers by structural constraints. In IEEE conference on computer vision and pattern recognition (pp. 49–56).

  • Kjellström, H., Romero, J., Martínez, D., & Kragić, D. (2008). Simultaneous visual recognition of manipulation actions and manipulated objects. In European conference on computer vision (pp. 336–349).

  • Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D gradients. In British machine vision conference.

  • Koppula, H. S., Gupta, R., & Saxena, A. (2012). Learning human activities and object affordances from RGB-D videos. CoRR.

  • Kuehne, H., Arslan, A., & Serre, T. (2014). The language of actions: Recovering the syntax and semantics of goal-directed human activities. In IEEE conference on computer vision and pattern recognition (pp. 780–787).

  • Kyriazis, N., & Argyros, A. (2014). Scalable 3D tracking of multiple interacting objects. In IEEE conference on computer vision and pattern recognition.

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International conference on machine learning (pp. 282–289).

  • Lan, T., Wang, Y., Yang, W., & Mori, G. (2010). Beyond actions: Discriminative models for contextual group activities. In Advances in neural information processing systems (pp. 1216–1224).

  • Laptev, I., & Lindeberg, T. (2003). Space-time interest points. In IEEE international conference on computer vision.

  • Lei, J., Ren, X., & Fox, D. (2012). Fine-grained kitchen activity recognition using RGB-D. In ACM conference on ubiquitous computing (pp. 208–211).

  • Li, C., & Kitani, K. M. (2013). Pixel-level hand detection in ego-centric videos. In IEEE conference on computer vision and pattern recognition.

  • Liu, C., Yuen, J., & Torralba, A. (2011). Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2368–2382.

  • Liu, J., Carr, P., Collins, R. T., & Liu, Y. (2013). Tracking sports players with context-conditioned motion models. In IEEE conference on computer vision and pattern recognition.

  • Lv, F., & Nevatia, R. (2007). Single view human action recognition using key pose matching and Viterbi path searching. In IEEE conference on computer vision and pattern recognition.

  • MacCormick, J., & Blake, A. (1999). A probabilistic exclusion principle for tracking multiple objects. IEEE International Conference on Computer Vision, 1, 572–578.

  • Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In IEEE international conference on computer vision.

  • Marszałek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In IEEE conference on computer vision and pattern recognition.

  • Moore, D., Essa, I., & Hayes, M. (1999). Exploiting human actions and object context for recognition tasks. In IEEE international conference on computer vision.

  • Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Conference on uncertainty in artificial intelligence (pp. 467–475).

  • Ni, B., Paramathayalan, V., & Moulin, P. (2014). Multiple granularity analysis for fine-grained action detection. In IEEE conference on computer vision and pattern recognition (pp. 756–763).

  • Ni, B., Moulin, P., & Yan, S. (2015). Pose adaptive motion feature pooling for human action analysis. International Journal of Computer Vision, 111(2), 229–248.

  • Niebles, J. C., Wang, H., & Fei-Fei, L. (2006). Unsupervised learning of human action categories using spatial-temporal words. In British machine vision conference.

  • Packer, B., & Koller, D. (2012). A combined pose, object, and feature model for action understanding. In IEEE conference on computer vision and pattern recognition.

  • Patron-Perez, A., Marszalek, M., Zisserman, A., & Reid, I. (2010). High five: Recognising human interactions in TV shows. In British machine vision conference.

  • Patterson, D. J., Fox, D., Kautz, H., & Philipose, M. (2005). Fine-grained activity recognition by aggregating abstract object usage. In IEEE international symposium on wearable computers (pp. 44–51).

  • Prest, A., Ferrari, V., & Schmid, C. (2013). Explicit modeling of human-object interactions in realistic videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 835–848.

  • Raptis, M., & Sigal, L. (2013). Poselet key-framing: A model for human activity recognition. In IEEE conference on computer vision and pattern recognition (pp. 2650–2657).

  • Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In IEEE conference on computer vision and pattern recognition.

  • Rohrbach, M., Amin, S., Andriluka, M., & Schiele, B. (2012). A database for fine grained activity detection of cooking activities. In IEEE conference on computer vision and pattern recognition.

  • Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. In International conference on computer vision (pp. 433–440).

  • Ryoo, M. S., & Aggarwal, J. (2009). Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In IEEE international conference on computer vision (pp. 1593–1600).

  • Shimada, A., Kondo, K., Deguchi, D., Morin, G., & Stern, H. (2013). Kitchen scene context based gesture recognition: A contest in ICPR 2012. In Advances in depth image analysis and applications, Vol. 7854 (pp. 168–185). http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/index.html.

  • Trinh, H., Fan, Q., Gabbur, P., & Pankanti, S. (2012). Hand tracking by binary quadratic programming and its application to retail activity recognition. In IEEE conference on computer vision and pattern recognition (pp. 1902–1909).

  • Vahdat, A., Gao, B., Ranjbar, M., & Mori, G. (2011). A discriminative key pose sequence model for recognizing human interactions. In ICCV workshop (pp. 1729–1736).

  • Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In IEEE international conference on computer vision.

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2011). Action recognition by dense trajectories. In IEEE conference on computer vision and pattern recognition (pp. 3169–3176).

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.

  • Wang, Y., & Mori, G. (2011). Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1310–1323.

  • Wesierski, D., & Horain, P. (2013). Pose-configurable generic tracking of elongated objects. In IEEE international conference on computer vision (pp. 2920–2927).

  • Wu, J., Osuntogun, A., Choudhury, T., Philipose, M., & Rehg, J. (2007). A scalable approach to activity recognition based on object use. In IEEE international conference on computer vision.

  • Xu, M., Orwell, J., & Jones, G. (2004). Tracking football players with multiple cameras. International Conference on Image Processing, 5, 2909–2912.

  • Yang, C., Duraiswami, R., & Davis, L. (2005). Fast multiple object tracking via a hierarchical particle filter. IEEE International Conference on Computer Vision, 1, 212–219.

  • Yang, M., Wu, Y., & Hua, G. (2009). Context-aware visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(7), 1195–1209.

  • Yang, Y., Fermuller, C., & Aloimonos, Y. (2013). Detection of manipulation action consequences (MAC). In IEEE conference on computer vision and pattern recognition.

  • Yao, B., Khosla, A., & Fei-Fei, L. (2011). Classifying actions and measuring action similarity by modeling the mutual context of objects and human poses. In International conference on machine learning.

  • Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1728–1743.

  • Zhao, T., & Nevatia, R. (2004). Tracking multiple humans in crowded environment. IEEE Conference on Computer Vision and Pattern Recognition, 2, 406–413.

  • Zhou, Y., Ni, B., Yan, S., Moulin, P., & Tian, Q. (2014). Pipelining localized semantic features for fine-grained action recognition. In European conference on computer vision (pp. 481–496).


Acknowledgments

This study is supported by the research grant for the Human Sixth Sense Programme at the Advanced Digital Sciences Center from Singapore’s Agency for Science, Technology and Research (A*STAR).

Author information


Corresponding author

Correspondence to Bingbing Ni.

Additional information

Communicated by Ivan Laptev.


Cite this article

Ni, B., Paramathayalan, V.R., Li, T. et al. Multiple Granularity Modeling: A Coarse-to-Fine Framework for Fine-grained Action Analysis. Int J Comput Vis 120, 28–43 (2016). https://doi.org/10.1007/s11263-016-0891-8
