
Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition

Published in: International Journal of Computer Vision

Abstract

We propose a novel approach, the max-margin heterogeneous information machine (MMHIM), for human action recognition from RGB-D videos. MMHIM fuses heterogeneous RGB visual features and depth features, and learns effective action classifiers from the fused features. The rich heterogeneous visual and depth data are compressed and projected onto a learned shared space and onto independent private spaces, reducing noise while capturing the information useful for recognition. Knowledge from each source can then be shared with the others in the learned spaces to produce cross-modal features, guiding the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in the visual and depth features, we represent both RGB and depth data in matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices, and minimize the rank of the model parameters to build a low-rank classifier with improved generalization power. We also extend MMHIM to a structured prediction model capable of producing structured outputs. Extensive experiments on a new RGB-D action dataset and two other public RGB-D action datasets show that our approaches achieve state-of-the-art results. We also obtain promising results when RGB or depth data are missing during training or testing.
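To make the low-rank bilinear model concrete, here is a minimal NumPy sketch of a bilinear hinge-loss classifier with a rank-k factorization W = U Vᵀ of the parameter matrix. The function names, the plain subgradient training loop, and the choice of k are illustrative assumptions on our part, not the authors' implementation (which minimizes the rank within a max-margin objective and also learns the shared and private projections):

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_score(X, U, V):
    """Score f(X) = trace(W^T X) with W = U @ V.T, computed without
    forming W explicitly: trace((U V^T)^T X) = sum(U * (X @ V))."""
    return float(np.sum(U * (X @ V)))

def train_bilinear_hinge(Xs, ys, k=5, C=1.0, lr=1e-3, epochs=100):
    """Subgradient descent on a rank-k bilinear hinge loss.

    Xs : list of d_r x d_c matrix-form feature matrices
    ys : labels in {-1, +1}
    """
    d_r, d_c = Xs[0].shape
    U = 0.01 * rng.standard_normal((d_r, k))  # row-factor parameters
    V = 0.01 * rng.standard_normal((d_c, k))  # column-factor parameters
    for _ in range(epochs):
        for X, y in zip(Xs, ys):
            margin = y * bilinear_score(X, U, V)
            gU, gV = U.copy(), V.copy()   # Frobenius regularizer gradients
            if margin < 1.0:              # hinge is active: add loss term
                gU -= C * y * (X @ V)     # d f / d U = X @ V
                gV -= C * y * (X.T @ U)   # d f / d V = X^T @ U
            U -= lr * gU
            V -= lr * gV
    return U, V
```

In this sketch X stands for any matrix-form feature; in the paper's setting it would be the fused representation built from the shared and private spaces of the RGB and depth modalities. Because W = U Vᵀ can never exceed rank k, the factorization itself acts as the low-rank constraint.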



Notes

  1. https://forge.lip6.fr/projects/nrbm

  2. Please refer to the supplemental material for details.

  3. Please refer to the supplemental material for the formulations of bilinear SVM, BHIM, and MMHIM in single-modality learning (a generic bilinear SVM form is sketched after these notes).

  4. Technically, the feature O here is not shared between the two modalities, as it is computed only from RGB data.
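For orientation, a generic bilinear SVM over matrix-form features takes the shape sketched below. This is a standard form written in our own notation, with $W_r \in \mathbb{R}^{d_r \times k}$ and $W_c \in \mathbb{R}^{d_c \times k}$ as row and column factors; it is not the exact objective from the supplemental material:

$$\min_{W_r,\,W_c}\ \frac{1}{2}\,\mathrm{tr}\!\left(W_r W_c^{\top} W_c W_r^{\top}\right) \;+\; C \sum_{i=1}^{N} \max\!\left(0,\ 1 - y_i\,\mathrm{tr}\!\left(W_c W_r^{\top} X_i\right)\right),$$

where $X_i \in \mathbb{R}^{d_r \times d_c}$ is the matrix-form feature of the $i$-th video, $y_i \in \{-1,+1\}$ its label, and the factorization $W = W_r W_c^{\top}$ caps the rank of the effective parameter matrix at $k$.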



Acknowledgements

This work is supported in part by the NSF IIS Award 1651902, NSF CNS Award 1314484, ONR Award N00014-12-1-1028, ONR Young Investigator Award N00014-14-1-0484, and U.S. Army Research Office Young Investigator Award W911NF-14-1-0218.

Author information


Corresponding author

Correspondence to Yu Kong.

Additional information

Communicated by M. Hebert.


About this article


Cite this article

Kong, Y., Fu, Y. Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition. Int J Comput Vis 123, 350–371 (2017). https://doi.org/10.1007/s11263-016-0982-6

