Abstract
Although the traditional bag-of-words model, combined with local spatiotemporal features, has shown promising results for human action recognition, it discards all structural information among features, which carries important cues about motion structure in videos. Recent methods typically characterize relationships between quantized spatiotemporal features to overcome this drawback; however, the propagation of quantization error leads to unreliable representations. To alleviate this propagation, we present a coding method that considers not only the spatial similarity but also the reconstruction ability of visual words, after giving a probabilistic interpretation of the coding coefficients. Based on this coding method, we propose a new type of feature, the cumulative probability histogram, to robustly characterize contextual structural information around interest points; these features are extracted from multi-layered contexts and are assumed to be complementary to local spatiotemporal features. The proposed method is evaluated on four benchmark datasets. Experimental results show that it achieves better performance than previous action recognition methods.
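The coding idea sketched in the abstract can be illustrated with a minimal example: instead of committing each descriptor to a single visual word (hard assignment), spread probability-like weights over its k nearest words so that quantization error is not propagated by one irreversible choice. The Gaussian-style kernel, the parameter names (`k`, `beta`), and the toy codebook below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def localized_soft_assignment(x, codebook, k=5, beta=1.0):
    """Weight the k nearest visual words by a distance-based kernel so the
    coefficients form a probability distribution over the codebook.
    (Illustrative sketch, not the paper's exact coding scheme.)"""
    dists = np.sum((codebook - x) ** 2, axis=1)   # squared distances to all words
    nearest = np.argsort(dists)[:k]               # keep only the k nearest words
    w = np.exp(-beta * dists[nearest])
    coeffs = np.zeros(len(codebook))
    coeffs[nearest] = w / w.sum()                 # normalize to sum to one
    return coeffs

# Toy example: 4 visual words in 2-D, two descriptors pooled into a histogram.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.1], [4.8, 5.1]])
hist = sum(localized_soft_assignment(d, codebook, k=2) for d in descriptors)
hist /= hist.sum()                                # video-level distribution
```

Sum-pooling the per-descriptor coefficients, as in the last lines, yields a histogram analogous to a bag-of-words representation, but each descriptor's ambiguity is preserved rather than collapsed onto one word.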
Notes
For the case of \(k=1\), CPH features based on our coding method reduce to those based on hard-assignment coding, so we report the hard-assignment result as the accuracy for \(k=1\).
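The reduction claimed in this note can be checked with a small self-contained sketch. Using an illustrative kNN soft-assignment (again an assumption, not necessarily the paper's exact formulation), setting \(k=1\) places all coefficient mass on the single nearest visual word, which is exactly hard-assignment coding:

```python
import numpy as np

def knn_soft_coding(x, codebook, k, beta=1.0):
    """Illustrative kNN soft-assignment: distance-based weights over the
    k nearest visual words, normalized to sum to one."""
    dists = np.sum((codebook - x) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    w = np.exp(-beta * dists[nearest])
    coeffs = np.zeros(len(codebook))
    coeffs[nearest] = w / w.sum()
    return coeffs

codebook = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
x = np.array([0.2, 0.1])
hard = knn_soft_coding(x, codebook, k=1)  # all mass on the nearest word
soft = knn_soft_coding(x, codebook, k=3)  # mass spread over all three words
```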
Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant 106112013CDJZR120014 and Scientific and Technological Research Program of Chongqing Municipal Education Commission of China under Grant KJ1401207.
Cite this article
Li, Y., Ye, J., Wang, T. et al. Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition. Vis Comput 31, 1383–1394 (2015). https://doi.org/10.1007/s00371-014-1020-8