Abstract
In this paper, we propose a novel method for extracting spatio-temporal features from videos. Given a video, we extract features from every set of N consecutive frames, where N is small enough to guarantee the temporal density of the features. For each frame set, we first extract dense SURF keypoints from its first frame. We then select the points with the most dominant and reliable movements as interest points. Next, we group the interest points into triangles using Delaunay triangulation and track the points of each triangle through the frame set. From each triangle we extract one spatio-temporal feature based on its shape together with the visual features and optical flows of its points. This lets us build spatio-temporal features from groups of related points and their trajectories, so the features can be expected to be both robust and informative. We apply Fisher Vector encoding to represent videos using the proposed spatio-temporal features. Experiments on several challenging benchmarks show the effectiveness of the proposed method.
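As a rough illustration of the triangulation step described above (a minimal sketch, not the authors' implementation: SURF detection, movement-based point selection, and tracking are omitted, and the point coordinates are placeholders), assuming SciPy is available:

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical 2D interest-point locations selected in the first frame
# of a frame set (placeholder coordinates, not real SURF output).
points = np.array([
    [10.0, 10.0], [50.0, 12.0], [30.0, 40.0],
    [70.0, 35.0], [20.0, 70.0], [60.0, 65.0],
])

# Delaunay triangulation groups neighboring interest points into triangles;
# each triangle's three points would then be tracked through the N frames.
tri = Delaunay(points)
triangles = tri.simplices  # shape (num_triangles, 3): indices into `points`

for t in triangles:
    # In the proposed method, one spatio-temporal feature would be built per
    # triangle from its shape, the visual descriptors of its vertices,
    # and the optical flows of those vertices across the frame set.
    print(points[t])
```

Grouping points by triangulation rather than describing each trajectory independently is what allows the feature to capture the relative geometry of co-moving points.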
© 2014 Springer International Publishing Switzerland
Nga, D.H., Yanai, K. (2014). A Dense SURF and Triangulation Based Spatio-temporal Feature for Action Recognition. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O’Connor, N. (eds) MultiMedia Modeling. MMM 2014. Lecture Notes in Computer Science, vol 8325. Springer, Cham. https://doi.org/10.1007/978-3-319-04114-8_32
Print ISBN: 978-3-319-04113-1
Online ISBN: 978-3-319-04114-8