Abstract
This paper addresses the problem of activity localization and recognition in large-scale video datasets through the collaborative use of holistic information and motion-based information (motion cues). Holistic information is obtained through the concept of salient objects, while motion cues are derived from an affine motion model and optical flow. The motion cues compensate for camera motion and localize the object of interest within a set of object proposals. The holistic information and motion cues are then fused to obtain a reliable object of interest. In the recognition phase, holistic and motion-based features are extracted from the object of interest to train and test the classifier. An extreme learning machine is adopted as the classifier to reduce training and testing time and to increase classification accuracy. The effectiveness of the proposed approach is evaluated on the UCF Sports dataset. Detailed experiments show that the proposed approach outperforms state-of-the-art action localization and recognition approaches.
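The motion-cue step summarized above can be pictured with a short sketch. The following Python/OpenCV snippet is only an illustration of the general idea, not the authors' implementation: it estimates a global affine model from tracked corners to approximate camera motion, compensates for it, and keeps the magnitude of the residual dense optical flow as a cue for independently moving regions. All function choices and parameter values here are assumptions.

```python
# Minimal sketch of camera-motion compensation followed by a residual
# optical-flow motion cue. Illustrative only; parameters are assumptions.
import cv2
import numpy as np

def residual_motion_cue(prev_gray, curr_gray):
    # Track corners to estimate the dominant (camera) motion as an affine model.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_prev = pts[status.flatten() == 1]
    good_next = nxt[status.flatten() == 1]
    affine, _ = cv2.estimateAffinePartial2D(good_prev, good_next)
    if affine is None:
        # Fall back to the identity transform if estimation fails.
        affine = np.eye(2, 3, dtype=np.float32)

    # Warp the previous frame with the affine model so that camera motion
    # is (approximately) removed before computing dense optical flow.
    h, w = prev_gray.shape
    stabilized = cv2.warpAffine(prev_gray, affine, (w, h))

    # Dense optical flow on the compensated pair; its magnitude highlights
    # independently moving regions (candidate objects of interest).
    flow = cv2.calcOpticalFlowFarneback(stabilized, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=2)  # high values ~ object motion
```

In a pipeline of this kind, the resulting motion map would typically be used to score or rank object proposals before fusing them with the saliency-based holistic information.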
Acknowledgements
This work was supported by the Higher Education Commission of Pakistan.
Cite this article
Ullah, J., Jaffar, M.A. Object and motion cues based collaborative approach for human activity localization and recognition in unconstrained videos. Cluster Comput 21, 311–322 (2018). https://doi.org/10.1007/s10586-017-0825-4