DOI: 10.1145/3126686.3126705
research-article

Spatiotemporal Multi-Task Network for Human Activity Understanding

Published: 23 October 2017

ABSTRACT

Recently, remarkable progress has been achieved in human action recognition and detection through deep learning techniques. However, for action detection in real-world untrimmed videos, the accuracy of most existing approaches is still far from satisfactory, owing to the difficulty of temporal action localization. Moreover, spatiotemporal features are not well utilized in recent work on video analysis. To tackle these problems, we propose a spatiotemporal, multi-task, 3D deep convolutional neural network to detect (i.e., temporally localize and recognize) actions in untrimmed videos. First, we introduce a fusion framework that extracts video-level spatiotemporal features during training, and we demonstrate the effectiveness of these video-level features by evaluating our model on the human action recognition task. Then, under this fusion framework, we propose a spatiotemporal multi-task network with two sibling output layers, one for action classification and one for temporal localization. To obtain precise temporal locations, we present a novel temporal regression method that revises the proposal window containing an action. Meanwhile, to better exploit the rich motion information in videos, we introduce a novel video representation, interlaced images, as an additional network input stream. As a result, our model outperforms state-of-the-art methods for both action recognition and detection on standard benchmarks.
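The two sibling output layers and the temporal regression step described above can be made concrete with a short sketch. The paper's own code is not shown on this page, so the following PyTorch-style Python is only an illustration under assumptions: the feature dimension, class count, shared layer, and the Fast R-CNN-style window parameterization (center shift plus log-length scaling) are choices made for this sketch, not details confirmed by the abstract.

    # Illustrative sketch (not the authors' code): a multi-task head with two
    # sibling output layers over shared video-level spatiotemporal features.
    import torch
    import torch.nn as nn

    class MultiTaskHead(nn.Module):
        """One head scores action classes; the sibling head regresses
        temporal offsets that revise a proposal window."""
        def __init__(self, feat_dim: int = 4096, num_classes: int = 20):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU())
            self.cls_head = nn.Linear(1024, num_classes + 1)  # +1: background
            self.loc_head = nn.Linear(1024, 2)                # (d_center, d_length)

        def forward(self, feats):
            h = self.shared(feats)
            return self.cls_head(h), self.loc_head(h)

    def refine_window(start, end, offsets):
        """Revise a proposal window [start, end] using predicted offsets,
        assuming a Fast R-CNN-style parameterization (an assumption here)."""
        center, length = 0.5 * (start + end), end - start
        d_center, d_length = offsets
        new_center = center + d_center * length
        new_length = length * torch.exp(d_length)
        return new_center - 0.5 * new_length, new_center + 0.5 * new_length

    # Usage: classify one proposal and refine its temporal boundaries.
    head = MultiTaskHead()
    feats = torch.randn(1, 4096)              # video-level feature (assumed dim)
    scores, offsets = head(feats)
    start, end = refine_window(torch.tensor(2.0), torch.tensor(6.0), offsets[0])

At training time the two heads would be supervised jointly, e.g., a classification loss plus a regression loss on proposals that overlap a ground-truth action; the abstract does not specify the exact loss weighting.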


Published in

Thematic Workshops '17: Proceedings of the Thematic Workshops of ACM Multimedia 2017
October 2017, 558 pages
ISBN: 9781450354165
DOI: 10.1145/3126686
Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



