Temporal Interval Regression Network for Video Action Detection

Wang, Qing; Qing, Laiyun; Miao, Jun; Duan, Lijuan

doi:10.1007/978-3-319-77380-3_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10735)

Included in the following conference series:

Pacific Rim Conference on Multimedia

Abstract

Temporal action detection in untrimmed video is an important and challenging task in computer vision. In this paper, we propose a straightforward and efficient regression model that detects action instances and refines their temporal intervals in long untrimmed videos. We train a single 3D convolutional network (3D ConvNet) jointly with two sibling output layers: a classification layer that predicts the class label, and a temporal interval regression layer that adjusts the temporal localization of the input proposal. We also introduce an effective method for sampling positive and negative proposals that are discriminative for the feature extractor and the classifier during training. On the THUMOS 2014 dataset, our method achieves competitive performance compared with recent state-of-the-art methods.
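The abstract does not give the regression parameterization or the sampling thresholds, so the following is only a minimal sketch of how a temporal interval regression target and an overlap-based proposal labeling could look. It assumes a Fast R-CNN style encoding (center shift normalized by proposal length, plus log length ratio) transferred to one-dimensional intervals. The function names, the `(start, end, class_id)` ground-truth layout, and the thresholds `pos_thresh=0.5` and `neg_thresh=0.3` are hypothetical illustrations, not values taken from the paper.

```python
import math


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def encode_offsets(proposal, target):
    """Regression targets for a proposal (assumed Fast R-CNN style, in 1D).

    Returns (normalized center shift, log length ratio).
    """
    p_len = proposal[1] - proposal[0]
    t_len = target[1] - target[0]
    p_ctr = proposal[0] + 0.5 * p_len
    t_ctr = target[0] + 0.5 * t_len
    return ((t_ctr - p_ctr) / p_len, math.log(t_len / p_len))


def apply_offsets(proposal, offsets):
    """Refine a proposal with predicted offsets (inverse of encode_offsets)."""
    p_len = proposal[1] - proposal[0]
    p_ctr = proposal[0] + 0.5 * p_len
    ctr = p_ctr + offsets[0] * p_len
    length = p_len * math.exp(offsets[1])
    return (ctr - 0.5 * length, ctr + 0.5 * length)


def label_proposal(proposal, ground_truths, pos_thresh=0.5, neg_thresh=0.3):
    """Assign a training label to a proposal by its best overlap.

    ground_truths: list of (start, end, class_id) with class_id >= 1.
    Returns class_id for a positive, 0 for background, or None when the
    overlap is ambiguous and the proposal is skipped. Thresholds are
    hypothetical.
    """
    best_iou, best_cls = 0.0, 0
    for start, end, cls in ground_truths:
        iou = temporal_iou(proposal, (start, end))
        if iou > best_iou:
            best_iou, best_cls = iou, cls
    if best_iou >= pos_thresh:
        return best_cls
    if best_iou < neg_thresh:
        return 0
    return None


# Round trip: encoding a target and applying the offsets recovers it.
proposal, target = (0.0, 10.0), (2.0, 14.0)
refined = apply_offsets(proposal, encode_offsets(proposal, target))
```

In a two-head network of the kind the abstract describes, the classification layer would be trained on labels like those from `label_proposal`, while the regression layer would be trained on `encode_offsets` targets for positive proposals only; at test time `apply_offsets` would turn the regression output back into a refined interval.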

Acknowledgments

This research is partially sponsored by the Natural Science Foundation of China (Nos. 61472387, 61650201 and 61370113), the Beijing Natural Science Foundation (Nos. 4152005 and 4162058), the Science and Technology Program of Tianjin (15YFXQGX0050), and the Qinghai Natural Science Foundation (2016-ZJ-Y04).

Author information

Authors and Affiliations

School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 101400, China
Qing Wang & Laiyun Qing
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
Qing Wang & Laiyun Qing
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, School of Computer Science, Beijing Information Science and Technology University, Beijing, 100101, China
Jun Miao
Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
Lijuan Duan

Corresponding author

Correspondence to Laiyun Qing.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Bing Zeng
University of Chinese Academy of Sciences, Beijing, China
Qingming Huang
University of Ottawa, Ottawa, Ontario, Canada
Abdulmotaleb El Saddik
University of Electronic Science and Technology of China, Chengdu, China
Hongliang Li
Chinese Academy of Sciences, Beijing, China
Shuqiang Jiang
Harbin Institute of Technology, Harbin, China
Xiaopeng Fan

About this paper

Cite this paper

Wang, Q., Qing, L., Miao, J., Duan, L. (2018). Temporal Interval Regression Network for Video Action Detection. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science, vol 10735. Springer, Cham. https://doi.org/10.1007/978-3-319-77380-3_25

DOI: https://doi.org/10.1007/978-3-319-77380-3_25
Published: 10 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77379-7
Online ISBN: 978-3-319-77380-3
eBook Packages: Computer Science, Computer Science (R0)
