Weakly supervised action anticipation without object annotations

  • Research Article
  • Published:
Frontiers of Computer Science

Abstract

Anticipating a future action before any part of it has been observed plays an important role in action prediction and is a challenging task. To gather rich cues for anticipation, some methods integrate multimodal contexts, including scene object labels; however, exhaustively labelling every frame in a video dataset requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from the current action video to predict the next action label, without requiring scene context labels. Specifically, we extract diverse types of local features through weakly supervised learning, including object appearance and human pose representations, without ground-truth annotations. Moreover, we construct a graph convolutional network to exploit the inherent relationships between humans and objects in the current scene. We evaluate the proposed model on two datasets, MPII-Cooking and EPIC-Kitchens, and demonstrate the generalizability and effectiveness of our approach for action anticipation.
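The core idea of the relational component can be illustrated with a minimal sketch. This is not the authors' implementation: the node counts, feature dimension, label-space size, and fully connected adjacency below are all hypothetical, and the paper's actual model uses learned features and a more elaborate graph. The sketch only shows the generic shape of one graph-convolution step over human/object nodes, pooled into next-action scores.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes = 5    # hypothetical: 1 human-pose node + 4 detected-object nodes
feat_dim = 16    # hypothetical per-node appearance/pose feature size
num_actions = 10 # hypothetical size of the next-action label space

H = rng.standard_normal((num_nodes, feat_dim))        # node features
A = np.ones((num_nodes, num_nodes))                   # fully connected graph (self-loops included)
D_inv = np.diag(1.0 / A.sum(axis=1))                  # row-normalisation of the adjacency

W = rng.standard_normal((feat_dim, feat_dim))         # graph-convolution weight
W_cls = rng.standard_normal((feat_dim, num_actions))  # linear next-action classifier

# One graph-convolution layer with ReLU: each node aggregates its neighbours.
H1 = np.maximum(D_inv @ A @ H @ W, 0.0)

# Pool node features into a single graph-level representation, then score actions.
graph_feat = H1.mean(axis=0)
logits = graph_feat @ W_cls

print(logits.shape)  # (10,)
```

In a trained model, `W` and `W_cls` would be learned, and the graph scores would be fused with the global motion features described in the abstract before predicting the next action.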



Acknowledgements

This work was supported partially by the National Natural Science Foundation of China (NSFC) (Grant Nos. U1911401 and U1811461), Guangdong NSF Project (2020B1515120085, 2018B030312002), Guangzhou Research Project (201902010037), Research Projects of Zhejiang Lab (2019KD0AB03), and the Key-Area Research and Development Program of Guangzhou (202007030004).

Author information


Corresponding author

Correspondence to Wei-Shi Zheng.

Additional information

Yi Zhong received the BS degree from the School of Mathematics, Sun Yat-Sen University, China in 2015. He is pursuing a PhD degree in the School of Computer Science and Engineering, Sun Yat-Sen University, China. His research interest is action prediction.

Jia-Hui Pan received her bachelor's degree in computer science and technology from Sun Yat-Sen University, China in 2018. She is now an MS student in the School of Computer Science and Engineering, Sun Yat-Sen University, China. Her research interests include computer vision and machine learning.

Haoxin Li received the BS degree from the School of Electronics and Information Technology, Sun Yat-Sen University, China in 2018. He is pursuing an MS degree at Sun Yat-Sen University, China. His research interests include action and interaction analysis.

Wei-Shi Zheng is a Professor at Sun Yat-Sen University, China. He received the PhD degree in applied mathematics from Sun Yat-Sen University, China in 2008. He has published more than 100 papers, including more than 80 in major journals (TPAMI, TNN/TNNLS, TIP, TSMC-B, PR) and top conferences (ICCV, CVPR, IJCAI, AAAI). He helped organise four tutorial presentations at ACCV 2012, ICPR 2012, ICCV 2013, and CVPR 2015. His research interests include person/object association and activity understanding in visual surveillance, and the related large-scale machine learning algorithms; in particular, he has been actively researching person re-identification over the last five years. He serves many journals and conferences and was recognised for outstanding reviewing at recent top conferences (ECCV 2016 and CVPR 2017). He has participated in the Microsoft Research Asia Young Faculty Visiting Programme, and has served as a senior PC member, area chair, or associate editor for AVSS 2012, ICPR 2018, IJCAI 2019/2020, AAAI 2020, and BMVC 2018/2019. He is an IEEE MSA TC member, an associate editor of Pattern Recognition, and a recipient of the Royal Society-Newton Advanced Fellowship of the United Kingdom.

Electronic Supplementary Material


About this article

Cite this article

Zhong, Y., Pan, JH., Li, H. et al. Weakly supervised action anticipation without object annotations. Front. Comput. Sci. 17, 172313 (2023). https://doi.org/10.1007/s11704-022-1167-9
