Weakly supervised action anticipation without object annotations

  • Research Article
  • Published:
Frontiers of Computer Science

Abstract

Anticipating a future action before any part of it has been observed plays an important role in action prediction and is a challenging task. To gather rich cues for anticipation, some methods integrate multimodal contexts, including scene object labels; however, exhaustively labelling every frame in a video dataset requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from the current action video to predict the next action label, without requiring scene context labels. Specifically, we extract diverse types of local features through weakly supervised learning, including object appearance and human pose representations, without ground-truth annotations. Moreover, we construct a graph convolutional network to exploit the inherent relationships between humans and objects in the current scene. We evaluate the proposed model on two datasets, MPII-Cooking and EPIC-Kitchens, and demonstrate the generalizability and effectiveness of our approach for action anticipation.
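The core idea of the relational component can be illustrated with a minimal sketch. This is not the authors' implementation: the node counts, feature dimension, label-space size, and fully connected adjacency below are all hypothetical, and the paper's actual model uses learned features and a more elaborate graph. The sketch only shows the generic shape of one graph-convolution step over human/object nodes, pooled into next-action scores.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes = 5    # hypothetical: 1 human-pose node + 4 detected-object nodes
feat_dim = 16    # hypothetical per-node appearance/pose feature size
num_actions = 10 # hypothetical size of the next-action label space

H = rng.standard_normal((num_nodes, feat_dim))        # node features
A = np.ones((num_nodes, num_nodes))                   # fully connected graph (self-loops included)
D_inv = np.diag(1.0 / A.sum(axis=1))                  # row-normalisation of the adjacency

W = rng.standard_normal((feat_dim, feat_dim))         # graph-convolution weight
W_cls = rng.standard_normal((feat_dim, num_actions))  # linear next-action classifier

# One graph-convolution layer with ReLU: each node aggregates its neighbours.
H1 = np.maximum(D_inv @ A @ H @ W, 0.0)

# Pool node features into a single graph-level representation, then score actions.
graph_feat = H1.mean(axis=0)
logits = graph_feat @ W_cls

print(logits.shape)  # (10,)
```

In a trained model, `W` and `W_cls` would be learned, and the graph scores would be fused with the global motion features described in the abstract before predicting the next action.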



Acknowledgements

This work was supported partially by the National Natural Science Foundation of China (NSFC) (Grant Nos. U1911401 and U1811461), Guangdong NSF Project (2020B1515120085, 2018B030312002), Guangzhou Research Project (201902010037), Research Projects of Zhejiang Lab (2019KD0AB03), and the Key-Area Research and Development Program of Guangzhou (202007030004).

Author information


Corresponding author

Correspondence to Wei-Shi Zheng.

Additional information

Yi Zhong received the BS degree from the School of Mathematics, Sun Yat-Sen University, China in 2015. He is pursuing a PhD degree in the School of Computer Science and Engineering, Sun Yat-Sen University, China. His research interest is action prediction.

Jia-Hui Pan received her bachelor's degree in computer science and technology from Sun Yat-Sen University, China in 2018. She is now an MS student in the School of Computer Science and Engineering, Sun Yat-Sen University, China. Her research interests include computer vision and machine learning.

Haoxin Li received the BS degree from the School of Electronics and Information Technology, Sun Yat-Sen University, China in 2018. He is pursuing an MS degree at Sun Yat-Sen University, China. His research interests include action and interaction analysis.

Wei-Shi Zheng is a Professor at Sun Yat-Sen University, China. He received the PhD degree in applied mathematics from Sun Yat-Sen University, China in 2008. He has published more than 100 papers, including more than 80 in major journals (TPAMI, TNN/TNNLS, TIP, TSMC-B, PR) and top conferences (ICCV, CVPR, IJCAI, AAAI). He helped organise four tutorial presentations at ACCV 2012, ICPR 2012, ICCV 2013, and CVPR 2015. His research interests include person/object association and activity understanding in visual surveillance, and the related large-scale machine learning algorithms; in particular, he has been actively researching person re-identification over the last five years. He serves many journals and conferences and was recognised for outstanding reviewing at recent top conferences (ECCV 2016 and CVPR 2017). He has participated in the Microsoft Research Asia Young Faculty Visiting Programme, and has served as a senior PC member, area chair, or associate editor for AVSS 2012, ICPR 2018, IJCAI 2019/2020, AAAI 2020, and BMVC 2018/2019. He is an IEEE MSA TC member, an associate editor of Pattern Recognition, and a recipient of the Royal Society-Newton Advanced Fellowship of the United Kingdom.

Electronic Supplementary Material


About this article

Cite this article

Zhong, Y., Pan, JH., Li, H. et al. Weakly supervised action anticipation without object annotations. Front. Comput. Sci. 17, 172313 (2023). https://doi.org/10.1007/s11704-022-1167-9
