
TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization

Research Article, Machine Intelligence Research

Abstract

Action recognition and localization in untrimmed videos are important for many applications and have attracted considerable attention. Since full supervision with frame-level annotations places an overwhelming burden on manual labeling, learning with weak video-level supervision is a promising alternative. In this paper, we propose a novel weakly supervised framework that simultaneously recognizes actions and locates the corresponding frames in untrimmed videos. Since abundant trimmed videos are publicly available, well segmented, and annotated with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, so that the influence of background frames is effectively eliminated. A learning architecture with twin networks for trimmed and untrimmed videos is designed to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (i.e., THUMOS14, ActivityNet1.3, and MEXaction2), and the experimental results clearly corroborate the efficacy of our method. It is especially encouraging to see that the proposed weakly supervised method even achieves results comparable to some fully supervised methods.
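The abstract outlines two technical ingredients: a self-attention mechanism that pools frame features into a compact video representation learned from video-level labels only, and a knowledge transfer step driven by inter-class semantic relevance. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration under assumed names and sizes (SelfAttentivePooling, feat_dim=1024, a 20-class target set, and a hypothetical transfer_scores helper) of how temporal attention weights trained with only video-level supervision can double as frame-level localization cues.

```python
# Minimal sketch (assumptions, not the paper's code): self-attentive pooling
# over pre-extracted frame features, trained with video-level labels only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int = 1024, num_classes: int = 20):
        super().__init__()
        # Attention branch: one scalar relevance score per frame.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.Tanh(), nn.Linear(256, 1)
        )
        # Video-level classifier over the attention-weighted representation.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, feat_dim) pre-extracted features.
        scores = self.attention(frames)             # (B, T, 1)
        weights = F.softmax(scores, dim=1)          # attention over time
        video_repr = (weights * frames).sum(dim=1)  # compact video representation
        logits = self.classifier(video_repr)        # video-level prediction
        # At test time, the temporal weights serve as localization cues.
        return logits, weights.squeeze(-1)

def transfer_scores(src_logits: torch.Tensor, relevance: torch.Tensor):
    # Hypothetical inter-class transfer: map source (trimmed-video) class
    # scores to target classes through a row-normalized semantic-relevance
    # matrix of shape (C_src, C_tgt).
    return src_logits @ relevance

# Usage: train with video-level labels only; threshold the returned
# attention weights to localize action frames.
model = SelfAttentivePooling()
feats = torch.randn(2, 400, 1024)  # 2 untrimmed videos, 400 frames each
logits, attn = model(feats)
loss = F.cross_entropy(logits, torch.tensor([3, 7]))
```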



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61871378, U2003111, 62122013, and U2001211).

Author information

Corresponding author

Correspondence to Hai-Chao Shi.

Additional information

Colored figures are available in the online version at https://link.springer.com/journal/11633

Xiao-Yu Zhang received the B.Sc. degree in computer science from Nanjing University of Science and Technology, China in 2005, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2010. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, China. He has authored or coauthored more than 60 refereed publications in international journals and conferences. He is a Senior Member of the ACM, CCF, and CSIG. His awards and honors include the Silver Prize of the Microsoft Cup IEEE China Student Paper Contest in 2009, the Second Prize of the Wu Wen-Jun AI Science & Technology Innovation Award in 2016, the CCCV Best Paper Nominate Award in 2017, the Third Prize of the BAST Beijing Excellent S&T Paper Award in 2018, and the Second Prize of the CSIG Science & Technology Award in 2019.

His research interests include artificial intelligence, data mining, and computer vision.

Hai-Chao Shi received the B.Sc. degree in software engineering from Beijing Technology and Business University, China in 2017. He is currently a Ph.D. candidate in cyberspace security with the National Engineering Laboratory of Information Content Security Technology, Institute of Information Engineering, Chinese Academy of Sciences, China.

His research interests include temporal action localization, action recognition, and pattern recognition.

Chang-Sheng Li received the B.Eng. degree in electronic engineering from the University of Electronic Science and Technology of China (UESTC), China in 2008, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2013. He was a research assistant with The Hong Kong Polytechnic University, China from 2009 to 2010. He has since worked with IBM Research-China, Alibaba Group, and UESTC. He is currently a professor with Beijing Institute of Technology, China. He has authored or coauthored more than 40 refereed publications in international journals and conferences.

His research interests include machine learning, data mining, and computer vision.

Li-Xin Duan received the B.Eng. degree in electronic and information science from the University of Science and Technology of China (USTC), China in 2008, and the Ph.D. degree in computer engineering from Nanyang Technological University (NTU), Singapore in 2012. He is currently a full professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China. He was selected into "The Thousand Talents Plan for Young Professionals" by the Organization Department of the Communist Party of China in 2017. Prior to that, he worked as a research scientist at Amazon's Seattle headquarters in the United States and at the Institute for Infocomm Research (I2R) in Singapore. He has published 35+ papers in international journals and conferences, which have received 2200+ citations with an H-index of 18, according to Google Scholar. Among those, 9 papers (7 of them first-authored) have received 100+ citations, and 3 have been selected as Highly Cited Papers by Essential Science Indicators (ESI). He received the MSRA Fellowship Award in 2009, the Best Student Paper Award at CVPR 2010, and the Outstanding Reviewer Award at CVPR 2012. For academic services, he served as Journal Track Chair at IJCAI 2015, Area Chair at ICPR 2016, and Senior Program Committee Member at IJCAI 2017. He also organized and hosted a workshop on Practical Transfer Learning at ICDM 2015. He has been serving as a Program Committee Member at various international conferences.

His research interests include machine learning algorithms (especially transfer learning and domain adaptation) and their applications in object recognition/detection/segmentation, video event recognition, and ocular image analysis.


Cite this article

Zhang, XY., Shi, HC., Li, CS. et al. TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization. Mach. Intell. Res. 19, 227–246 (2022). https://doi.org/10.1007/s11633-022-1333-4
