Abstract
Action recognition and localization in untrimmed videos are important for many applications and have attracted considerable attention. Since full supervision with frame-level annotation places an overwhelming burden on manual labeling, learning with weak video-level supervision is a promising alternative. In this paper, we propose a novel weakly supervised framework that simultaneously recognizes actions and locates the corresponding frames in untrimmed videos. Since abundant trimmed videos are publicly available, well segmented, and annotated with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames is effectively suppressed. A learning architecture with twin networks for trimmed and untrimmed videos is designed to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (THUMOS14, ActivityNet1.3, and MEXaction2), and the results clearly corroborate the efficacy of our method. It is especially encouraging that the proposed weakly supervised method achieves results comparable to those of some fully supervised methods.
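To make the two key ingredients of the abstract concrete, the sketch below illustrates (i) self-attentive pooling of per-frame features into a compact video-level representation that down-weights background frames, and (ii) an inter-class semantic relevance matrix for transferring knowledge from trimmed-video classes to untrimmed-video classes. This is a minimal PyTorch sketch under our own assumptions: the names SelfAttentivePooling and semantic_relevance, and the use of label-name embeddings for relevance, are illustrative and do not reproduce the paper's actual TwinNet modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """Pool per-frame features into one video-level vector.

    Frames scored as background receive low attention weights, so their
    contribution to the pooled representation is suppressed.
    (Illustrative sketch, not the paper's exact module.)
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, feat_dim), e.g., from a pretrained backbone
        logits = self.score(frames).squeeze(-1)        # (batch, num_frames)
        attn = F.softmax(logits, dim=1)                # attention over time
        video = torch.bmm(attn.unsqueeze(1), frames)   # (batch, 1, feat_dim)
        return video.squeeze(1), attn                  # attn also hints at action locations


def semantic_relevance(src_label_emb: torch.Tensor,
                       tgt_label_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between trimmed (source) and untrimmed (target) classes.

    Label embeddings could come from, e.g., word vectors of class names
    (an assumption of this sketch). Returns (num_src, num_tgt).
    """
    src = F.normalize(src_label_emb, dim=1)
    tgt = F.normalize(tgt_label_emb, dim=1)
    return src @ tgt.t()


if __name__ == "__main__":
    pool = SelfAttentivePooling(feat_dim=1024)
    frames = torch.randn(2, 400, 1024)                 # two videos, 400 frames each
    video_repr, attn = pool(frames)
    print(video_repr.shape, attn.shape)                # (2, 1024) and (2, 400)

    rel = semantic_relevance(torch.randn(101, 300), torch.randn(20, 300))
    print(rel.shape)                                   # (101, 20)
```

In a twin-network setup of the kind the abstract describes, one plausible arrangement is for one branch to consume trimmed videos and the other untrimmed videos, with shared attention parameters and the relevance matrix weighting how class-level knowledge flows between the two branches; the attention weights over time then double as a localization cue.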
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61871378, U2003111, 62122013 and U2001211).
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Author information
Xiao-Yu Zhang received the B.Sc. degree in computer science from Nanjing University of Science and Technology, China in 2005, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2010. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, China. He has authored or coauthored more than 60 refereed publications in international journals and conferences. He is a Senior Member of the ACM, CCF, and CSIG. His awards and honors include the Silver Prize of the Microsoft Cup IEEE China Student Paper Contest in 2009, the Second Prize of the Wu Wen-Jun AI Science & Technology Innovation Award in 2016, the CCCV Best Paper Nomination Award in 2017, the Third Prize of the BAST Beijing Excellent S&T Paper Award in 2018, and the Second Prize of the CSIG Science & Technology Award in 2019.
His research interests include artificial intelligence, data mining, and computer vision.
Hai-Chao Shi received the B.Sc. degree in software engineering from Beijing Technology and Business University, China in 2017. He is currently a Ph.D. candidate in cyberspace security with the National Engineering Laboratory of Information Content Security Technology, Institute of Information Engineering, Chinese Academy of Sciences, China.
His research interests include temporal action localization, action recognition, and pattern recognition.
Chang-Sheng Li received the B.Eng. degree in electronic engineering from the University of Electronic Science and Technology of China (UESTC), China in 2008, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2013. He was a research assistant with The Hong Kong Polytechnic University, China from 2009 to 2010, and subsequently worked with IBM Research-China, Alibaba Group, and UESTC. He is currently a professor with Beijing Institute of Technology, China. He has authored or coauthored more than 40 refereed publications in international journals and conferences.
His research interests include machine learning, data mining, and computer vision.
Li-Xin Duan received the B.Eng. degree in electronic and information science from the University of Science and Technology of China (USTC), China in 2008, and the Ph.D. degree in computer engineering from Nanyang Technological University (NTU), Singapore in 2012. He is currently a full professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, and was selected into "The Thousand Talents Plan for Young Professionals" by the Organization Department of the Communist Party of China in 2017. Prior to that, he worked as a research scientist at Amazon's Seattle headquarters in the United States and at the Institute for Infocomm Research (I2R) in Singapore. He has published more than 35 papers in international journals and conferences, which have received over 2200 citations with an h-index of 18, according to Google Scholar. Among those, 9 papers (7 of them first-authored) have each received more than 100 citations, and 3 have been selected as Highly Cited Papers by Essential Science Indicators (ESI). He received the MSRA Fellowship Award in 2009, the Best Student Paper Award at CVPR 2010, and the Outstanding Reviewer Award at CVPR 2012. For academic services, he served as Journal Track Chair at IJCAI 2015, Area Chair at ICPR 2016, and Senior Program Committee Member at IJCAI 2017. He also organized and hosted a workshop on Practical Transfer Learning at ICDM 2015, and has been serving as a Program Committee Member at various international conferences.
His research interests include machine learning algorithms (especially transfer learning and domain adaptation) and their applications in object recognition, detection and segmentation, video event recognition, and ocular image analysis.
Cite this article
Zhang, XY., Shi, HC., Li, CS. et al. TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization. Mach. Intell. Res. 19, 227–246 (2022). https://doi.org/10.1007/s11633-022-1333-4