Abstract
Action recognition and localization in untrimmed videos are important for many applications and have attracted considerable attention. Since full supervision with frame-level annotation places an overwhelming burden on manual labeling, learning with weak video-level supervision is a promising alternative. In this paper, we propose a novel weakly supervised framework that simultaneously recognizes actions and locates the corresponding frames in untrimmed videos. Since abundant trimmed videos are publicly available, well segmented, and annotated with semantic descriptions, the instructive knowledge learned on trimmed videos can be fully leveraged to analyze untrimmed videos. We present an effective knowledge transfer strategy based on inter-class semantic relevance. We also take advantage of the self-attention mechanism to obtain a compact video representation, such that the influence of background frames is effectively suppressed. A learning architecture with twin networks for trimmed and untrimmed videos is designed to facilitate transferable self-attentive representation learning. Extensive experiments are conducted on three untrimmed benchmark datasets (THUMOS14, ActivityNet1.3, and MEXaction2), and the results clearly corroborate the efficacy of our method. It is especially encouraging that the proposed weakly supervised method achieves results comparable to those of some fully supervised methods.
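To make the two key ingredients of the abstract concrete, the sketch below illustrates (i) self-attentive pooling of per-frame features into a compact video-level representation that down-weights background frames, and (ii) an inter-class semantic relevance matrix for transferring knowledge from trimmed-video classes to untrimmed-video classes. This is a minimal PyTorch sketch under our own assumptions: the names SelfAttentivePooling and semantic_relevance, and the use of label-name embeddings for relevance, are illustrative and do not reproduce the paper's actual TwinNet modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """Pool per-frame features into one video-level vector.

    Frames scored as background receive low attention weights, so their
    contribution to the pooled representation is suppressed.
    (Illustrative sketch, not the paper's exact module.)
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, feat_dim), e.g., from a pretrained backbone
        logits = self.score(frames).squeeze(-1)        # (batch, num_frames)
        attn = F.softmax(logits, dim=1)                # attention over time
        video = torch.bmm(attn.unsqueeze(1), frames)   # (batch, 1, feat_dim)
        return video.squeeze(1), attn                  # attn also hints at action locations


def semantic_relevance(src_label_emb: torch.Tensor,
                       tgt_label_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between trimmed (source) and untrimmed (target) classes.

    Label embeddings could come from, e.g., word vectors of class names
    (an assumption of this sketch). Returns (num_src, num_tgt).
    """
    src = F.normalize(src_label_emb, dim=1)
    tgt = F.normalize(tgt_label_emb, dim=1)
    return src @ tgt.t()


if __name__ == "__main__":
    pool = SelfAttentivePooling(feat_dim=1024)
    frames = torch.randn(2, 400, 1024)                 # two videos, 400 frames each
    video_repr, attn = pool(frames)
    print(video_repr.shape, attn.shape)                # (2, 1024) and (2, 400)

    rel = semantic_relevance(torch.randn(101, 300), torch.randn(20, 300))
    print(rel.shape)                                   # (101, 20)
```

In a twin-network setup of the kind the abstract describes, one plausible arrangement is for one branch to consume trimmed videos and the other untrimmed videos, with shared attention parameters and the relevance matrix weighting how class-level knowledge flows between the two branches; the attention weights over time then double as a localization cue.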
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61871378, U2003111, 62122013 and U2001211).
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Author information
Xiao-Yu Zhang received the B.Sc. degree in computer science from Nanjing University of Science and Technology, China in 2005, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2010. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, China. He has authored or coauthored more than 60 refereed publications in international journals and conferences. He is a Senior Member of the ACM, CCF, and CSIG. His awards and honors include the Silver Prize of the Microsoft Cup IEEE China Student Paper Contest in 2009, the Second Prize of the Wu Wen-Jun AI Science & Technology Innovation Award in 2016, the CCCV Best Paper Nomination Award in 2017, the Third Prize of the BAST Beijing Excellent S&T Paper Award in 2018, and the Second Prize of the CSIG Science & Technology Award in 2019.
His research interests include artificial intelligence, data mining, and computer vision.
Hai-Chao Shi received the B.Sc. degree in software engineering from Beijing Technology and Business University, China in 2017. He is currently a Ph.D. candidate in cyberspace security with the National Engineering Laboratory of Information Content Security Technology, Institute of Information Engineering, Chinese Academy of Sciences, China.
His research interests include temporal action localization, action recognition, and pattern recognition.
Chang-Sheng Li received the B.Eng. degree in electronic engineering from the University of Electronic Science and Technology of China (UESTC), China in 2008, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China in 2013. He was a research assistant with The Hong Kong Polytechnic University, China from 2009 to 2010, and subsequently worked with IBM Research-China, Alibaba Group, and UESTC. He is currently a professor with Beijing Institute of Technology, China. He has authored or coauthored more than 40 refereed publications in international journals and conferences.
His research interests include machine learning, data mining, and computer vision.
Li-Xin Duan received the B.Eng. degree in electronic and information science from the University of Science and Technology of China (USTC), China in 2008, and the Ph.D. degree in computer engineering from Nanyang Technological University (NTU), Singapore in 2012. He is currently a full professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, and was selected into "The Thousand Talents Plan for Young Professionals" by the Organization Department of the Communist Party of China in 2017. Prior to that, he worked as a research scientist at Amazon's Seattle headquarters in the United States and at the Institute for Infocomm Research (I2R) in Singapore. He has published more than 35 papers in international journals and conferences, which have received over 2200 citations with an h-index of 18, according to Google Scholar. Among those, 9 papers (7 of them first-authored) have each received more than 100 citations, and 3 have been selected as Highly Cited Papers by Essential Science Indicators (ESI). He received the MSRA Fellowship Award in 2009, the Best Student Paper Award at CVPR 2010, and the Outstanding Reviewer Award at CVPR 2012. For academic services, he served as Journal Track Chair at IJCAI 2015, Area Chair at ICPR 2016, and Senior Program Committee Member at IJCAI 2017. He also organized and hosted a workshop on Practical Transfer Learning at ICDM 2015, and has been serving as a Program Committee Member at various international conferences.
His research interests include machine learning algorithms (especially transfer learning and domain adaptation) and their applications in object recognition, detection and segmentation, video event recognition, and ocular image analysis.
Cite this article
Zhang, XY., Shi, HC., Li, CS. et al. TwinNet: Twin Structured Knowledge Transfer Network for Weakly Supervised Action Localization. Mach. Intell. Res. 19, 227–246 (2022). https://doi.org/10.1007/s11633-022-1333-4