Abstract
Deep convolutional networks have become ubiquitous in computer vision owing to their success in visual recognition tasks on still images. However, their adaptations to video classification have not clearly established their superiority over conventional hand-crafted features. Existing CNN methods for action recognition typically train multiple streams to deal with spatial and temporal information independently and then combine their prediction scores. Relatively little is known, however, about the benefits of combining these modalities during training. In this work, we propose a novel semi-supervised learning approach in which multiple streams supervise each other in a co-training strategy, making training simultaneous across the two modalities. We show that transferring information between the networks by predicting labels on an unlabeled set outperforms state-of-the-art methods. Furthermore, our approach achieves performance comparable to existing methods while using less data. We demonstrate its effectiveness through extensive experiments on the UCF-101 and HMDB datasets.
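The co-training strategy described above can be sketched in a few lines: two view-specific classifiers are trained on a small labeled pool, and in each round every classifier pseudo-labels its most confident unlabeled samples *for the other view*. This is a minimal NumPy illustration in the spirit of Blum and Mitchell's co-training, not the paper's implementation: the `CentroidClassifier`, the function name `co_train`, and the confidence-based selection rule (`k` most confident per round) are illustrative assumptions standing in for the two-stream CNNs and their softmax scores.

```python
import numpy as np


class CentroidClassifier:
    """Toy stand-in for one stream (e.g. the spatial or temporal network)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # softmax over negative distances to each class centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        p = np.exp(-d)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


def co_train(view1, view2, y, labeled, unlabeled, rounds=3, k=2):
    """Each view labels its k most confident unlabeled samples
    for the *other* view, then both classifiers are refit."""
    L1, L2 = set(labeled), set(labeled)   # labeled pools, one per view
    y1, y2 = y.copy(), y.copy()           # working (pseudo-)labels per view
    U = list(unlabeled)
    for _ in range(rounds):
        c1 = CentroidClassifier().fit(view1[sorted(L1)], y1[sorted(L1)])
        c2 = CentroidClassifier().fit(view2[sorted(L2)], y2[sorted(L2)])
        if not U:
            break
        # each classifier teaches the other view
        for clf, view, y_other, L_other in ((c1, view1, y2, L2), (c2, view2, y1, L1)):
            if not U:
                break
            proba = clf.predict_proba(view[U])
            picks = np.argsort(-proba.max(axis=1))[:k]   # most confident samples
            for i in picks:
                y_other[U[i]] = clf.classes_[proba[i].argmax()]
                L_other.add(U[i])
            U = [u for j, u in enumerate(U) if j not in set(picks)]
    return c1, c2
```

In the paper's setting, the two "views" would be the RGB frames and the optical-flow stacks, the classifiers would be the two CNN streams, and confidence would come from their softmax outputs.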
Notes
1. Since we consider the spatial and temporal aspects as two views of the data, we use the terms streams and views interchangeably.
Cite this paper
Zhang, L., Varadarajan, J., Pei, Y. (2020). Action Recognition Using Co-trained Deep Convolutional Neural Networks. In: El Fallah Seghrouchni, A., Sarne, D. (eds.) Artificial Intelligence. IJCAI 2019 International Workshops. Lecture Notes in Computer Science, vol. 12158. Springer, Cham. https://doi.org/10.1007/978-3-030-56150-5_8