Abstract
Human activity recognition is an important problem in computer vision that remains largely unsolved. Although recent advances such as deep learning have produced strong results on image-related tasks, recognizing behavior in videos is still difficult because videos contain a great deal of disturbance. We propose an architecture, DT-3DResNet-LSTM, to classify and temporally localize activities in videos. We first detect objects in video frames and feed the detections to an object tracking model, which yields data association information for multiple objects across adjacent frames. The cropped video frames of each tracked object are then passed to a 3D Convolutional Neural Network (CNN) to extract features, and a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory (LSTM) network, is trained to classify the video clips. Finally, we post-process the output of the LSTM to obtain the final classification of the input video and to determine the temporal localization of activities within it.
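The abstract does not say how the per-clip LSTM outputs are turned into a video-level label and temporal boundaries. As an illustration only, one simple possibility is to merge runs of consecutive clips that share a predicted label into segments and to take a majority vote for the video-level class. The sketch below assumes fixed-length, non-overlapping clips and a hypothetical "background" label for clips with no activity; the function names, the label set, and the clip length are all assumptions and are not taken from the paper.

```python
from collections import Counter
from itertools import groupby


def localize_activities(clip_labels, clip_len, background="background"):
    """Merge runs of identical per-clip labels into temporal segments.

    Returns a list of (label, start_frame, end_frame) tuples, where
    end_frame is exclusive. Clips are assumed to be consecutive,
    non-overlapping, and clip_len frames long (an assumption).
    """
    segments = []
    index = 0  # index of the first clip in the current run
    for label, run in groupby(clip_labels):
        run_length = len(list(run))
        if label != background:
            start = index * clip_len
            end = (index + run_length) * clip_len
            segments.append((label, start, end))
        index += run_length
    return segments


def classify_video(clip_labels, background="background"):
    """Pick the most frequent non-background clip label (majority vote)."""
    votes = Counter(label for label in clip_labels if label != background)
    if not votes:
        return background
    return votes.most_common(1)[0][0]


# Hypothetical per-clip predictions for one tracked object, 16 frames per clip.
predictions = ["background", "walking", "walking", "background", "running"]
segments = localize_activities(predictions, clip_len=16)
video_label = classify_video(predictions)
```

With the hypothetical predictions above, the two "walking" clips merge into a single segment covering frames 16 to 48, and "walking" wins the vote. A real system would likely smooth the per-clip scores before merging, which this sketch omits.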
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Yao, L., Qian, Y. (2018). DT-3DResNet-LSTM: An Architecture for Temporal Activity Recognition in Videos. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_57
DOI: https://doi.org/10.1007/978-3-030-00776-8_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00775-1
Online ISBN: 978-3-030-00776-8
eBook Packages: Computer Science, Computer Science (R0)