Abstract
Long-term temporal representation methods demand high computational cost, restricting their practical use in real-world applications. We propose a two-step deep residual method that efficiently learns long-term discriminative temporal representations while significantly reducing computational cost. In the first step, a novel self-supervised deep temporal embedding method embeds repetitive short-term motions in a cluster-friendly feature space. In the second step, an efficient temporal representation is built by leveraging the differences between the original data and its associated repetitive-motion clusters as a novel deep residual method. Experimental results demonstrate that the proposed method achieves competitive results on challenging human action recognition datasets such as UCF101, HMDB51, THUMOS14, and Kinetics-400.
References
Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. In: Advances in neural information processing systems
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Wang L et al (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision. Springer
Tran D et al (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision
Donahue J et al (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Chen G et al (2015) Combining unsupervised learning and discrimination for 3D action recognition. Signal Process 110:67–81
Längkvist M, Karlsson L, Loutfi A (2014) A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognit Lett 42:11–24
Dosovitskiy A et al (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans Pattern Anal Mach Intell 38(9):1734–1747
Kallenberg M et al (2016) Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging 35(5):1322–1331
Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning
Madiraju NS et al (2018) Deep temporal clustering: fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059
Ouyang Y et al (2014) Autoencoder-based collaborative filtering. In: International conference on neural information processing. Springer
Ng A (2011) Sparse autoencoder. CS294A Lecture notes, vol 72. pp 1–19
Mikolov T et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308
Yu J et al (2017) Deep multimodal distance metric learning using click constraints for image ranking. IEEE Trans Cybern 47(12):4014–4024
Yu J et al (2017) iPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Trans Inf Forensics Secur 12(5):1005–1016
Yu J et al (2019) Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/TCSVT.2019.2947482
Hong C et al (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Ind Inf 15(7):3952–3961
Hong C et al (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670
Bengio Y (2013) Deep learning of representations: looking forward, in statistical language and speech processing. In: International conference on statistical language and speech processing. Springer, Berlin, pp 1–37
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Diba A, Sharma V, Van Gool L (2017) Deep temporal linear encoding networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2329–2338
Zhu J, Zhu Z, Zou W (2018) End-to-end video-level representation learning for action recognition. In: 24th international conference on pattern recognition (ICPR), IEEE, pp 645–650
Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European conference on computer vision. Springer, Berlin, pp 143–156
Lan Z et al (2017) Deep local video feature for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 1–7
Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning, pp 843–852
Yue-Hei Ng J et al (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition pp 4694–4702
Girdhar R et al (2017) Actionvlad: learning spatio-temporal aggregation for action classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 971–980
Wang L et al (2016) Temporal segment networks: towards good practices for deep action recognition. Springer, Berlin
Wang L et al (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755
Varol G, Laptev I, Schmid C (2017) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517
Bilen H et al (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3034–3042
Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308
Diba A et al (2018) Spatio-temporal channel correlation networks for action classification. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision – ECCV 2018, ECCV 2018. Lecture Notes in Computer Science, vol 11208. Springer, Cham
Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. In: Proceedings of the 30th international conference on neural information processing systems, pp 3476–3484
Karpathy A et al (2014) Large-scale video classification with convolutional neural networks. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE
Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning
Lev G et al (2016) RNN fisher vectors for action recognition and image annotation. In: European Conference on Computer Vision. Springer, Cham, pp 833–850
Koohzadi M, Charkari NM (2017) Survey on deep learning methods in human action recognition. IET Comput Vis 11(8):623–632
Yu J et al (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2019.2932058
Raina R et al (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th international conference on machine learning. ACM, New York
Wang X, Gupta A (2015) Unsupervised learning of visual representations using videos. In: Proceedings of the IEEE international conference on computer vision, pp 2794–2802
Misra I, Zitnick CL, Hebert M (2016) Unsupervised learning using sequential verification for action recognition. arXiv preprint arXiv:1603.08561
Brattoli B et al (2017) Lstm self-supervision for detailed behavior analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6466–6475
Buchler U, Brattoli B, Ommer B (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In: Proceedings of the European conference on computer vision (ECCV), pp 770–786
Lee H-Y et al (2017) Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE international conference on computer vision, pp 667–676
Fernando B et al (2017) Self-supervised video representation learning with odd-one-out networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI. https://doi.org/10.1109/CVPR.2017.607
Zhuang C, Andonian A, Yamins D (2019) Unsupervised learning from video with deep neural embeddings. arXiv preprint arXiv:1905.11954
Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. In: Advances in neural information processing systems
Sayed N, Brattoli B, Ommer B (2018) CROSS and learn: cross-modal self-supervision. Springer, Berlin
Misra I, Zitnick CL, Hebert M (2016) Shuffle and learn: unsupervised learning using temporal order verification. In: European conference on computer vision. Springer, Cham, pp 527–544
Luo Z et al (2017) Unsupervised learning of long-term motion dynamics for videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Chen Y et al (2015) The UCR time series classification archive
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227
Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11(11):2837–2854
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874
Abadi M et al (2016) Tensorflow: a system for large-scale machine learning. In: OSDI
Lee I et al (2017) Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks
Paparrizos J, Gravano L (2015) k-shape: Efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM
Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE conference on computer vision and pattern recognition
Xin M et al (2016) ARCH: Adaptive recurrent-convolutional hybrid networks for long-term action recognition. Neurocomputing 178:87–102
Li Z et al (2018) VideoLSTM convolves, attends and flows for action recognition. Comput Vis Image Underst 166:41–50
Xiong Y (2017) TSN pretrained models on kinetics dataset. Available from: http://yjxiong.me/others/kinetics_action/#transfer
Crasto N et al (2019) MARS: motion-augmented RGB stream for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Piergiovanni A et al (2019) Evolving space-time neural architectures for videos. In: Proceedings of the IEEE international conference on computer vision
Feichtenhofer C et al (2019) Slowfast networks for video recognition. In: Proceedings of the IEEE international conference on computer vision
Du W, Wang Y, Qiao Y (2017) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(3):1347–1360
Li J et al (2020) Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognit 98:107037
Dai C, Liu X, Lai J (2019) Human action recognition using two-stream attention based LSTM networks. Appl Soft Comput 86:105820
Ge H et al (2019) An attention mechanism based convolutional LSTM network for video action recognition. Multimedia Tools Appl 78(14):20533–20556
Meng L et al (2019) Interpretable spatio-temporal attention for video action recognition. In: Proceedings of the IEEE international conference on computer vision workshops
Quan Y et al (2019) Attention with structure regularization for action recognition. Comput Vis Image Underst 187:102794
Sang H, Zhao Z, He D (2019) Two-level attention model based video action recognition network. IEEE Access 7:118388–118401
Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv preprint arXiv:1511.04119
Peng Y, Zhao Y, Zhang J (2018) Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Trans Circuits Syst Video Technol 29(3):773–786
Li D et al (2018) Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans Multimedia 21(2):416–428
Zhang H et al (2018) End-to-end temporal attention extraction and human action recognition. Mach Vis Appl 29(7):1127–1142
Pang B et al (2019) Deep RNN framework for visual sequential applications. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Wang Y et al (2016) Hierarchical attention network for action recognition in videos. arXiv preprint arXiv:1607.06416
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision
Jain M, van Gemert J, Snoek CGM (2014) University of amsterdam at thumos challenge 2014. ECCV THUMOS Challenge
Zhang B et al (2016) Real-time action recognition with enhanced motion vector CNNs. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Jain M, Van Gemert JC, Snoek CGM (2015) What do 15,000 object categories tell us about classifying and localizing actions? In: Proceedings of the IEEE conference on computer vision and pattern recognition
Cho K et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Zhuang N et al (2018) Deep differential recurrent neural networks. arXiv preprint arXiv:1804.04192
Virdi J (2017) Using deep learning to predict obstacle trajectories for collision avoidance in autonomous vehicles. UC San Diego
Appendices
Appendix 1
In this section, we describe some basic concepts, including the LSTM network, the encoder–decoder LSTM network, and the deep embedding clustering network. Afterward, the proposed model is introduced.
LSTM Network
An RNN models the temporal dynamics of an input sequence by mapping the sequence to hidden states, and the hidden states to outputs. This capability results from its feedback connections and internal memory. However, RNNs perform poorly at learning long-term temporal dynamics because they cannot cope with the vanishing gradient problem. LSTM overcomes this problem by including recurrent forget gates in its architecture. Forget gates are memory units that allow the LSTM to learn which time steps of the previous hidden states must be considered when updating the network. As a result, LSTMs have proven successful in very deep learning models, since they can retain important events over long time intervals. Recent works have proposed many modified LSTM models, such as peephole LSTM, peephole convolutional LSTM, and the gated recurrent unit (GRU) [87]. In this paper, we use the standard LSTM structure in our model. The functionality of an LSTM unit is described by the following recurrence equations:

\[ \begin{aligned} i_{l} & = \sigma \left( W_{xi} x_{l} + W_{hi} h_{l - 1} + b_{i} \right) \\ f_{l} & = \sigma \left( W_{xf} x_{l} + W_{hf} h_{l - 1} + b_{f} \right) \\ o_{l} & = \sigma \left( W_{xo} x_{l} + W_{ho} h_{l - 1} + b_{o} \right) \\ g_{l} & = \Phi \left( W_{xg} x_{l} + W_{hg} h_{l - 1} + b_{g} \right) \\ c_{l} & = f_{l} \odot c_{l - 1} + i_{l} \odot g_{l} \\ h_{l} & = o_{l} \odot \Phi \left( c_{l} \right) \end{aligned} \]

where \( \sigma \) is the sigmoid non-linearity, \( \Phi \) is the hyperbolic tangent non-linearity, \( \odot \) denotes the element-wise product with the gate value, \( x_{l} \) is the input, \( h_{l} \) is the hidden state, \( o_{l} \) is the output gate at time step \( l \), and the weight matrices denoted by \( W_{ij} \) are the trained parameters. These hidden states constitute a representation of the input sequence learned over time. Conventional LSTM, however, fails to take into account the impact of salient temporal dynamics present in sequential input data [88]. An attention mechanism helps the model select the informative parts of the sequence.
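These recurrences can be sketched in a few lines of NumPy. Stacking the four gate weight matrices into a single matrix `W` is an implementation convenience assumed here, not part of the formulation above, and the random weights are merely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    W maps the concatenated [x, h_prev] to the four gate
    pre-activations (input, forget, output, candidate).
    """
    z = W @ np.concatenate([x, h_prev]) + b
    d = h_prev.size
    i = sigmoid(z[0 * d:1 * d])   # input gate  i_l
    f = sigmoid(z[1 * d:2 * d])   # forget gate f_l
    o = sigmoid(z[2 * d:3 * d])   # output gate o_l
    g = np.tanh(z[3 * d:4 * d])   # candidate   g_l
    c = f * c_prev + i * g        # new cell state   c_l
    h = o * np.tanh(c)            # new hidden state h_l
    return h, c

# usage: input dimension 3, hidden dimension 4, untrained random weights
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
h0, c0 = np.zeros(4), np.zeros(4)
W = rng.standard_normal((16, 7)) * 0.1
b = np.zeros(16)
h1, c1 = lstm_step(x, h0, c0, W, b)
```

Because the hidden state is gated through a tanh, every component of `h1` stays strictly inside \( (-1, 1) \).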
Encoder–Decoder LSTM Network
The encoder–decoder LSTM was initially introduced for natural language processing problems, where it demonstrated state-of-the-art performance [89]. The unsupervised representation learned by an encoder–decoder LSTM naturally facilitates the learning of temporal representations in our proposed method. An encoder–decoder LSTM is a two-layer RNN that acts as an encoder and decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps this vector representation back to a variable-length target sequence [87]. The encoder–decoder LSTM is designed specifically for sequence-to-sequence (seq2seq) problems, defined by the conditional probability

\[ p\left( y_{1} , \ldots ,y_{T} \mid x_{1} , \ldots ,x_{L} \right) \]

where \( y_{1} , \ldots ,y_{T} \) is the output sequence, \( x_{1} , \ldots ,x_{L} \) is the input sequence, and \( c \) is the context vector (the hidden vector of the last LSTM unit) that summarizes the input sequence. The goal of the encoder–decoder LSTM is to build a neural network that models and maximizes this conditional probability. The encoder part maps the input sequence to the context vector as:

\[ h_{l} = f\left( x_{l} ,h_{l - 1} \right),\quad l = 1, \ldots ,L,\qquad c = h_{L} \]

The decoder part predicts and reconstructs the next data of the input sequence as the output sequence:

\[ p\left( y_{1} , \ldots ,y_{T} \mid x_{1} , \ldots ,x_{L} \right) = \prod\limits_{t = 1}^{T} p\left( y_{t} \mid y_{1} , \ldots ,y_{t - 1} ,c \right) \]
After greedy layer-wise training, all encoder layers are stacked, followed by all decoder layers, to form a deep network, which is then fine-tuned to minimize the prediction loss.
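A minimal NumPy sketch of the encoder–decoder pair follows. The zero start symbol, the linear read-out matrix `W_out`, and the untrained random weights are illustrative assumptions, not details from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (W maps [x, h_prev] to 4 gates)."""
    z = W @ np.concatenate([x, h_prev]) + b
    d = h_prev.size
    i, f, o = (sigmoid(z[k * d:(k + 1) * d]) for k in range(3))
    g = np.tanh(z[3 * d:4 * d])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def encode(xs, W, b, d):
    """Map a variable-length input sequence to the fixed context c = h_L."""
    h, c = np.zeros(d), np.zeros(d)
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
    return h

def decode(context, W, b, W_out, T):
    """Unroll the decoder T steps, feeding each prediction back as input."""
    h, c = context, np.zeros(context.size)
    y = np.zeros(W_out.shape[0])   # start symbol: zero vector (assumption)
    ys = []
    for _ in range(T):
        h, c = lstm_step(y, h, c, W, b)
        y = W_out @ h              # linear read-out of the hidden state
        ys.append(y)
    return np.stack(ys)

# usage: 3-dim inputs/outputs, 4-dim hidden state, input length L = 6
rng = np.random.default_rng(1)
d, x_dim, T = 4, 3, 5
W_enc = rng.standard_normal((4 * d, x_dim + d)) * 0.1
W_dec = rng.standard_normal((4 * d, x_dim + d)) * 0.1
b = np.zeros(4 * d)
W_out = rng.standard_normal((x_dim, d)) * 0.1
xs = rng.standard_normal((6, x_dim))
ctx = encode(xs, W_enc, b, d)      # fixed-length context vector
ys = decode(ctx, W_dec, b, W_out, T)  # variable-length target sequence
```

Note that the input length (6) and output length (5) differ, which is exactly the variable-length-to-variable-length mapping the seq2seq formulation allows.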
Deep Embedding Clustering Network
Deep embedding clustering [10] jointly optimizes feature transformation and clustering. The algorithm pre-trains an autoencoder as an initial estimate of the data representation and then removes the decoder. The remaining encoder is fine-tuned with an effective clustering loss in a self-training manner using high-confidence samples, updating the parameters of the transformation network and the cluster centers simultaneously. Let \( z_{i} = f_{\theta } \left( {x_{i} } \right) \) be the mapping of input \( x_{i} \) by the pre-trained encoder. Using \( f_{\theta } \) we obtain all embedded points \( \left\{ {z_{i} } \right\} \). Then, k-means is applied to \( \left\{ {z_{i} } \right\} \) to obtain the initial cluster centers \( \left\{ {\mu_{j} } \right\} \). Afterward, the objective function is defined as follows:

\[ L = KL\left( {P\|Q} \right) = \sum\limits_{i} \sum\limits_{j} p_{ij} \log \frac{p_{ij} }{q_{ij} } \]

\( q_{ij} \) denotes the similarity between embedded point \( z_{i} \) and cluster center \( \mu_{j} \), measured by Student's t-distribution:

\[ q_{ij} = \frac{\left( {1 + \left\| {z_{i} - \mu_{j} } \right\|^{2} } \right)^{ - 1} }{\sum\nolimits_{j'} {\left( {1 + \left\| {z_{i} - \mu_{j'} } \right\|^{2} } \right)^{ - 1} } } \]

\( p_{ij} \) is the target distribution, which uses high-confidence samples as supervision to make the clusters denser:

\[ p_{ij} = \frac{q_{ij}^{2} /f_{j} }{\sum\nolimits_{j'} {q_{ij'}^{2} /f_{j'} } },\quad f_{j} = \sum\limits_{i} q_{ij} \]

Since \( p_{ij} \) is defined in terms of \( q_{ij} \), minimizing the loss function \( L \) becomes a form of self-training. The cluster assignment of sample \( x_{i} \) is \( argmax_{j} \,q_{ij} \).
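The soft assignment, target distribution, and KL objective can be sketched directly in NumPy; the random embeddings and centers below merely stand in for the encoder outputs and the k-means initialization:

```python
import numpy as np

def soft_assign(Z, mu):
    """q_ij: Student's t similarity between points z_i and centers mu_j."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_dist(q):
    """p_ij: sharpen q by squaring and normalizing by cluster frequency f_j."""
    w = q ** 2 / q.sum(axis=0)          # q_ij^2 / f_j
    return w / w.sum(axis=1, keepdims=True)

def dec_loss(p, q):
    """KL(P || Q), the self-training clustering objective L."""
    return float((p * np.log(p / q)).sum())

# usage: 10 embedded points in R^2 and 3 cluster centers (stand-ins)
rng = np.random.default_rng(2)
Z = rng.standard_normal((10, 2))
mu = rng.standard_normal((3, 2))
q = soft_assign(Z, mu)
p = target_dist(q)
loss = dec_loss(p, q)
labels = q.argmax(axis=1)   # cluster assignment argmax_j q_ij
```

In the full method, the gradient of this loss would be back-propagated through the encoder \( f_{\theta } \) while `p` is periodically recomputed; here only the forward computations are shown.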
Koohzadi, M., Charkari, N.M. A Context Based Deep Temporal Embedding Network in Action Recognition. Neural Process Lett 52, 187–220 (2020). https://doi.org/10.1007/s11063-020-10248-1