
A Context Based Deep Temporal Embedding Network in Action Recognition

Neural Processing Letters

Abstract

Long-term temporal representation methods demand high computational cost, restricting their practical use in real-world applications. We propose a two-step deep residual method that efficiently learns long-term discriminative temporal representations while significantly reducing computational cost. In the first step, a novel self-supervised deep temporal embedding method embeds repetitive short-term motions into a cluster-friendly feature space. In the second step, an efficient temporal representation is built by leveraging the differences between the original data and its associated repetitive-motion clusters, forming a novel deep residual method. Experimental results demonstrate that the proposed method achieves competitive results on challenging human action recognition datasets such as UCF101, HMDB51, THUMOS14, and Kinetics-400.

References

  1. Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. In: Advances in neural information processing systems

  2. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576

  3. Wang L et al (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision. Springer

  4. Tran D et al (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision

  5. Donahue J et al (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  6. Chen G et al (2015) Combining unsupervised learning and discrimination for 3D action recognition. Signal Process 110:67–81

  7. Längkvist M, Karlsson L, Loutfi A (2014) A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognit Lett 42:11–24

  8. Dosovitskiy A et al (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans Pattern Anal Mach Intell 38(9):1734–1747

  9. Kallenberg M et al (2016) Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging 35(5):1322–1331

  10. Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning

  11. Madiraju NS et al (2018) Deep temporal clustering: fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059

  12. Ouyang Y et al (2014) Autoencoder-based collaborative filtering. In: International conference on neural information processing. Springer

  13. Ng A (2011) Sparse autoencoder. CS294A Lecture notes, vol 72. pp 1–19

  14. Mikolov T et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  15. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset

  16. Yu J et al (2017) Deep multimodal distance metric learning using click constraints for image ranking. IEEE Trans Cybern 47(12):4014–4024

  17. Yu J et al (2017) iPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Trans Inf Forensics Secur 12(5):1005–1016

  18. Yu J et al (2019) Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/TCSVT.2019.2947482

  19. Hong C et al (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Ind Inf 15(7):3952–3961

  20. Hong C et al (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670

  21. Bengio Y (2013) Deep learning of representations: looking forward. In: International conference on statistical language and speech processing. Springer, Berlin, pp 1–37

  22. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  23. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576

  24. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition

  25. Diba A, Sharma V, Van Gool L (2017) Deep temporal linear encoding networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2329–2338

  26. Zhu J, Zhu Z, Zou W (2018) End-to-end video-level representation learning for action recognition. In: 24th international conference on pattern recognition (ICPR), IEEE, pp 645–650

  27. Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European conference on computer vision. Springer, Berlin, pp 143–156

  28. Lan Z et al (2017) Deep local video feature for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 1–7

  29. Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning, pp 843–852

  30. Yue-Hei Ng J et al (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition pp 4694–4702

  31. Girdhar R et al (2017) Actionvlad: learning spatio-temporal aggregation for action classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 971–980

  32. Wang L et al (2016) Temporal segment networks: towards good practices for deep action recognition. Springer, Berlin

  33. Wang L et al (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755

  34. Varol G, Laptev I, Schmid C (2017) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517

  35. Bilen H et al (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3034–3042

  36. Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308

  37. Diba A et al (2018) Spatio-temporal channel correlation networks for action classification. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision – ECCV 2018, ECCV 2018. Lecture Notes in Computer Science, vol 11208. Springer, Cham

  38. Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. In: Proceedings of the 30th international conference on neural information processing systems, pp 3476–3484

  39. Karpathy A et al (2014) Large-scale video classification with convolutional neural networks. In: 2014 IEEE conference on computer vision and pattern recognition (CVPR). IEEE

  40. Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning

  41. Lev G et al (2016) RNN fisher vectors for action recognition and image annotation. In: European Conference on Computer Vision. Springer, Cham, pp 833–850

  42. Koohzadi M, Charkari NM (2017) Survey on deep learning methods in human action recognition. IET Comput Vis 11(8):623–632

  43. Yu J et al (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2019.2932058

  44. Raina R et al (2007) Self-taught learning: transfer learning from unlabeled data. ACM, New York

  45. Wang X, Gupta A (2015) Unsupervised learning of visual representations using videos. In: Proceedings of the IEEE international conference on computer vision, pp 2794–2802

  46. Misra I, Zitnick CL, Hebert M (2016) Unsupervised learning using sequential verification for action recognition. arXiv preprint arXiv:1603.08561

  47. Brattoli B et al (2017) Lstm self-supervision for detailed behavior analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6466–6475

  48. Buchler U, Brattoli B, Ommer B (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In: Proceedings of the European conference on computer vision (ECCV), pp 770–786

  49. Lee H-Y et al (2017) Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE international conference on computer vision, pp 667–676

  50. Fernando B et al (2017) Self-supervised video representation learning with odd-one-out networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI. https://doi.org/10.1109/CVPR.2017.607

  51. Zhuang C, Andonian A, Yamins D (2019) Unsupervised learning from video with deep neural embeddings. arXiv preprint arXiv:1905.11954

  52. Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics

  53. Sayed N, Brattoli B, Ommer B (2018) CROSS and learn: cross-modal self-supervision. Springer, Berlin

  54. Misra I, Zitnick CL, Hebert M (2016) Shuffle and learn: unsupervised learning using temporal order verification. In: European conference on computer vision. Springer, Cham, pp 527–544

  55. Luo Z et al (2017) Unsupervised learning of long-term motion dynamics for videos

  56. Chen Y et al (2015) The UCR time series classification archive

  57. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227

  58. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11(11):2837–2854

  59. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874

  60. Abadi M et al (2016) Tensorflow: a system for large-scale machine learning. In: OSDI

  61. Lee I et al (2017) Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks

  62. Paparrizos J, Gravano L (2015) k-shape: Efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM

  63. Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3d CNNs retrace the history of 2d CNNs and imagenet? In: Proceedings of the IEEE conference on computer vision and pattern recognition

  64. Xin M et al (2016) ARCH: Adaptive recurrent-convolutional hybrid networks for long-term action recognition. Neurocomputing 178:87–102

  65. Li Z et al (2018) VideoLSTM convolves, attends and flows for action recognition. Comput Vis Image Underst 166:41–50

  66. Xiong Y (2017) TSN pretrained models on kinetics dataset. Available from: http://yjxiong.me/others/kinetics_action/#transfer

  67. Crasto N et al (2019) MARS: motion-augmented RGB stream for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  68. Piergiovanni A et al (2019) Evolving space-time neural architectures for videos. In: Proceedings of the IEEE international conference on computer vision

  69. Feichtenhofer C et al (2019) Slowfast networks for video recognition. In: Proceedings of the IEEE international conference on computer vision

  70. Du W, Wang Y, Qiao Y (2017) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(3):1347–1360

  71. Li J et al (2020) Spatio-temporal deformable 3D ConvNets with attention for action recognition. Pattern Recognit 98:107037

  72. Dai C, Liu X, Lai J (2019) Human action recognition using two-stream attention based LSTM networks. Appl Soft Comput 86:105820

  73. Ge H et al (2019) An attention mechanism based convolutional LSTM network for video action recognition. Multimedia Tools Appl 78(14):20533–20556

  74. Meng L et al (2019) Interpretable spatio-temporal attention for video action recognition. In: Proceedings of the IEEE international conference on computer vision workshops

  75. Quan Y et al (2019) Attention with structure regularization for action recognition. Comput Vis Image Underst 187:102794

  76. Sang H, Zhao Z, He D (2019) Two-level attention model based video action recognition network. IEEE Access 7:118388–118401

  77. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv preprint arXiv:1511.04119

  78. Peng Y, Zhao Y, Zhang J (2018) Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Trans Circuits Syst Video Technol 29(3):773–786

  79. Li D et al (2018) Unified spatio-temporal attention networks for action recognition in videos. IEEE Trans Multimedia 21(2):416–428

  80. Zhang H et al (2018) End-to-end temporal attention extraction and human action recognition. Mach Vis Appl 29(7):1127–1142

  81. Pang B et al (2019) Deep RNN framework for visual sequential applications. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  82. Wang Y et al (2016) Hierarchical attention network for action recognition in videos. arXiv preprint arXiv:1607.06416

  83. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision

  84. Jain M, van Gemert J, Snoek CGM (2014) University of amsterdam at thumos challenge 2014. ECCV THUMOS Challenge

  85. Zhang B et al (2016) Real-time action recognition with enhanced motion vector CNNs. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  86. Jain M, Van Gemert JC, Snoek CGM (2015) What do 15,000 object categories tell us about classifying and localizing actions? In: Proceedings of the IEEE conference on computer vision and pattern recognition

  87. Cho K et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

  88. Zhuang N et al (2018) Deep differential recurrent neural networks. arXiv preprint arXiv:1804.04192

  89. Virdi J (2017) Using deep learning to predict obstacle trajectories for collision avoidance in autonomous vehicles. UC San Diego

Author information

Corresponding author

Correspondence to Nasrollah Moghadam Charkari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

In this appendix, we review some basic concepts on which the proposed model is built, including the LSTM network, the encoder–decoder LSTM network, and the deep embedding clustering network.

LSTM Network

An RNN maps the temporal dynamics of an input sequence to a sequence of hidden states, and then maps the hidden states to outputs. This capability results from its feedback connections and internal memory. However, RNNs perform poorly at learning long-term temporal dynamics because they suffer from the vanishing-gradient problem. The LSTM overcomes this problem by incorporating recurrent forget gates into its architecture. Forget gates are memory units that allow the LSTM to learn which previous hidden states should be taken into account when updating the network. As a result, LSTMs have proven successful in very deep models, since they can retain important events over long time intervals. Recent works have proposed many modified LSTM variants, such as the peephole LSTM, the peephole convolutional LSTM, and the gated recurrent unit (GRU) [87]. In this paper, we use the standard LSTM structure in our model. The functionality of an LSTM unit is described by the following recurrence equations:

$$ \begin{aligned} i_{l} & = \sigma \left( {W_{xi} x_{l} + W_{hi} h_{l - 1} } \right) \\ f_{l} & = \sigma \left( {W_{xf} x_{l} + W_{hf} h_{l - 1} } \right) \\ o_{l} & = \sigma \left( {W_{xo} x_{l} + W_{ho} h_{l - 1} } \right) \\ c_{l} & = f_{l} \odot c_{l - 1} + i_{l} \odot\Phi \left( {W_{xc} x_{l} + W_{hc} h_{l - 1} } \right) \\ h_{l} & = o_{l} \odot\Phi \left( {c_{l} } \right) \\ \end{aligned} $$
(14)

where \( \sigma \) is the sigmoid non-linearity, \( \Phi \) is the hyperbolic-tangent non-linearity, \( \odot \) denotes element-wise multiplication with the gate value, \( x_{l} \) is the input, \( h_{l} \) is the hidden state, and \( o_{l} \) is the output gate at time step l; the weight matrices \( W_{ij} \) are the learned parameters. The hidden states constitute a representation of the input sequence learned over time. Conventional LSTM, however, fails to take into account the impact of salient temporal dynamics present in sequential input data [88]. An attention mechanism helps the model select the informative time steps.
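
As an illustration of Eq. (14), the following minimal NumPy sketch performs a single LSTM step and unrolls it over a toy sequence. The weight names mirror Eq. (14) and, like the equation, omit bias terms; all dimensions and the random data are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_l, h_prev, c_prev, W):
    """One LSTM recurrence step following Eq. (14).

    W is a dict of weight matrices {"xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc"};
    biases are omitted, matching the bias-free form of Eq. (14).
    """
    i = sigmoid(W["xi"] @ x_l + W["hi"] @ h_prev)                     # input gate
    f = sigmoid(W["xf"] @ x_l + W["hf"] @ h_prev)                     # forget gate
    o = sigmoid(W["xo"] @ x_l + W["ho"] @ h_prev)                     # output gate
    c = f * c_prev + i * np.tanh(W["xc"] @ x_l + W["hc"] @ h_prev)    # cell state
    h = o * np.tanh(c)                                                # hidden state
    return h, c

# Toy usage: unroll over a short sequence of 8 frame features of dimension 16.
rng = np.random.default_rng(0)
d_in, d_hid, T = 16, 32, 8
W = {k: rng.normal(scale=0.1, size=(d_hid, d_in if k.startswith("x") else d_hid))
     for k in ["xi", "hi", "xf", "hf", "xo", "ho", "xc", "hc"]}
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(T, d_in)):
    h, c = lstm_step(x, h, c, W)
```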

Encoder–Decoder LSTM Network

The encoder–decoder LSTM was initially introduced for natural language processing problems, where it demonstrated state-of-the-art performance [89]. The unsupervised representation learned by the encoder–decoder LSTM naturally facilitates the learning of temporal representations in our proposed method. An encoder–decoder LSTM is a pair of RNNs acting as an encoder and a decoder. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps this vector representation back to a variable-length target sequence [87]. The encoder–decoder LSTM is designed specifically for sequence-to-sequence (seq2seq) problems, defined as:

$$ p\left( y_{1}, \ldots, y_{T} \mid x_{1}, \ldots, x_{L} \right) = \prod_{t = 1}^{T} p\left( y_{t} \mid c, y_{1}, \ldots, y_{t - 1} \right) $$
(15)

where \( y_{1}, \ldots, y_{T} \) is the output sequence, \( x_{1}, \ldots, x_{L} \) is the input sequence, and c is the context vector (the hidden vector of the last LSTM unit), which summarizes the input sequence. The goal of the encoder–decoder LSTM is to build a neural network that models and maximizes this conditional probability. The encoder maps the input sequence to the context vector as:

$$ h_{l} = g_{1} \left( {h_{l - 1} , x_{l} } \right) , c = h_{L} $$
(16)

The decoder then predicts the elements of the output sequence one step at a time:

$$ \begin{aligned} s_{t} & = g_{2}\left( s_{t - 1}, \left[ y_{t - 1}, c \right] \right) \\ p\left( y_{t} \mid c, y_{1}, \ldots, y_{t - 1} \right) & = \sigma\left( W s_{t} + b \right) \\ \end{aligned} $$
(17)

After greedy layer-wise training, all encoder layers are stacked, followed by all decoder layers, to form a deep network, which is then fine-tuned to minimize the prediction loss.
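
For concreteness, the sketch below mirrors the encode and decode recurrences of Eqs. (16) and (17). The generic transition functions g1 and g2 are stood in for by plain tanh recurrences rather than the stacked LSTM layers used in practice, and every name, dimension, and the random data are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(xs, g1, h0):
    """Eq. (16): fold the input sequence into the context vector c = h_L."""
    h = h0
    for x in xs:
        h = g1(h, x)
    return h  # context vector c

def decode(c, g2, W, b, s0, y0, T):
    """Eq. (17): generate T outputs conditioned on the context vector c."""
    s, y, outputs = s0, y0, []
    for _ in range(T):
        s = g2(s, np.concatenate([y, c]))   # decoder state update
        y = sigmoid(W @ s + b)              # p(y_t | c, y_1..y_{t-1})
        outputs.append(y)
    return outputs

# Toy usage with simple tanh recurrences standing in for g1 and g2.
rng = np.random.default_rng(0)
d_x, d_h, d_y, L, T = 16, 32, 16, 10, 10
Wg1 = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
Wg2 = rng.normal(scale=0.1, size=(d_h, d_h + d_y + d_h))
W_out, b_out = rng.normal(scale=0.1, size=(d_y, d_h)), np.zeros(d_y)

g1 = lambda h, x: np.tanh(Wg1 @ np.concatenate([h, x]))
g2 = lambda s, v: np.tanh(Wg2 @ np.concatenate([s, v]))

xs = rng.normal(size=(L, d_x))
c = encode(xs, g1, np.zeros(d_h))
ys = decode(c, g2, W_out, b_out, np.zeros(d_h), np.zeros(d_y), T)
```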

Deep Embedding Clustering Network

Deep embedding clustering (DEC) [10] jointly optimizes the feature transformation and the clustering. The algorithm uses a pre-trained autoencoder as an initial estimate of the data representation and then removes the decoder. The remaining encoder is fine-tuned with a clustering loss in a self-training manner using high-confidence samples, updating the parameters of the transformation network and the cluster centers simultaneously. Let \( z_{i} = f_{\theta}\left( x_{i} \right) \) be the mapping of the pre-trained encoder, where \( x_{i} \) is an input sample. Using \( f_{\theta} \), we obtain all embedded points \( \left\{ z_{i} \right\} \). Then, k-means is applied to \( \left\{ z_{i} \right\} \) to obtain the initial cluster centers \( \left\{ \mu_{j} \right\} \). Afterward, the objective function is defined as follows:

$$ L = \mathrm{KL}\left( P \,\|\, Q \right) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$
(18)

where \( q_{ij} \) denotes the similarity between embedded point \( z_{i} \) and cluster center \( \mu_{j} \), measured by the Student's t-distribution:

$$ q_{ij} = \frac{\left( 1 + \left\| z_{i} - \mu_{j} \right\|^{2} / \alpha \right)^{-\frac{\alpha + 1}{2}}}{\sum_{j'} \left( 1 + \left\| z_{i} - \mu_{j'} \right\|^{2} / \alpha \right)^{-\frac{\alpha + 1}{2}}} $$
(19)

\( p_{ij} \) is the target distribution, which uses high-confidence assignments as soft supervision to make the clusters denser, defined as:

$$ p_{ij} = \frac{q_{ij}^{2} / \sum_{i} q_{ij}}{\sum_{j'} \left( q_{ij'}^{2} / \sum_{i} q_{ij'} \right)} $$
(20)

Since \( p_{ij} \) is derived from \( q_{ij} \), minimizing the loss L becomes a form of self-training. The cluster assignment of sample \( x_{i} \) is \( \arg \max_{j} q_{ij} \).
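
The following NumPy sketch computes the soft assignments of Eq. (19), the target distribution of Eq. (20), and the KL loss of Eq. (18). In DEC the loss would be backpropagated through the encoder \( f_{\theta} \) to update both the network and the centers; that optimization is omitted here, and the embeddings and initial centers are random placeholders rather than outputs of a pre-trained encoder and k-means.

```python
import numpy as np

def soft_assign(Z, mu, alpha=1.0):
    """Eq. (19): Student's t soft assignment q_ij between embeddings and centers."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances (n, k)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Eq. (20): sharpen q into the target distribution p."""
    weight = q ** 2 / q.sum(axis=0)                         # q_ij^2 / sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-12):
    """Eq. (18): KL(P || Q) clustering loss."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

# Toy usage: in DEC, Z would come from the pre-trained encoder and mu from k-means.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 10))                   # embedded points z_i (placeholder)
mu = Z[rng.choice(100, 5, replace=False)]        # initial cluster centers (placeholder)
q = soft_assign(Z, mu)
p = target_distribution(q)
loss = kl_loss(p, q)
assignments = q.argmax(axis=1)                   # cluster assignment argmax_j q_ij
```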


Cite this article

Koohzadi, M., Charkari, N.M. A Context Based Deep Temporal Embedding Network in Action Recognition. Neural Process Lett 52, 187–220 (2020). https://doi.org/10.1007/s11063-020-10248-1

