Abstract
This paper explores a two-stream Convolutional Neural Network (2S-CNN) architecture, with a novel multi-level feature fusion technique, for categorizing events in videos. The two streams combine dense optical flow features with: (a) RGB frames; and (b) salient object regions detected using a fast space-time saliency method. The main contribution is the design of a classifier-moderated method that fuses information from the two streams at multiple stages of the network, which enables capturing the most discriminative and complementary features for localizing the spatio-temporal attention for the action being performed. This mutual auto-exchange of information in local and global contexts combines appearance and dynamism for enhanced discrimination, thus improving categorization performance. The network is trained end-to-end and subsequently evaluated on two challenging human action recognition benchmark datasets, UCF-101 and HMDB-51, where the proposed 2S-CNN method outperforms the current state-of-the-art ConvNets by a significant margin.
1 Introduction
Although Convolutional Neural Networks (CNNs) have achieved significant success in image recognition, the video domain still remains unconquered. As traditional hand-crafted features failed to produce acceptable accuracy for classifying actions in videos, CNNs and recurrent architectures have recently emerged as part of the deep learning (DL) trend. These models try to learn a set of optimal features for accurate classification in a supervised manner, thereby shifting research from the design of features to the building of complex deep networks.
This paper adds to the existing two-stream architectures by providing an adaptive and controlled multi-stage fusion strategy between the two pathways. The mutual exchange of information occurs in two contexts: (a) local and (b) global. The local context fusion captures the spatio-temporal cues in a small window of optical flow features centered around an RGB frame or the saliency map extracted from it, whereas the global fusion aims to integrate all these local operations and fix any errors accumulated during the local stage by using a projective Long Short Term Memory (LSTM) architecture [14] with added residual pathways. The local stage also incorporates controlled intermediate classification to automatically capture the spatio-temporal features for better inter-class discrimination. Experiments on the popular real-world datasets UCF-101 [16] and HMDB-51 [11] show a considerable performance gain for the proposed model over the current state-of-the-art techniques. Recently, deep learning techniques have achieved great success in the field of object recognition from images. Effective applications [4, 10, 15, 17] include the use of volumetric or 3D CNNs, Trajectory-Pooled Deep-Convolutional Descriptors, Factorized Spatio-Temporal CNNs and recurrent networks consisting of LSTM modules.
The proposed Two-Stream Convolutional Neural Network (2S-CNN) architecture works by computing the optical flow and saliency map using sufficiently fast and accurate methods. The use of motion features along with the pixel values of the input frames allows the model to automatically learn the action being performed in each frame. 2S-CNNs were proposed by Simonyan and Zisserman [15] based on the primary assumption that action in videos can be decomposed into spatial and temporal parts. Although they split the network into two parts, fusion of information takes place only after the class scores are predicted. Our model aims to rectify this by introducing fusion nodes at multiple stages of the network to facilitate controlled exchange of information at various scales of the input, which is a significant contribution of the paper.
2 Two-Stream Multi-level Fusion Architecture
The input to the network (see Fig. 1a) combines stacked optical flow features with: (a) RGB frames and (b) saliency maps extracted from individual frames.
Optical Flow Stream. Optical flow is first computed using the method proposed by Brox et al. [1]. To compensate for noise in the flow, an affine flow vector is estimated following [8]. At a point \( p=(x,y) \) at time t, this can be modeled as
\[ w_{aff}(p) = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (1) \]
where \( c_i \) and \( a_i \) represent the translation, rotation and scaling parameters. The affine flow vector \( w_{aff} \) obtained using Eq. 1 is then subtracted from the original flow vector \( w_{flow} \) to obtain the final corrected flow field \( w_{cor} \), as:
\[ w_{cor}(p) = w_{flow}(p) - w_{aff}(p) \qquad (2) \]
The flow information obtained between two frames \( F_t \) and \( F_{t+1} \) comprises \( w_{cor}^x \) and \( w_{cor}^y \), the horizontal and vertical components of the flow field respectively. These two fields are stacked over C frames to obtain 2C input channels, such that they efficiently encode motion information at every point of the input frame.
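The compensation and stacking steps described above can be sketched as follows. This is a minimal illustration, not the estimation procedure of [8]: the affine parameters are fitted here by ordinary least squares, which is an assumption, and the function names (fit_affine_flow, corrected_flow, stack_flow) as well as the interleaved channel order are hypothetical.

```python
import numpy as np

def fit_affine_flow(flow):
    """Fit an affine model c + A p to a dense flow field (assumed least squares).

    flow: array of shape (H, W, 2) with horizontal and vertical components.
    Returns the fitted affine flow field, same shape as the input.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each design row is [1, x, y] for a pixel p = (x, y).
    design = np.stack([np.ones(h * w), xs.ravel(), ys.ravel()], axis=1)
    targets = flow.reshape(-1, 2)
    # Columns of params hold (c_1, a_1, a_2) and (c_2, a_3, a_4).
    params, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return (design @ params).reshape(h, w, 2)

def corrected_flow(flow):
    """Subtract the fitted affine flow from the original flow field."""
    return flow - fit_affine_flow(flow)

def stack_flow(flows):
    """Stack C corrected flow fields (each H x W x 2) into H x W x 2C channels."""
    return np.concatenate([corrected_flow(f) for f in flows], axis=-1)
```

Under this sketch, a flow field that is purely affine (for example, one produced by a global pan or zoom) is reduced to approximately zero, so that only the residual motion reaches the network.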
Saliency Map Stream. As the proposed architecture aims to categorize actions from video data, a fast but accurate saliency method is used which considers both the spatial (appearance) and temporal information, as proposed by Zhou et al. [21]. The saliency is computed by first over-segmenting the input video into color-coherent Spatio-Temporal Regions (STRs) and subsequently computing three feature vectors encoding the color statistics, the normalized histogram of the flow magnitude and the distribution of flow orientation, respectively.
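As a rough illustration of the two motion features mentioned above, the following sketch computes a normalized flow-magnitude histogram and an orientation distribution for a single region. It is not the implementation of [21]: the number of bins, the unweighted orientation histogram and the function name flow_region_features are assumptions, and the color statistics are omitted.

```python
import numpy as np

def flow_region_features(region_flow, n_bins=8):
    """Motion features for one region (n_bins is an assumed setting).

    region_flow: array of shape (N, 2) holding the flow vectors of the
    N pixels that belong to the region.
    Returns (magnitude histogram, orientation histogram), each summing to 1.
    """
    u, v = region_flow[:, 0], region_flow[:, 1]
    magnitude = np.hypot(u, v)
    orientation = np.arctan2(v, u)  # angles in [-pi, pi]
    # Guard against an all-zero flow field when choosing the histogram range.
    upper = max(float(magnitude.max()), 1e-8)
    mag_hist, _ = np.histogram(magnitude, bins=n_bins, range=(0.0, upper))
    ori_hist, _ = np.histogram(orientation, bins=n_bins, range=(-np.pi, np.pi))
    return mag_hist / mag_hist.sum(), ori_hist / ori_hist.sum()
```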
RGB Stream. The RGB stream takes an RGB frame \( F_t \) as its input. This captures the overall visual features from the frame, and is aided at multiple levels by the optical flow stream to focus its attention on the desired localization of action in frames.
2.1 Multi-level Feature Fusion in 2S-CNN
The proposed architecture incorporates fusion at three levels as shown in Fig. 1a. These nodes at the intermediate stages capture information from the two streams and aid the later part of the network in identifying finer details to classify the actions more accurately. In the proposed model, the fusion of the two streams is done adaptively in two modes: (a) Local context and (b) Global context.
Local Context Fusion. This fusion strategy works on a sequence of optical flow frames \( \{w_{cor}^{t \pm C}\} \) centered at the RGB frame \( F_t \) at time t, so that information is fused in a context of 2C clips. Intuitively, the spatial (RGB/Saliency) stream captures the position of the target motion, whereas the temporal (optical flow) stream identifies the motion pattern in the local window. The main idea behind this exchange of information at a local level is to place the pixel-wise feature responses from the two parallel networks in harmony. The proposed 2S-CNN architecture achieves this using convolutional fusion guided by internal as well as final classifiers, as shown by the blocks FConv 1, FConv 2 and the penultimate Fully Connected layer in Fig. 1a.
Also, as the model is very deep (51 convolutional layers), residual connections have been used to ease the training process. Residual networks attain this through skip connections inserted throughout the network. These units are represented as: \( X_{l+1}=f(X_l+\mathcal {F}(X_l;\mathcal {W}_l)) \), where \( X_l \) and \( X_{l+1} \) denote the input and output of the lth layer respectively, and \( \mathcal {F} \) is a nonlinear residual mapping parameterized by the filter weights \( \mathcal {W}_l \).
During back propagation, these skip connections help the gradients to propagate directly from the loss layers to any of the previous layers, bypassing the intermediate ones, resulting in a vastly diminished chance of vanishing gradients.
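The residual unit above can be sketched in a few lines. This is only an illustration of the skip connection: the residual mapping is taken here to be a two-layer dense mapping on a feature vector rather than the convolutional layers of the actual network, f is assumed to be a ReLU, and the names residual_unit, w1 and w2 are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Compute f(x + F(x; W)) with f = ReLU (assumed).

    x:  input features of shape (..., d)
    w1: weights of shape (d, d_hidden)
    w2: weights of shape (d_hidden, d)
    F is an assumed two-layer mapping, F(x) = relu(x @ w1) @ w2.
    """
    residual = relu(x @ w1) @ w2
    # The identity path x bypasses the learned mapping entirely.
    return relu(x + residual)
```

When the residual weights are zero, the unit reduces to f applied to its input, which is why the identity path gives gradients a direct route to earlier layers.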
The fusion operation can be thought of as a function \( f : (P_a, P_b) \longrightarrow Q \), where \( P_a, P_b, Q \in \mathbb {R}^{h\times w\times d} \), and h, w and d denote the height, width and depth of the feature maps respectively. The convolution-based fusion stacks the two corresponding sets of feature maps from the respective streams and applies a series of convolutional operations to produce the output. In the architecture shown in Fig. 1a, the stacked output of the fusion node is fed to two convolutional layers having receptive fields of \(1 \times 1\) and \(3 \times 3\). The purpose of including the \( 1 \times 1 \) convolution kernels is to introduce a non-linearity without altering the size of the feature maps. All the convolutions are followed by a ReLU activation function [6] and Local Response Normalization (LRN), which aims to imitate biological neurons by implementing a type of lateral inhibition in the network. The fusion operation is finalized by normalizing the input batches spatially.
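A minimal sketch of such a fusion node is given below, assuming that the \(1 \times 1\) convolution reduces the stacked depth from 2d back to d and that LRN uses commonly quoted hyperparameters (size 5, k = 2, alpha = 1e-4, beta = 0.75); neither choice is stated in the text. The final batch normalization step is omitted, and all function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_same(x, kernel):
    """Stride-1, zero-padded convolution (cross-correlation form).

    x: (H, W, C_in); kernel: (k, k, C_in, C_out) with odd k.
    """
    k = kernel.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for i in range(k):
        for j in range(k):
            # Shifted input window times the kernel slice at offset (i, j).
            out += padded[i:i + h, j:j + w, :] @ kernel[i, j]
    return out

def lrn(x, size=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local Response Normalization across neighbouring channels."""
    sq = x ** 2
    depth = x.shape[-1]
    half = size // 2
    denom = np.empty_like(x)
    for c in range(depth):
        lo, hi = max(0, c - half), min(depth, c + half + 1)
        denom[..., c] = (k + alpha * sq[..., lo:hi].sum(axis=-1)) ** beta
    return x / denom

def conv_fusion(p_a, p_b, k1, k3):
    """Fuse two (h, w, d) feature maps into one (h, w, d) map.

    k1: (1, 1, 2d, d) kernel; k3: (3, 3, d, d) kernel (assumed shapes).
    """
    stacked = np.concatenate([p_a, p_b], axis=-1)  # (h, w, 2d)
    y = lrn(relu(conv2d_same(stacked, k1)))
    return lrn(relu(conv2d_same(y, k3)))
```

With this layout the \(1 \times 1\) kernel acts as a per-pixel linear map across the stacked channels, so the spatial size of the feature maps is unchanged.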
Global Context Fusion. Although the local context fusion captures the correspondences between the spatial and temporal streams for recognizing action, it often under-performs if the input video has sudden viewpoint changes, unpredictable camera motion or jittery frames. These artifacts result in erroneous learning of spatio-temporal features and, as the fusion is applied at several levels, the error accumulates, resulting in incorrectly classified actions. To overcome this, the proposed method globally fuses the information obtained from several local context fusion stages over the entire video. This global information fusion is achieved by the use of stacked deep LSTM units [7] with projection layers.
As the training of LSTM layers is computationally expensive for large models, the proposed method uses the Long Short-Term Memory Projected (LSTMP) architecture [14]. The computational complexity of learning LSTM models per time step is O(N), where N is the total number of parameters, and is dominated by the factor \( n_c \times (4 \times n_c + n_o) \), where \( n_c \) and \( n_o \) are respectively the number of memory cells and output units. Hence, in a large network similar to the proposed one, the learning time becomes infeasible for even a moderate number of output dimensions and cells. The LSTMP architecture reduces the computational cost by inserting a recurrent projection layer after the normal LSTM unit and feeding its output back to the input module. In this case, the cost of computation is dominated by \( n_r \times (4 \times n_c+n_o) \), where \( n_r \) denotes the number of projection layer units. The projection layer thus scales the dominant parameter count of the LSTM module by a factor of \( \frac{n_r}{n_c} \) and helps to increase the memory capacity of the model substantially.
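The following sketch illustrates one LSTMP time step together with the dominant cost terms quoted above. It is a simplified reading of [14], not the configuration used in the experiments: peephole connections and the output layer are omitted, the gate ordering is arbitrary, and the names lstmp_step and dominant_cost are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, w_x, w_r, b, w_proj):
    """One LSTMP step (assumed form, no peepholes).

    x: (n_i,) input; r_prev: (n_r,) previous projected state;
    c_prev: (n_c,) previous cell state.
    w_x: (4*n_c, n_i); w_r: (4*n_c, n_r); b: (4*n_c,); w_proj: (n_r, n_c).
    """
    gates = w_x @ x + w_r @ r_prev + b
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    m = sigmoid(o) * np.tanh(c)
    # The n_c-dimensional cell output is projected down to n_r units,
    # and this projection (not m) is what is fed back at the next step.
    r = w_proj @ m
    return r, c

def dominant_cost(n_units, n_c, n_o):
    """Dominant term n_units * (4 * n_c + n_o); n_units is n_c for a
    standard LSTM and n_r for LSTMP."""
    return n_units * (4 * n_c + n_o)
```

For illustrative sizes \( n_c = 1024 \), \( n_r = 256 \) and \( n_o = 101 \) (values chosen here, not taken from the experiments), dominant_cost(256, 1024, 101) / dominant_cost(1024, 1024, 101) evaluates to 0.25, matching the ratio \( \frac{n_r}{n_c} \).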
3 Experimentation and Results
Rigorous experiments are performed to compare the performance of the 2S-CNN architecture with multi-level fusion on two moderately large real-world datasets: (a) UCF-101 [16] and (b) HMDB-51 [11].
3.1 Evaluation of Performance
The selection of fusion stages. The network is first trained at the local context level mentioned in Sect. 2.1 using multiple combinations of fusions at several stages, separately over the two datasets described above. The results, listed in Table 1, follow the trend observed in [4], that fusing information at the earlier stages has less impact on the overall classification accuracy. Amongst the different experimental combinations, the highest performance gain was obtained by implementing the fusion before three levels in the network, viz. Conv 2, Conv 3 and the final fully connected layer, as shown in Fig. 1a. Table 1 compares the performance of the network under different combinations of local fusion, with and without the auxiliary classifiers introduced.
Training the Global Context Level. The global context stage is implemented using 1, 2 and 3 levels of LSTMP layers to capture the motion over the whole video. Comparison of the results in Tables 2 and 3 shows that the addition of this global fusion strategy achieves a superior gain in classification accuracy over using only the local features (see Sect. 2.1) extracted from the short chunks of the input. As the gain in accuracy from inserting the third LSTMP layer was insignificant compared to the previous case of two stacked LSTMP layers, and also due to the high computational cost of BPTT for deeper recurrent networks, experimentation with more layers was not performed.
As evident from Tables 1 and 3, the global context fusion provides a significant improvement in predicting actions in the HMDB-51 dataset over the local stage of the network. The reason is that HMDB-51 videos have high intra-class variation and pose several challenges in the form of camera motion, jitter and low quality. As the features get fine-tuned at the local stage, the final global fusion module takes advantage of temporal modeling of the learned features to better discriminate them.
Combination of Optical Flow with Saliency Maps. The model was trained using both RGB frames and saliency maps obtained using the methodology described in Sect. 2. For both datasets, using the saliency maps coupled with optical flow features results in better performance (as in Table 2) than the use of plain RGB frames. This is because the segmented salient regions give the model a clue to better localize the distinguishing features from those parts, while suppressing noisy areas which would otherwise have contributed outliers and thus reduced the classification accuracy.
Finally, results in Table 4 compare the performance of the proposed architecture with other state-of-the-art and recent methods, utilizing hand-crafted and deep learned features, revealing the superiority of performance on both the challenging datasets.
4 Conclusion
The two-stream CNN, along with the adaptive fusion strategy incorporated at different contextual levels, shows the best performance on two of the most popular and challenging action datasets. A controlled and adaptive multi-level fusion strategy at both local and global contexts is the highlight of this paper. The use of saliency maps in combination with optical flow provides a performance gain over the direct use of RGB frames, as evident from the results in Table 2. Also, incorporating the fusion of information at two contexts results in a significant improvement in performance.
References
Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24673-2_3
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR, pp. 2625–2634 (2015)
Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for video action recognition. arXiv preprint arXiv:1611.02155 (2016)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. arXiv preprint arXiv:1604.06573 (2016)
Fernando, B., Gavves, E., Oramas, J.M., Ghodrati, A., Tuytelaars, T.: Modeling video evolution for action recognition. In: CVPR, pp. 5378–5387 (2015)
Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947–951 (2000)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Jain, M., Jegou, H., Bouthemy, P.: Better exploiting motion for better action recognition. In: CVPR, pp. 2555–2562 (2013)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. T-PAMI 35(1), 221–231 (2013)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR, pp. 1725–1732 (2014)
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: Hmdb: a large video database for human motion recognition. In: ICCV, pp. 2556–2563 (2011)
Lan, Z., Lin, M., Li, X., Hauptmann, A.G., Raj, B.: Beyond Gaussian pyramid: multi-skip feature stacking for action recognition. In: CVPR, pp. 204–212 (2015)
Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked fisher vectors. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 581–595. Springer, Cham (2014). doi:10.1007/978-3-319-10602-1_38
Sak, H., Senior, A.W., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: ISCA (2014)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS, pp. 568–576 (2014)
Soomro, K., Zamir, A.R., Shah, M.: Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Sun, L., Jia, K., Yeung, D.-Y., Shi, B.E.: Human action recognition using factorized spatio-temporal convolutional networks. In: ICCV, pp. 4597–4605 (2015)
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV, pp. 3551–3558 (2013)
Wang, X., Farhadi, A., Gupta, A.: Actions \({\sim }\) transformations. arXiv preprint arXiv:1512.00795 (2015)
Wu, Z., Jiang, Y.-G., Wang, X., Ye, H., Xue, X., Wang, J.: Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086 (2015)
Zhou, F., Bing Kang, S., Cohen, M.F.: Time-mapping using space-time saliency. In: CVPR, pp. 3358–3365 (2014)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Bhattacharjee, P., Das, S. (2017). Two-Stream Convolutional Network with Multi-level Feature Fusion for Categorization of Human Action from Videos. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_70
DOI: https://doi.org/10.1007/978-3-319-69900-4_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer Science, Computer Science (R0)