
1 Introduction

Although significant success has come in the image recognition domain through the use of Convolutional Neural Networks (CNNs), the field of video understanding remains largely unconquered. As traditional hand-crafted features failed to produce acceptable classification accuracy for action recognition from videos, CNNs and other recurrent architectures have recently emerged as part of the deep learning (DL) trend. These models learn a set of optimal features for accurate classification in a supervised manner, thereby shifting the research focus from the design of features to the building of complex deep networks.

This paper offers a novel addition to existing two-stream architectures by providing an adaptive and controlled multi-stage fusion strategy between the two pathways. The mutual exchange of information occurs in two contexts: (a) local and (b) global. The local context fusion captures spatio-temporal cues in a small window of optical flow features centered around an RGB frame or the saliency map extracted from it, whereas the global fusion integrates all of these local operations and corrects any errors accumulated during the local stage by using a projected Long Short-Term Memory (LSTM) architecture [14] with added residual pathways. The local stage also incorporates controlled intermediate classification to automatically capture spatio-temporal features for better inter-class discrimination. Experiments on the popular real-world datasets UCF-101 [16] and HMDB-51 [11] show considerable performance gains for the proposed model over current state-of-the-art techniques. Recently, deep learning techniques have achieved great success in object recognition from images; effective applications [4, 10, 15, 17] include volumetric or 3D CNNs, Trajectory-Pooled Deep-Convolutional Descriptors, Factorized Spatio-Temporal CNNs, and recurrent networks built from LSTM modules.

The proposed Two-Stream Convolutional Neural Network (2S-CNN) architecture works by computing the optical flow and saliency map using sufficiently fast and accurate methods. Using motion features along with the pixel values of the input frames allows the model to automatically learn the action being performed in each frame. 2S-CNNs were proposed by Simonyan and Zisserman [15] based on the primary assumption that action in videos can be decomposed into spatial and temporal parts. Although they split the network into two parts, fusion of information takes place only after the class scores are predicted. Our model rectifies this by introducing fusion nodes at multiple stages of the network to facilitate controlled exchange of information at various scales of the input, which is a significant contribution of this paper.

2 Two-Stream Multi-level Fusion Architecture

The input to the network (see Fig. 1a) combines stacked optical flow features with: (a) RGB frames and (b) saliency maps extracted from individual frames.

Optical Flow Stream. Optical flow information is first computed using the method proposed by Brox et al. [1]. Following [8], an affine flow vector is estimated to compensate for the noise introduced by camera motion. It can be modeled at a point \( p=(x,y) \) at time t as

$$\begin{aligned} w_{aff}(p_t)=\begin{bmatrix} c_1(t) \\ c_2(t) \end{bmatrix} + \begin{bmatrix} a_1(t)&a_2(t) \\ a_3(t)&a_4(t) \end{bmatrix} \begin{bmatrix} x_t \\ y_t \end{bmatrix}, \end{aligned}$$
(1)

where \( c_i \) and \( a_i \) represent the translation and the rotation/scaling parameters, respectively. The affine flow vector obtained from Eq. 1 is then subtracted from the original flow vector \( w_{flow} \) to obtain the final corrected flow field \( w_{cor} \):

$$\begin{aligned} w_{cor}(p_t)=w_{flow}(p_t)-w_{aff}(p_t) \end{aligned}$$
(2)

The flow information obtained between two frames \( F_t \) and \( F_{t+1} \) comprises \( w_{cor}^x \) and \( w_{cor}^y \), the horizontal and vertical components of the flow field respectively. These two fields are stacked over C frames to obtain 2C input channels, so that motion information is efficiently encoded at every point of the input frame.
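
As a concrete illustration, the following NumPy sketch evaluates the affine field of Eq. 1 on the pixel grid, applies the correction of Eq. 2 and stacks the corrected fields into the 2C-channel input. The helper names and parameter ordering are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def affine_flow(params, h, w):
    """Evaluate Eq. 1 on a regular pixel grid.
    params = (c1, c2, a1, a2, a3, a4), estimated for frame t."""
    c1, c2, a1, a2, a3, a4 = params
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    u = c1 + a1 * xs + a2 * ys        # horizontal affine component
    v = c2 + a3 * xs + a4 * ys        # vertical affine component
    return np.stack([u, v], axis=-1)  # (h, w, 2)

def corrected_flow(flow, params):
    """Eq. 2: subtract the affine (camera-motion) field from the raw flow."""
    h, w, _ = flow.shape
    return flow - affine_flow(params, h, w)

def stack_flow_channels(corrected_flows):
    """Stack the horizontal and vertical components of C corrected flow
    fields into a single (h, w, 2C) input for the temporal stream."""
    return np.concatenate(corrected_flows, axis=-1)
```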

Saliency Map Stream. As the proposed architecture aims at categorizing actions from video data, a fast yet accurate saliency method that considers both spatial (appearance) and temporal information is used, as proposed by Zhou et al. [21]. The saliency is computed by first over-segmenting the input video into color-coherent Spatio-Temporal Regions (STRs) and subsequently computing three feature vectors encoding the color statistics, the normalized histogram of the flow magnitude and the distribution of flow orientation, respectively.
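
The per-region descriptors could be computed along the following lines; this is a simplified per-frame NumPy sketch in which the segmentation labels are assumed to be given, whereas the exact STR construction and statistics follow [21].

```python
import numpy as np

def region_descriptors(frame_rgb, flow, labels, n_bins=8):
    """Per-region descriptors: color statistics, normalized flow-magnitude
    histogram and flow-orientation distribution (one row per region)."""
    mag = np.linalg.norm(flow, axis=-1)
    ang = np.arctan2(flow[..., 1], flow[..., 0])
    feats = []
    for r in np.unique(labels):
        m = labels == r
        color = np.concatenate([frame_rgb[m].mean(0), frame_rgb[m].std(0)])
        h_mag, _ = np.histogram(mag[m], bins=n_bins, density=True)
        h_ang, _ = np.histogram(ang[m], bins=n_bins,
                                range=(-np.pi, np.pi), density=True)
        feats.append(np.concatenate([color, h_mag, h_ang]))
    return np.vstack(feats)
```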

RGB Stream. The RGB stream receives an RGB frame \( F_t \) as its input. It captures the overall visual features of the frame and is aided at multiple levels by the optical flow stream, which helps it focus its attention on the localization of action within the frame.

2.1 Multi-level Feature Fusion in 2S-CNN

The proposed architecture incorporates fusion at three levels as shown in Fig. 1a. These nodes at the intermediate stages capture information from the two streams and aid the later part of the network in identifying finer details to classify the actions more accurately. In the proposed model, the fusion of the two streams is done adaptively in two modes: (a) Local context and (b) Global context.

Local Context Fusion. This fusion strategy works on a sequence of optical flow frames \( \{w_{cor}^{t-C}, \ldots, w_{cor}^{t+C}\} \) centered at the RGB frame \( F_t \) at time t, fusing information within a local context of 2C flow frames. Intuitively, the spatial (RGB/saliency) stream captures the position of the target motion, whereas the temporal (optical flow) stream identifies the motion pattern in the local window. The main idea behind this exchange of information at a local level is to bring the pixel-wise feature responses of the two parallel networks into harmony. The proposed 2S-CNN architecture uses convolutional fusion guided by internal as well as final classifiers to achieve this, as shown by the blocks FConv 1, FConv 2 and the penultimate fully connected layer in Fig. 1a.

Also, as the model is very deep (51 convolutional layers), residual connections have been used to ease the training process. Residual networks achieve this through skip connections inserted throughout the network. These units are represented as \( X_{l+1}=f(X_l+\mathcal{F}(X_l;\mathcal{W}_l)) \), where \( X_l \) and \( X_{l+1} \) denote the input and output of the l-th layer respectively, and \( \mathcal{F} \) is a nonlinear residual mapping parameterized by the filter weights \( \mathcal{W}_l \).

During back-propagation, these skip connections allow the gradients to flow directly from the loss layers to any of the earlier layers, bypassing the intermediate ones and greatly reducing the chance of vanishing gradients.
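
A minimal PyTorch sketch of such a residual unit, assuming two 3x3 convolutions for the mapping \( \mathcal{F} \); the actual block configuration of the 51-layer network may differ.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Skip connection of the form X_{l+1} = f(X_l + F(X_l; W_l))."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The identity path lets gradients bypass the residual mapping F.
        return torch.relu(x + self.residual(x))
```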

The fusion operation can be thought of as a function \(f : P_a,P_b\) \(\longrightarrow \) Q, where \(P_a,P_b,Q \in \mathbb {R}^{h\times w\times d}\), and h, w and d denote the height, width and depth of the feature maps respectively. The convolution-based fusion stacks the two corresponding sets of feature maps from the respective streams and applies a series of convolutional operations to produce the output. In the architecture shown in Fig. 1a, the stacked output of the fusion node is fed to two convolutional layers with receptive fields of \(1 \times 1\) and \(3 \times 3\). The \( 1 \times 1 \) convolution kernels are included to introduce a non-linearity without altering the spatial size of the feature maps. All convolutions are followed by a ReLU activation function [6] and Local Response Normalization (LRN), which imitates biological neurons by implementing a form of lateral inhibition in the network. The fusion operation is finalized by normalizing the input batches spatially.
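
A hedged PyTorch sketch of one such fusion node is given below; the channel counts, LRN size and exact ordering of ReLU, LRN and batch normalization are illustrative assumptions rather than the exact FConv configuration.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Local-context fusion node: stack two equally shaped feature maps and
    mix them with 1x1 and 3x3 convolutions (in the spirit of FConv 1/2)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # non-linearity without resizing
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.BatchNorm2d(channels),  # spatial batch normalization
        )

    def forward(self, p_a, p_b):
        # p_a, p_b: (N, d, h, w) maps from the spatial and temporal streams.
        return self.fuse(torch.cat([p_a, p_b], dim=1))
```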

Fig. 1.

(a) The two-stream multilevel local context fusion model. The dotted arrows denote the residual connections between two convolutional blocks. (b) The global context fusion stage. Dotted arrows between the LSTMP layers indicate hidden state outputs. The global stage aggregates all the input video clips and predicts the action label by pooling the final softmax scores over the entire input video.

Global Context Fusion. Although the local context fusion captures the correspondences between the spatial and temporal streams for recognizing actions, it often under-performs when the input video has sudden viewpoint changes, unpredictable camera motion or jittery frames. These disturbing artifacts result in erroneous learning of spatio-temporal features, and as the fusion is applied at several levels, the errors accumulate and lead to incorrectly classified actions. To overcome this, the proposed method globally fuses the information obtained from the several local context fusion stages over the entire video. This global information fusion is achieved with stacked deep LSTM units [7] equipped with projection layers.

As the training of LSTM layers is computationally expensive for large models, the proposed method uses the Long Short-Term Memory Projected (LSTMP) architecture [14]. The computational complexity of learning LSTM models per time step is O(N) and is dominated by the factor \( n_c \times (4 \times n_c + n_o) \), where \( n_c \) and \( n_o \) are the number of memory cells and output units respectively. Hence, in a large network similar to the proposed one, the learning time becomes infeasible for even a moderate number of output dimensions and cells. The LSTMP architecture reduces the computational cost by inserting a projection (recurrent) layer after the normal LSTM unit and feeding its output back to the input module. In this case, the cost of computation is dominated by \( n_r \times (4 \times n_c+n_o) \), where \( n_r \) denotes the number of projection layer units. The projection layer thus reduces the number of parameters of the LSTM module by a factor of \( \frac{n_r}{n_c} \) and helps to increase the memory capacity of the model substantially.
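
As an illustration, recent PyTorch versions expose the recurrent projection directly through the proj_size argument of nn.LSTM. The sketch below stacks two such LSTMP layers over the per-clip local-fusion features and pools the softmax scores over the whole video; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    """Stacked LSTM-with-projection (LSTMP) over per-clip local features.
    proj_size inserts the n_r-dimensional recurrent projection, reducing the
    recurrent parameters by roughly a factor of n_r / n_c."""
    def __init__(self, feat_dim, n_cells=1024, n_proj=256,
                 n_layers=2, n_classes=101):
        super().__init__()
        self.lstmp = nn.LSTM(input_size=feat_dim, hidden_size=n_cells,
                             proj_size=n_proj, num_layers=n_layers,
                             batch_first=True)
        self.classifier = nn.Linear(n_proj, n_classes)

    def forward(self, clip_feats):
        # clip_feats: (N, T, feat_dim) local-fusion features, one per clip.
        out, _ = self.lstmp(clip_feats)             # (N, T, n_proj)
        scores = self.classifier(out).softmax(-1)   # per-clip class scores
        return scores.mean(dim=1)                   # pool over the video
```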

3 Experimentation and Results

Rigorous experiments are conducted to compare the performance of the 2S-CNN architecture with multi-level fusion on two real-world, moderately large datasets: (a) UCF-101 [16] and (b) HMDB-51 [11].

3.1 Evaluation of Performance

Selection of Fusion Stages. The network is first trained at the local context level described in Sect. 2.1 using multiple combinations of fusions at several stages, separately over the two datasets described above. The results, listed in Table 1, follow the trend observed in [4]: fusing information at the earlier stages has less impact on the overall classification accuracy. Among the different combinations tested, the highest performance gain was obtained by implementing the fusion before three levels of the network, viz. Conv 2, Conv 3 and the final fully connected layer, as shown in Fig. 1a. Table 1 compares the performance of the network under different combinations of local fusion, with and without the auxiliary classifiers.

Table 1. Classification accuracy achieved by the local context fusion stage for different combinations of feature fusion at several stages of the network. Results are shown on split 1 of the UCF-101 dataset [16]. A similar trend was observed on the HMDB-51 dataset [11].

Training the Global Context Level. The global context stage is implemented using 1, 2 and 3 levels of LSTMP layers to capture the motion over the whole video. Comparison of the results in Tables 2 and 3 shows that adding this global fusion strategy achieves a superior gain in classification accuracy over using only the local features (see Sect. 2.1) extracted from short chunks of the input. As the gain in accuracy from inserting a third LSTMP layer over two stacked LSTMP layers was insignificant, and because of the high computational cost of back-propagation through time (BPTT) for deeper recurrent networks, experiments with more layers were not performed.

Table 2. Results for the local fusion network using combination of flow features with saliency maps and RGB frames. All the results are on split 1 of UCF-101 and HMDB-51.
Table 3. Classification accuracy for the global context network using several levels of LSTMP units.

As evident from Tables 1 and 3, the global context fusion provides a significant improvement over the local stage of the network in predicting actions on the HMDB-51 dataset. The reason is that HMDB-51 videos have high intra-class variation and several challenges in the form of camera motion, jitter and low quality. As the features are fine-tuned at the local stage, the final global fusion module takes advantage of the temporal modeling of the learned features to discriminate them better.

Combination of Optical Flow with Saliency Maps. The model was trained using both RGB frames and the saliency maps obtained with the methodology described in Sect. 2. For both datasets, using saliency maps coupled with optical flow features results in better performance (see Table 2) than using plain RGB frames. This is because the segmented salient regions give the model a cue to better localize the distinguishing features in those parts, while suppressing noisy areas which would otherwise have contributed outliers and reduced the classification accuracy.

Table 4. Comparison of the proposed 2S-CNN architecture with state-of-the-art methods. (*) corresponds to methods using Improved Dense Trajectory (IDT) features. Missing numbers indicate that performance data for the corresponding method was not reported on the particular dataset in the published article.

Finally, the results in Table 4 compare the performance of the proposed architecture with other recent and state-of-the-art methods based on hand-crafted and deep-learned features, showing superior performance on both challenging datasets.

4 Conclusion

The two-stream CNN with an adaptive fusion strategy incorporated at different contextual levels achieves the best performance on two of the most popular and challenging action datasets. A controlled and adaptive multi-level fusion strategy at both local and global contexts is the highlight of this paper. Using saliency maps in combination with optical flow provides a performance gain over using raw RGB frames, as evident from the results in Table 2. Moreover, incorporating the fusion of information at both contexts results in a significant improvement in performance.