1 Introduction

Action recognition is an active research problem in computer vision with applications in, e.g., real-time surveillance, security, video retrieval, human-computer interfaces and sports video analysis. Many approaches employ the popular bag-of-words framework for action recognition. Bag-of-words representations built on local descriptors have shown promising results for action recognition [20, 27]. Several local features, such as 3D-SIFT [21] and motion boundary histograms [26], are utilized for video description. These features capture shape, appearance and motion information crucial for action recognition.

Recently, convolutional neural networks (CNNs) have shown promising results on a variety of vision applications including action recognition [22]. CNNs are typically trained on large amounts of labeled data and consist of a series of convolution and pooling operations followed by one or more fully-connected (FC) layers. Initially, most deep learning based approaches relied on capturing appearance information by training the network on RGB patches. Recently, motion-based CNN features have been investigated for the problems of action classification and detection [7, 22]. The motion-based CNNs operate on a dense optical flow signal to capture the motion patterns. The appearance and motion deep networks are combined in a late fusion manner, and features from the FC layers are then used for classification. The deep appearance and motion features are used both as holistic representations [22] and in combination with human pose estimation [3].

As discussed above, activations extracted from the output of the FC layers of the deep network are typically used as features for domain transfer in CNNs. Different to these approaches, recent studies [4, 18] have shown that activations from the convolutional layers provide excellent performance for object and texture recognition. These layers can be exploited within a bag-of-features pipeline by employing their activations as dense local features. The convolutional layer based bag-of-features framework has been successfully used for object and texture recognition [4]. The deeper convolutional layers are known to possess higher discriminative power [29], and using them removes the need for a fixed input image size. In this work, we investigate the fusion of appearance and motion based local convolutional features for the problem of action recognition.

In the last decade, most approaches for image classification and action recognition relied on the popular bag-of-words (BOW) representation. The BOW approach starts with a feature detection and extraction stage, where hand-crafted features such as SIFT [19] are used for image description. Feature extraction is followed by a vocabulary construction step in which the local features are vector quantized into a fixed-size visual codebook. The final representation is then obtained by encoding the local features with respect to this visual vocabulary. Within the BOW framework, the fusion of multiple cues such as color and shape is a well studied problem [15, 24]. The two standard strategies to fuse multiple cues within the BOW framework are early and late feature fusion. Early fusion combines color and shape at the feature level, resulting in a joint multi-cue visual vocabulary. The second strategy, late fusion, fuses multiple cues at the feature encoding level by concatenating the explicit image representations of the individual visual cues. Early fusion possesses the property of feature binding, since the spatial connection between color and shape is preserved at the feature level, and has been shown to provide improved results for natural scene categories [15]. Late fusion provides feature compactness, since separate vocabularies are constructed for each visual cue, and has been shown to provide superior performance for man-made categories [15].

As mentioned above, the two standard fusion approaches are each only optimal for a specific type of object category. The color attention based fusion approach [15] aims to combine the advantages of both early and late fusion. In the color attention approach, color is used to modulate the shape features. The modulation can be applied both top-down and bottom-up, and results in sampling more shape features from image regions that are likely to contain an object instance. Color attention possesses the feature binding property since color and shape are combined at the feature level. However, like late fusion, it also possesses the feature compactness property since separate vocabularies are constructed for color and shape. Color attention combines hand-crafted color and shape features for bag-of-words based object recognition. In this work, we revisit the attention based fusion framework [15] to fuse motion and appearance local features, obtained from the convolutional layers of deep networks, for bag-of-deep-features based action recognition.

Contributions: In this paper, we investigate the problem of fusing deep appearance and motion features for action recognition. We introduce an attention based bag-of-deep-features framework to combine appearance and motion based local convolutional features. First, dense appearance (RGB) and motion (flow) based local convolutional features are extracted from the spatial and temporal deep networks, respectively. Afterwards, separate visual vocabularies are constructed for the deep motion and appearance features. Class-specific appearance information is then learned and used to modulate the weights of the deep motion features. Consequently, a category-specific histogram is constructed for each action class, resulting in a discriminative video representation.

We validate our proposed approach by performing experiments on two challenging video datasets, namely JHMDB with 21 categories and ACT with 43 action classes. On the JHMDB dataset, the proposed approach provides a significant performance improvement of 4.6% and 4.1% compared to the standard early and late fusion approaches, respectively. Similarly, on the ACT dataset, the proposed approach obtains a gain of 3.2% and 2.4% compared to early and late fusion, respectively. Furthermore, our approach, without exploiting body part information, achieves competitive performance compared to state-of-the-art approaches employing deep appearance and motion features.

2 Related Work

Recently, CNNs have shown significant performance improvements over the previous state-of-the-art for various computer vision applications such as image classification and action recognition [17, 22]. CNNs, also known as deep networks, comprise a series of convolution and pooling layers followed by several fully connected (FC) layers, and are trained using a large amount of labeled training data. Several recent works have proposed deep features based video representations for action recognition [3, 7, 22]. Simonyan and Zisserman [22] proposed a two-stream CNN architecture where separate deep networks are trained to capture spatial and temporal features. The spatial network operates on RGB images, whereas the temporal stream takes an optical flow signal as input. The work of [3] introduced pose based CNNs built on appearance and flow information for action classification.

As discussed above, state-of-the-art action recognition approaches [3, 7, 22] employ deep architectures in which both appearance (RGB) and motion (optical flow) information is exploited. Generally, the appearance and motion based CNNs are trained separately and combined at the FC layers. Beyond the FC layers, recent works [4, 18] in image classification have demonstrated the effectiveness of convolutional layer activations over FC ones. The convolutional layers are discriminative while still containing semantically meaningful information. The work of [4] proposed a bag-of-deep-features approach where convolutional layer activations are used as local descriptors. In this work, we employ the bag-of-deep-features framework and investigate the problem of fusing deep appearance and motion information for action recognition.

There exist two main approaches to fuse multiple cues within the bag-of-features framework. The first approach, called early fusion, combines the visual cues before the vocabulary construction stage. This results in a single visual vocabulary whose visual words represent multiple visual cues. Early fusion is shown to be especially suitable for natural object categories [15]. The second fusion strategy is termed late fusion, where the two visual cues are processed separately and only combined at the final representation level. This implies that separate vocabularies are constructed for each visual cue and the final representation is the concatenation of the individual representations. Late fusion is shown to provide improved performance compared to early fusion for man-made object categories [14, 15]. Further, late fusion of color and shape has been shown to provide improved performance compared to early fusion for texture recognition [11], object detection [10] and action recognition [9]. Different to early and late fusion, the work of [14, 15] proposed an attention based fusion framework to combine color and shape features for object recognition. In the color attention framework, color and shape are processed separately by constructing explicit vocabularies for both cues. Color is used to construct top-down attention maps, which are used to modulate the shape features. Color attention was shown to provide superior performance compared to both early and late fusion for object recognition.

Our Approach: Here, we revisit the fusion framework of [14, 15] to combine deep appearance and motion based local features, within the bag-of-deep-features framework, for action recognition. We train separate deep networks to capture appearance (RGB) and motion (optical flow) information. Activations from the last convolutional layer of the two networks are then used as local features for each video frame. We construct separate vocabularies for the deep motion and appearance features. Deep appearance is used to construct top-down attention maps, which are used to modulate the deep motion features. A fixed-length video-level representation is then obtained by max aggregation over all video frames.

3 Deep Features for Action Recognition

We train two CNNs to capture spatial and temporal features. The spatial network is trained on the ImageNet ILSVRC-2012 dataset [5] and the temporal network is trained on the UCF101 dataset [23].

Appearance Features: The spatial network takes an RGB image as input and captures the appearance information. We employ the VGG-F network [2], which is similar to AlexNet but faster to train. The network consists of five convolutional and three fully-connected layers and takes an RGB image of \(224\times 224\) pixels as input. The first convolutional layer uses a stride of 4 pixels, while the remaining four convolutional layers use a stride of 1 pixel. The number of convolution filters is 64 in the first convolutional layer and 256 in each of the remaining four convolutional layers. During training, the learning rate is set to 0.001, the weight decay to 0.0005 and the momentum to 0.9.
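As an illustration, the sketch below shows how last-convolutional-layer activations can be extracted and reshaped into dense local descriptors. It is a minimal sketch only: torchvision's AlexNet (requiring torchvision 0.13 or later) stands in for the paper's VGG-F, since both share a five-conv/three-FC layout, and the image path is a hypothetical placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# AlexNet is used here purely as a stand-in for the VGG-F spatial network.
net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "frame.jpg" is a hypothetical path to one video frame.
frame = preprocess(Image.open("frame.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    fmap = net.features(frame)   # output of the conv stack: (1, 256, H', W')

# Each spatial site of the last conv layer becomes one local descriptor f_bj.
local_feats = fmap.squeeze(0).flatten(1).t()   # shape (H'*W', 256)
```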

Motion Features: The temporal network takes an optical flow signal as input and captures the motion information. Similar to [3, 7], we compute the optical flow from each consecutive pair of frames using the method of [1]. The values of the motion fields are transformed to the interval [0, 255]. The flow maps are saved as a three-channel image by stacking the flow in the \(x\)- and \(y\)-directions together with the flow magnitude. The network is trained on optical flow images from the UCF101 dataset [23], containing 13320 videos and 101 classes. Similar to the spatial network, we employ the VGG-F architecture consisting of five convolutional and three FC layers. The work of [7] trains a temporal network using region proposals for action detection. Different to [7], our network is trained on the optical flow of the entire image for action classification. Figure 1 shows a few activations from the spatial (RGB) and temporal (flow) networks.
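A possible construction of such a three-channel flow image is sketched below. OpenCV's Farneback flow is used as a stand-in for the method of [1], the per-channel rescaling to [0, 255] is one reasonable choice rather than the paper's exact mapping, and the frame filenames are hypothetical.

```python
import cv2
import numpy as np

# Hypothetical consecutive frames of a video clip.
prev = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_001.jpg"), cv2.COLOR_BGR2GRAY)

# Farneback flow as a stand-in for the flow method of [1]; returns (H, W, 2).
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
mag = np.linalg.norm(flow, axis=2)

def to_uint8(channel):
    """Rescale a real-valued flow channel to the interval [0, 255]."""
    lo, hi = channel.min(), channel.max()
    return np.uint8(255 * (channel - lo) / (hi - lo + 1e-8))

# Stack flow-x, flow-y and the flow magnitude as the three image channels.
flow_image = np.dstack([to_uint8(flow[..., 0]),
                        to_uint8(flow[..., 1]),
                        to_uint8(mag)])
cv2.imwrite("flow_000.png", flow_image)
```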

Fig. 1. Visualization of activations with the highest energy from the deepest (last) convolutional layer of the spatial (top row) and temporal (bottom row) networks. Appearance feature maps are computed from the RGB frame (top left) and motion feature maps from the corresponding flow image (bottom left).

4 Top-Down Appearance Attention for Action Recognition

Here, we investigate fusion strategies to combine deep appearance and motion based convolutional features for the problem of action recognition. We then propose an attention based framework in which top-down appearance information is used to modulate the deep motion features. Given a video, we extract dense local convolutional features \(f_{bj}\), \(j=1,...,M^{b}\), in each frame \({B^b}, b=1,2,...,N\), where \(M^{b}\) is the total number of feature sites in frame b. The dense local features are taken from the last convolutional layer of the deep spatial (RGB) and temporal (flow) networks, respectively. The extracted local convolutional appearance and motion features are then quantized into fixed-size visual vocabularies. The visual vocabularies are represented as \({\mathrm {W}^\mathrm {k}}=\{\mathrm {w}_{1}^\mathrm {k}, ...,{\mathrm {w}_{\mathrm {V}^\mathrm {k}}^\mathrm {k}}\}\) with \(k\in \{ap,mo,apmo\}\), denoting the two separate vocabularies for appearance and motion and the joint appearance-motion vocabulary, respectively. In the case of early fusion, the local features \(f_{bj}\) are quantized into a single vocabulary with joint appearance-motion words \(\mathrm{w}_\mathrm{bj}^\mathrm{apmo}\). In the case of late fusion, separate visual vocabularies are constructed for the appearance and motion cues with visual words \((\mathrm{w}_\mathrm{bj}^\mathrm{ap}\), \(\mathrm{w}_\mathrm{bj}^\mathrm{mo})\). In both cases, the visual word \(\mathrm {w}_\mathrm{bj}^\mathrm{k} \in \mathrm {W}^\mathrm{k}\) is the \(j^{th}\) quantized convolutional feature of the \(b^{th}\) frame of a video for visual cue k.
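The vocabulary construction and quantization step can be sketched with scikit-learn's K-means as follows. The descriptors are random placeholders, the vocabulary size is reduced from the paper's 4096/8192 for brevity, and the joint early-fusion vocabulary is realized here by concatenating appearance and motion descriptors at the same sites, which is one possible instantiation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder descriptors: in practice, these would hold the dense
# last-conv-layer features of all training frames (one row per spatial site).
rng = np.random.default_rng(0)
app_feats = rng.standard_normal((5000, 256))   # appearance descriptors
mot_feats = rng.standard_normal((5000, 256))   # motion descriptors

V = 128  # vocabulary size; the paper uses 4096 (and 8192 for early fusion)

# Late fusion / attention: one vocabulary per cue.
voc_ap = KMeans(n_clusters=V, n_init=4, random_state=0).fit(app_feats)
voc_mo = KMeans(n_clusters=V, n_init=4, random_state=0).fit(mot_feats)

# Early fusion: a single joint vocabulary, here built on the concatenation of
# appearance and motion descriptors extracted at the same spatial sites.
voc_apmo = KMeans(n_clusters=V, n_init=4, random_state=0).fit(
    np.hstack([app_feats, mot_feats]))

# Quantization: assign each local feature to its nearest visual word w_bj^k.
w_ap = voc_ap.predict(app_feats)
w_mo = voc_mo.predict(mot_feats)
w_apmo = voc_apmo.predict(np.hstack([app_feats, mot_feats]))
```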

In the standard bag-of-words framework, the final representation is a histogram constructed by counting the occurrence of each visual word in a frame. In the case of early fusion, a single histogram is constructed based on the joint appearance-motion words:

$$\begin{aligned} h\left( {\mathrm{{w}_{n}^\mathrm{{apmo}}}|{B^b}} \right) \propto \sum \limits _{j=1}^{{M^b}} {\delta \left( \mathrm{w_{bj}^{apmo},{\mathrm{{w}}_{n}^\mathrm{{apmo}}}} \right) } \end{aligned}$$
(1)

with

$$\begin{aligned} \delta \left( {x,y} \right) =\left\{ {\begin{array}{*{20}{c}} {\;\;0\;\;\mathrm{{for}}\;x \ne y\;\;\;} \\ {1\;\;\mathrm{{for}}\;x = y} \\ \end{array}} \right. \end{aligned}$$
(2)

In the case of late fusion, we construct separate histogram representations for appearance \(h\left( {\mathrm{{w}_{n}^\mathrm{{ap}}}|{B^b}} \right) \) and motion \(h\left( {\mathrm{{w}_{n}^\mathrm{{mo}}}|{B^b}} \right) \), respectively. The two histograms are then concatenated to obtain the final representation. As discussed earlier, each of the two fusion approaches is advantageous only for a certain set of categories. Early fusion possesses the property of feature binding due to the joint vocabulary, whereas late fusion possesses the property of feature compactness due to the separate visual vocabularies.
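In code, Eqs. 1 and 2 amount to per-frame word counting; the sketch below uses randomly generated word indices as hypothetical stand-ins for the quantized features of one frame, and the L1 normalization is a common choice rather than one prescribed by the paper.

```python
import numpy as np

V = 128  # vocabulary size, matching the sketch above

def bow_histogram(words, vocab_size):
    """Frame-level BOW histogram: count occurrences of each visual word (Eqs. 1-2)."""
    h = np.bincount(words, minlength=vocab_size).astype(float)
    return h / max(h.sum(), 1.0)   # L1 normalization, a common (not paper-specified) choice

# Hypothetical quantized words of one frame (in practice from the K-means step above).
rng = np.random.default_rng(1)
w_ap_frame = rng.integers(0, V, size=36)     # appearance words of the frame
w_mo_frame = rng.integers(0, V, size=36)     # motion words of the frame
w_apmo_frame = rng.integers(0, V, size=36)   # joint appearance-motion words

# Early fusion: a single histogram over the joint words (Eq. 1).
h_early = bow_histogram(w_apmo_frame, V)

# Late fusion: per-cue histograms, concatenated into the final frame representation.
h_late = np.concatenate([bow_histogram(w_ap_frame, V),
                         bow_histogram(w_mo_frame, V)])
```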

Next, we introduce an attention based bag-of-deep-features framework to combine deep appearance and motion based local convolutional features. In the attention framework, the visual cues are processed separately and combined at a later stage in the presence of top-down attention. We reformulate Eq. 1 to modulate the motion features with top-down appearance attention:

$$\begin{aligned} h\left( \mathrm{w}_n^{mo} |B^b , class\right) \propto \sum \limits _{j = 1}^{M^b } a\left( \mathbf{x}_{bj}, class\right) \delta \left( \mathrm{w_{bj}^{mo} ,\mathrm{w}_{n}^{mo} } \right) , \end{aligned}$$
(3)

where \(a\left( \mathbf{x}_{bj}, class\right) \) is the attention weight of the \(j^{th}\) local feature of the \(b^{th}\) frame. The attention is top-down and induces spatial binding, since it depends on both the location \(\mathbf{x}_{bj}\) and the corresponding action class. The top-down attention component \(a\left( \mathbf{x}_{bj}, class\right) \) is defined to be the probability of an action class given the deep appearance value at that location, described as

$$\begin{aligned} a\left( \mathbf{x}_{bj}, class\right) = p\left( {class| \mathrm{w}_\mathrm{bj}^\mathrm{ap}} \right) . \end{aligned}$$
(4)

Here, \(\mathrm{w}_\mathrm{bj}^\mathrm{ap}\) denotes an appearance visual word. We compute the appearance probabilities \(p\left( {class| \mathrm{w}_\mathrm{bj}^\mathrm{ap}} \right) \) as

$$\begin{aligned} p\left( {class|{\mathrm{{w}}^\mathrm{{ap}}}} \right) \propto p\left( {{\mathrm{{w}}^\mathrm{{ap}}}|class} \right) p\left( {class} \right) \end{aligned}$$
(5)

where \(p\left( {{\mathrm{{w}}^\mathrm{{ap}}}|class} \right) \) is the empirical distribution, obtained by summing over the indices \(I^{class}\) of the training frames belonging to the action category:

$$\begin{aligned} p\left( \mathrm{{w_{n}^{ap}}}|class \right) \propto \sum \limits _{I^{class}} {\sum \limits _{j=1}^{{M^b}} {\delta \left( w_{bj}^{ap},\mathrm{{w_{n}^{ap}}} \right) }}, \end{aligned}$$
(6)

To obtain the prior \(p\left( {class} \right) \) over the action classes, we use the training data. The attention formulation in Eq. 3 reduces to the standard bag-of-deep-features based motion histogram when the probabilities \(p\left( {class|{\mathrm{{w}}^\mathrm{{ap}}}} \right) \) are uniform. In the attention framework, motion features are given more weight in regions with high appearance attention than in regions where the attention is low. Note that, due to the top-down attention, a different distribution is obtained for each action class from the same deep motion visual words. The final representation is obtained by concatenating all action category-specific histograms. The proposed attention based representation combines the advantages of both late and early fusion. Similar to early fusion, the final representation possesses the feature binding property since the appearance and motion features are bound spatially. Similar to late fusion, it possesses the feature compactness property since separate vocabularies are constructed for the appearance and motion features. Finally, an attention based video representation is obtained by max-aggregating the frame-level attention histograms over all video frames.
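A minimal sketch of Eqs. 3-6 is given below, assuming that appearance and motion words are available at the same spatial sites of each frame; the toy sizes, the smoothing constants and the exact normalization are illustrative choices not specified in the paper.

```python
import numpy as np

def class_posteriors(frame_words_ap, frame_labels, vocab_size, n_classes):
    """Estimate p(class | w^ap) from training frames via the empirical
    distribution p(w^ap | class) and the class prior (Eqs. 5 and 6)."""
    counts = np.zeros((n_classes, vocab_size))
    for words, label in zip(frame_words_ap, frame_labels):
        counts[label] += np.bincount(words, minlength=vocab_size)
    prior = np.bincount(frame_labels, minlength=n_classes).astype(float)
    prior /= prior.sum()
    p_w_given_c = counts / (counts.sum(axis=1, keepdims=True) + 1e-12)
    post = p_w_given_c * prior[:, None]                      # Bayes rule (Eq. 5)
    return post / (post.sum(axis=0, keepdims=True) + 1e-12)  # normalize over classes

def attention_histogram(w_ap_frame, w_mo_frame, posteriors, vocab_size):
    """Class-specific motion histograms of one frame, modulated by the
    top-down appearance attention a(x_bj, class) = p(class | w_bj^ap) (Eq. 3)."""
    n_classes = posteriors.shape[0]
    hists = np.zeros((n_classes, vocab_size))
    for wa, wm in zip(w_ap_frame, w_mo_frame):   # appearance/motion word at the same site
        hists[:, wm] += posteriors[:, wa]
    return hists.reshape(-1)                     # concatenation of the per-class histograms

# Toy example with hypothetical sizes: 3 classes, 16-word vocabularies, 5 training frames.
rng = np.random.default_rng(2)
V, C = 16, 3
train_words_ap = [rng.integers(0, V, size=36) for _ in range(5)]
train_labels = np.array([0, 1, 2, 1, 0])
post = class_posteriors(train_words_ap, train_labels, V, C)

# Frame-level attention histograms of one test video, then max aggregation (Sect. 4).
frame_hists = [attention_histogram(rng.integers(0, V, 36), rng.integers(0, V, 36), post, V)
               for _ in range(4)]
video_repr = np.max(np.stack(frame_hists), axis=0)
```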

Fig. 2. Example images from the JHMDB dataset (top two rows) and the ACT dataset (bottom row). The JHMDB dataset consists of 928 video clips of 21 different action categories, such as jump, golf, shoot-gun, shoot-bow, kick-ball, brush-hair and swing-baseball. The ACT dataset consists of 11234 video clips of 43 different action categories, such as swinging-golf, swinging-tennis, pouring-juice, jumping-high and cutting-apple.

5 Experiments

Here, we present the results of our experiments. We compare our approach with the standard early and late fusion approaches. We also compare our approach with state-of-the-art results reported in the literature.

Datasets: We validate our approach on two challenging video datasets: the JHMDB [8] and ACT [25] datasets. The JHMDB dataset consists of 21 human actions, such as jump, golf, climb and swing-baseball. It contains 928 video clips, with 36 to 55 clips per action category and 15 to 40 frames per clip. We use the train/test splits provided with the dataset. The ACT dataset contains 11234 high quality video clips, divided into 7260 training videos and 3974 test videos, and consists of 43 action classes such as swinging-golf, cutting-orange and cutting-apple. We again use the train/test splits provided with the dataset. On both datasets, the performance is evaluated in terms of mean accuracy over all action classes. Each test video clip is assigned the action category label of the classifier giving the highest response. Figure 2 shows example images from the JHMDB and ACT datasets.

Experimental Setup: As discussed in Sect. 3, we employ spatial and temporal deep networks to obtain appearance and motion features. We train the RGB and flow VGG-F networks using the MatConvNet library [2]. The RGB network is trained on the ImageNet 2012 dataset, whereas the flow network is trained on the UCF101 dataset. The standard deep features are extracted from the FC7 layer of the spatial (RGB) and temporal (flow) networks, respectively. For the bag-of-deep-features representations, we extract the convolutional features from the output of the last convolutional layer of the two networks. For both RGB and flow, we construct vocabularies of 4096 words using the K-means algorithm. Since the final representation for late fusion is \(4096+4096 = 8192\) dimensional, we construct a vocabulary of 8192 words for early fusion to ensure a fair comparison. For classification, we employ SVMs with linear kernels.
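For completeness, the classification and evaluation stage could look like the following scikit-learn sketch, where LinearSVC stands in for the linear-kernel SVM and the toy data are hypothetical placeholders for the stacked video-level representations.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mean_class_accuracy(y_true, y_pred, n_classes):
    """Mean of the per-class accuracies, the metric reported on both datasets."""
    accs = [np.mean(y_pred[y_true == c] == c)
            for c in range(n_classes) if np.any(y_true == c)]
    return float(np.mean(accs))

# Hypothetical toy data standing in for the stacked video-level representations.
rng = np.random.default_rng(3)
n_classes, dim = 21, 4096
X_train, y_train = rng.standard_normal((200, dim)), rng.integers(0, n_classes, 200)
X_test, y_test = rng.standard_normal((50, dim)), rng.integers(0, n_classes, 50)

clf = LinearSVC(C=1.0)          # linear SVMs, trained one-vs-rest over the action classes
clf.fit(X_train, y_train)
pred = clf.predict(X_test)      # each clip gets the label of the highest-scoring classifier
print(mean_class_accuracy(y_test, pred, n_classes))
```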

Attention Cue Evaluation: We first evaluate the role of appearance and motion features as the attention and modulated cues. As discussed earlier, the attention cue carries the prior knowledge about the action categories and is used to alter the weights of the histogram of the modulated cue. We investigate appearance-appearance, motion-motion, motion-appearance and appearance-motion attention models. Table 1 shows the results for the different attention-modulated cue pairs. The results do not change when the same visual cue is used as both the attention and the modulated cue. The accuracy improves when motion is used as the attention cue compared to using appearance alone. The best results are obtained when deep appearance features are used as the attention cue to modulate the weights of the motion features. On JHMDB, the top-down appearance attention provides a gain of 5.8% compared to the motion-motion attention.

Table 1. Attention cue evaluation (mean accuracy in \(\%\)). We evaluate the role of appearance and motion as attention and modulated cues. The best results are obtained when appearance is used as an attention cue to modulate deep motion features.
Table 2. Feature fusion evaluation (mean accuracy in \(\%\)). We evaluate fusion strategies both with standard deep features (FC layer) and bag-of-deep-features framework. On the JHMDB dataset, our proposed fusion approach provides the best results with a gain of \(1.8\%\) compared to the late fusion of standard deep features (FC-A, FC-M). On the ACT dataset, our approach achieves the best performance with a gain of \(2.4\%\) compared to the late fusion of BOW deep features (A, M).

Feature Fusion Comparison: We compare our fusion approach with two standard fusion methods: early and late fusion. We further compare our approach with standard deep features extracted from the FC layers of the deep networks. Table 2 shows the results of the different fusion approaches on the two action recognition datasets. On the JHMDB dataset, the standard FC based deep appearance features achieve a classification score of \(36.8\%\), while the standard FC based deep motion features obtain \(57.0\%\). The late fusion of FC based deep appearance and motion features improves the results by \(2.9\%\), to a recognition accuracy of \(59.9\%\). The bag-of-deep-features based appearance and motion representations obtain classification scores of \(41.7\%\) and \(55.8\%\), respectively. Among the standard BOW based fusion approaches, late fusion provides slightly better performance than early fusion, with a classification score of \(57.6\%\). The best results are obtained with our attention based fusion framework, which provides a significant gain of \(4.1\%\) compared to the late fusion of appearance and motion (A, M).

Fig. 3. Per-class comparison of our fusion approach with early and late fusion on the JHMDB dataset. Our approach improves the results on most of the action classes.

On the ACT dataset (Table 2), the standard FC based deep appearance features achieve a classification accuracy of \(56.7\%\), and the standard FC based motion features achieve a recognition rate of \(59.1\%\). The late fusion of FC based motion and appearance features provides a score of \(68.5\%\). The bag-of-deep-features based motion and appearance representations obtain classification scores of \(61.0\%\) and \(56.1\%\), respectively. The early feature fusion approach improves the classification results to a recognition accuracy of \(68.7\%\), and late feature fusion obtains a score of \(69.5\%\). Our fusion approach outperforms both early and late fusion by achieving a classification accuracy of \(71.9\%\). Furthermore, our approach outperforms the late fusion of FC based standard deep appearance and motion features by \(3.4\%\).

Table 3. State-of-the-art comparison on the JHMDB dataset with 21 action categories. The results are shown in terms of accuracy (\(\%\)). Our approach provides superior results compared to existing methods.

It is worthwhile to investigate the combination of our bag-of-deep-features based fusion approach and the standard FC based deep features, since they are potentially complementary. A further gain of \(8.4\%\) and \(6.2\%\) in accuracy is obtained (presented in Tables 3 and 4) on the JHMDB and ACT datasets respectively by combining our attention based fusion approach with the standard FC based deep features. This clearly suggests that our bag-of-deep-features based fusion representation is complementary to the standard deep features and combining them results in a significant improvement in performance.

Figure 3 shows the per-category performance comparison of our approach with late and early fusion methods on the JHMDB dataset. Our attention based method provides improved performance on nine classes and achieves similar results to the standard fusion approaches on six categories. A significant gain is achieved especially for clap (\(+21.0\%\)), pick (\(+16.7\%\)) and sit (\(+12.2\%\)) categories, all in comparison to the two standard fusion methods.

Table 4. State-of-the-art comparison on the ACT dataset with 43 classes. The results are shown in terms of accuracy (\(\%\)). The existing approaches are based on the very deep VGG16 architecture. Our approach provides competitive performance despite employing the shallower VGG-F architecture. It is worth mentioning that our fusion approach is generic and can be used with any CNN architecture, including very deep networks such as VGG16.

State-of-the-Art Comparison: Table 3 shows the state-of-the-art comparison on the JHMDB dataset. Our final representation is the combination of the proposed attention based fusion and the standard FC based deep features. The P-CNN framework [3], which combines body part information with appearance and motion based deep features, achieves a classification accuracy of \(59.9\%\). The results are further improved to \(64.7\%\) when using improved dense trajectory features with pose information. Our approach, without exploiting any pose information, achieves a gain of \(5.4\%\) compared to IDT-FV Pose [3]. It is worth mentioning that the IDT-FV Pose method [3] is complementary to our approach and their combination is expected to further improve the results.

Table 4 shows the comparison on the ACT dataset. The work of [22] proposes a two-stream CNN using the very deep VGG16 architecture and obtains an accuracy of \(78.7\%\). The Siamese network based approach [25], which models an action as a transformation on a high-level feature space, achieves a score of \(80.6\%\). Our approach, employing the shallow VGG-F network, provides competitive performance with a score of \(78.7\%\). Moreover, our fusion approach is generic and can be used with any CNN architecture, including very deep networks such as VGG16. The two-stream and Siamese network approaches are complementary to our method and can be combined with it to further improve the results.

6 Conclusions

We proposed an approach within the bag-of-deep-features framework to combine deep appearance and motion features. Appearance and motion based local features are extracted from the spatial and temporal networks, respectively. Separate vocabularies are constructed for the appearance and motion cues. Top-down deep appearance information is used to modulate the deep motion features. Experiments show that our approach provides significant improvements compared to the standard fusion approaches based on the same set of deep features. A promising future direction is to investigate the integration of semantic part based information [12, 13] within the proposed framework. Another research direction is to investigate integrating semantic information in a weakly supervised fashion [16] for real-world autonomous applications.