
1 Introduction

A video can be thought of as a visual document that may be represented along different dimensions such as frames, objects and other levels of features. Since the introduction of corner detectors in the 1980s, local features have been designed and applied to image and video analysis with great success for many tasks. For video activity analysis in real scenarios, it is crucial to explore beyond the simple use of local features. Two trends in action representation are becoming evident: (1) instead of using signal-level features for recognition, higher-level features [1] and/or semantics and attributes [2] have become a common choice; (2) the temporal and spatial relationships between features are attracting increasing attention and effort [3, 4]. Both reflect efforts toward the analysis of activities in more complex settings.

With the prevalence of video-related applications across domains such as surveillance, human-machine interaction and movie narration, automatically analyzing video content has attracted attention from both academia and industry. Action recognition is one of the most important and popular tasks, and it requires the understanding of temporal and spatial cues in videos.

Efforts have been made to build models for the representation and inference of actions. Early models used local features and achieved success under specific conditions. To represent the actions in a video, many of them applied the bag-of-features scheme and neglected the spatial and temporal relationships between the local features. Since the same set of local features can represent different actions, this scheme struggles to handle complex scenarios.

Action recognition research has put special emphasis on the representation of the spatial and temporal relationships between low-level or mid-level features. Graphical models are the choice in most work. Such models build a graph for higher-level or global action recognition based on lower-level features; examples include hidden Markov models and dynamic Bayesian networks, among others. More recently, deep learning approaches construct multi-layer graphical models to learn the representation from videos automatically and achieve state-of-the-art performance.

Motivated by the ability of graphical models to preserve the structure of the signals, we propose an action graph-based sparse coding framework. Most sparse coding models approximate the input with a linear combination of dictionary signals, where the signals are usually n-dimensional vectors. Differing from this traditional use, we apply sparse coding to action graphs, which are represented by \( n \times n \) matrices. This extension keeps the spatio-temporal structure at the signal level, while the sparse coding problem remains tractable and has effective solvers.

1.1 Sparse Coding for Visual Computing

Sparse coding and dictionary learning have attracted considerable interest during the last decade, as reviewed in [5]. Originating in computational neuroscience, sparse coding is a class of algorithms for finding a small set of basis functions that capture higher-level features in the data, given only unlabeled data [6]. Since its introduction and promotion by Olshausen and Field [7], sparse coding has been applied in many fields such as image/video/audio classification, image annotation, object/speech recognition and many others.

Zhu et al. encode local 3D spatio-temporal gradient features with sparse codes for human action recognition [8]. [9] uses sparse coding for unusual event analysis in video by learning the dictionary and the codes without supervision. It is worth noting that all of these approaches use vectorized features as input without considering the structural information among the features. [10] incorporates the geometrical structure of the data into the sparse coding framework and achieves better performance in image classification and clustering. Further, [11] proposes tensor sparse coding for positive definite matrices as input features. This motivates our work, which combines a graph representation of actions [12] with sparse coding.

Differing from most existing research, the elementary objects of the dictionary learning and sparse coding operations in our approach are graphs. More specifically, it is the graphs describing the temporal relationships that comprise our mid-level features. Graphs have been used for activity analysis in the literature. Gaur et al. [13] proposed a “string of feature graphs” (SFG) model to recognize complex activities in videos; the SFGs describe temporally ordered local feature points such as spatio-temporal interest points (STIPs) within a time window. Ta et al. [14] propose a similar idea but use hyper-graphs to represent the spatio-temporal relationships among more than two STIPs. Recognition in both works is accomplished by graph matching. However, using individual STIPs to construct the nodes can result in unstable graphs and performance. A study similar to ours is that of Brendel and Todorovic [15], who built a spatio-temporal graph based on segmented videos.

2 Action Graph from Dense Trajectories

In this section, we describe the construction of action graphs. Dense trajectories [16] are employed as low-level features from which we extract meaningful local action descriptors (referred to as actionlets hereafter). The action graphs describe the temporal relations between actionlets, and are used as features in the sparse coding framework in Sect. 2.3.

2.1 Grouping Dense Trajectories

The dense trajectories in [16] are extracted at multiple spatial scales based on a dense optical flow field. Trajectories with abrupt changes and stationary trajectories are removed from the final results. For each trajectory, the descriptor combines trajectory shape, appearance (HoG), and motion (HoF and MBH) information. Therefore, the feature vector for a single trajectory has the form

$$ Q = \left( {S,HoG,HoF,MBH_{x} ,MBH_{y} } \right) $$

where \( S = \frac{{\left( {\Delta P_{q} , \ldots ,\Delta P_{q + L - 1} } \right)}}{{\sum\nolimits_{i = q}^{q + L - 1} {\left\| {\Delta P_{i} } \right\|} }} \) is the normalized shape vector and \( L \) is the length of the trajectory. MBH is divided into \( MBH_{x} \) and \( MBH_{y} \) to describe the motion in the x and y directions respectively.

The trajectories are clustered into groups based on their descriptors, and each group consists of spatio-temporally similar trajectories which characterize the motion of a particular object or its part. Given two trajectories \( q_{1} \) and \( q_{2} \), the distance between them is

$$ d\left( {q_{1} ,q_{2} } \right) = \frac{1}{L}d_{S} \left( {q_{1} ,q_{2} } \right) \cdot \bar{d}_{spatial} \left( {q_{1} ,q_{2} } \right) \cdot d_{q} \left( {q_{1} ,q_{2} } \right) $$

where \( d_{S} \) is the Euclidean distance between the shape vectors of \( q_{1} \) and \( q_{2} \), \( \bar{d}_{spatial} \left( {q_{1} ,q_{2} } \right) \) is the mean spatial distance between corresponding trajectory points, and \( d_{q} \left( {q_{1} ,q_{2} } \right) \) is the temporal distance. Trajectories are then grouped using a graph clustering algorithm. Figure 1 shows examples of grouped trajectories, with background motion removed, for some sample videos.
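
For illustration, a minimal sketch of this pairwise distance is given below; the field names, the fixed trajectory length, and the choice of temporal distance are assumptions for illustration rather than the original implementation.

```python
import numpy as np

def trajectory_distance(q1, q2, L):
    """Sketch of the pairwise distance d(q1, q2) described above.

    q1 and q2 are assumed to be dicts with keys (illustrative names):
      'shape'  : normalized shape vector S
      'points' : (L, 2) array of trajectory point coordinates
      'start'  : starting frame index
    Both trajectories are assumed to have the same fixed length L, as in [16].
    """
    d_shape = np.linalg.norm(q1['shape'] - q2['shape'])                        # d_S
    d_spatial = np.mean(np.linalg.norm(q1['points'] - q2['points'], axis=1))   # mean point-wise distance
    d_temporal = abs(q1['start'] - q2['start']) + 1    # one simple choice of temporal distance
    return (1.0 / L) * d_shape * d_spatial * d_temporal
```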

Fig. 1. Illustration of trajectory grouping based on spatio-temporal proximity

The trajectories provide a low-level description of the action content in a video. A mean feature vector, \( x_{i} \, \in \,R^{d} \), is computed over all the trajectories in the same group. Because of the large motion variation even within the same type of action, our model clusters these trajectory groups using K-means over the \( x_{i} \, \in \,R^{d} \)’s to generate a set of prototypes of trajectory clusters, which describe different types of local actions.
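
A minimal sketch of this clustering step using scikit-learn's KMeans is shown below; the number of actionlet types is a free parameter of the model, and the value used here is only a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_actionlet_prototypes(group_descriptors, n_actionlets=50, seed=0):
    """Sketch: cluster per-group mean descriptors x_i into actionlet prototypes.

    group_descriptors: (num_groups, d) array, one mean descriptor per trajectory group.
    n_actionlets:      number of actionlet types n (placeholder value).
    """
    km = KMeans(n_clusters=n_actionlets, random_state=seed, n_init=10)
    labels = km.fit_predict(group_descriptors)   # actionlet id for each trajectory group
    return km.cluster_centers_, labels           # prototypes and group-to-actionlet assignment
```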

2.2 Action Graphs

Based on the bag-of-groups representation, our model captures the statistical temporal relations between the “groups”. We categorize Allen’s temporal relationships into two classes, overlaps \( \left( O \right) \) and separates \( \left( S \right) \), and construct two types of action graphs. It is also possible to use the original thirteen relations to construct action graphs; because the procedure is the same, we use the two categorized relations for simplicity.

For each type of action, the temporal relationship between pairs of group words is modelled by an action graph, which is a two-dimensional histogram. Each histogram records the frequency with which a relation holds between a pair of group words. That is, for a temporal relation \( R^{i} \in \left\{ {O,S} \right\} \), \( R^{i} \left( {x,y} \right) \) is the frequency of \( xR^{i} y \) between two group words \( x \) and \( y \). In our model, we construct the temporal relations for each type of action in a supervised manner. Figure 2 shows an example of the overlaps relation for different actions in one testing dataset. It can be observed that different actions exhibit different histograms, while similar actions have similar histograms. Examining each histogram shows which temporal relation (such as overlaps for boxing) has a stronger response for some pairs of group words than for others, which reveals the dominant relation between actionlets.
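
The following sketch illustrates how such a two-dimensional relation histogram could be assembled from actionlet time intervals. The interval representation and the simple overlap test, which lumps all of Allen's overlapping relations together to match the two-class categorization above, are assumptions for illustration.

```python
import numpy as np

def build_action_graphs(actionlets, n_types):
    """Sketch: build the overlaps (O) and separates (S) histograms for one video.

    actionlets: list of (type_id, t_start, t_end) tuples, one per trajectory group,
                where type_id is the K-means actionlet label (assumed representation).
    n_types:    total number of actionlet types n.
    Returns two n x n count matrices, which can be normalized to frequencies per video.
    """
    O = np.zeros((n_types, n_types))
    S = np.zeros((n_types, n_types))
    for i, (xi, si, ei) in enumerate(actionlets):
        for xj, sj, ej in actionlets[i + 1:]:
            overlapping = (si <= ej) and (sj <= ei)   # intervals share at least one frame
            target = O if overlapping else S
            target[xi, xj] += 1
            target[xj, xi] += 1                       # undirected graphs in this work
    return O, S
```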

Fig. 2. Laplacian matrices of the overlaps action graphs for five different actions in the KTH dataset. The X and Y axes index the different types of actionlets

2.3 Action Graph Representation with Sparse Coding

Given the actionlets and their temporal relationships, we proceed to present a sparse coding approach based on the temporal relationship graphs and apply it to video action recognition. Let \( Y = \left[ {y_{1} , \ldots ,y_{n} } \right] \in \,R^{d \times n} \) denote the data matrix of a video clip, where each \( y_{i} \, \in \,R^{d} \) is an actionlet descriptor. The temporal relationships separate (S) and overlap (O) are each represented by an undirected action graph in this study. Therefore \( K = 2 \) action graphs \( \left\{ {G_{k} } \right\}_{k = 1}^{K} \) are employed to cover all cases using a 1-of-K coding scheme: if actionlets \( a_{i} \) and \( a_{j} \) have the temporal relationship \( R_{k} \), then the edge \( \left( {a_{i} ,a_{j} } \right) \) exists in graph \( G_{k} \). For each type of graph, sparse coding analysis is performed separately, and the codes are then combined to form the feature representation of a video clip for tasks such as classification.

In this section, we describe the Laplacian matrix of action graphs in Sect. 2.3.1, followed by discussion on sparse coding framework in Sects. 2.3.2 and 2.3.3.

2.3.1 Laplacian Matrix of Action Graphs

As a representation of action graphs, adjacency matrices are not an ideal choice for a sparse coding framework. As shown in the following sections, symmetric positive definite matrices are desirable to compare action graphs and to reduce the problem to a classic form. In this work, we use the Laplacian matrix, \( L \), to represent the action graphs, mainly because the Laplacian matrix of a graph is always symmetric positive semi-definite (SPSD), i.e. \( \forall y,\;y^{T} Ly \ge 0 \).

There is an easy conversion between the Laplacian matrix of a graph and its adjacency or incidence matrix. For an adjacency matrix \( A \) of an action graph, the Laplacian matrix is \( L = D - A \), where \( D \) is the diagonal degree matrix. However, constructing the Laplacian matrix from the adjacency matrix only applies to simple graphs, i.e. undirected graphs without loops or multiple edges between two actionlets. Another way to obtain the Laplacian matrix is through the incidence matrix: if \( M_{\left| V \right| \times \left| E \right|} \) is the incidence matrix, then \( L = MM^{T} \). For an undirected graph, we can use its oriented incidence matrix by arbitrarily defining an order of the actionlets; for a directed graph it is straightforward to obtain \( M \) and thus \( L \). Although we use undirected graphs in this work, we obtain the Laplacian matrix from the incidence matrix to allow further extension.

To make the matrices of the action graphs symmetric positive definite (SPD), we regularize the Laplacian matrices by adding a small multiple of the identity matrix. Unless stated otherwise, all the action graphs below are represented by regularized Laplacian matrices, including the dictionary atoms and the action graph to be approximated.
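
A concrete sketch of the two Laplacian constructions and the regularization step follows; the function names and the value of the regularizer are illustrative.

```python
import numpy as np

def laplacian_from_adjacency(A):
    """L = D - A, where D is the diagonal degree matrix (simple undirected graph)."""
    return np.diag(A.sum(axis=1)) - A

def laplacian_from_incidence(A):
    """L = M M^T using an oriented incidence matrix M obtained by fixing an
    arbitrary order of the actionlets; edge weights are split as sqrt(w)."""
    n = A.shape[0]
    cols = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] != 0:
                m = np.zeros(n)
                m[i], m[j] = np.sqrt(A[i, j]), -np.sqrt(A[i, j])
                cols.append(m)
    M = np.stack(cols, axis=1) if cols else np.zeros((n, 0))
    return M @ M.T

def regularized_laplacian(L, eps=1e-6):
    """Add a small multiple of the identity to make the SPSD Laplacian strictly
    positive definite; the value of eps is a placeholder."""
    return L + eps * np.eye(L.shape[0])
```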

2.3.2 Sparse Coding for Action Graphs

Action graphs describe the temporal relationships among the actionlets, and each is represented by a Laplacian matrix. For each of the two relationships, we collect a number of action graphs from different videos of the same type. For example, graph \( O \) describes the “overlap” relationship between any two actionlets: if there exists a pair of actionlets \( a_{i} \) and \( a_{j} \) in a video whose time intervals overlap, then there is an edge between nodes \( a_{i} \) and \( a_{j} \) in graph \( O \), and its weight is the normalized frequency of \( a_{i} Oa_{j} \).

Given a set of video clips, an action graph \( A_{i} \) is constructed for each of them. For localization and precise detection purposes, the \( A_{i} \)’s are constructed from short clips or from the results of shot detection on an entire video. Let \( D = \left[ {A_{1} ,A_{2} , \ldots ,A_{p} } \right] \in \,R^{{\left( {n \times n} \right) \times p}} \) be the dictionary of action graphs, where each \( A_{i} \) is an \( n \times n \) basis relationship and \( n \) is the total number of actionlet types across different actions. For given videos, let \( G = \left[ {G_{1} ,G_{2} , \ldots ,G_{m} } \right] \in \,R^{{\left( {n \times n} \right) \times m}} \) be the action graphs extracted from them. Based on the dictionary, we decompose each graph \( G_{i} \) into a linear combination of the basis relationships

$$ G_{i} \approx \hat{G}_{i} = s_{i1} A_{1} + s_{i2} A_{2} + \ldots + s_{ip} A_{p} \triangleq s_{i} D $$

where \( s_{i} \) is the coefficient vector for action graph \( G_{i} \). Let \( S = \left[ {s_{1} ,s_{2} , \ldots ,s_{m} } \right] \) be the coefficient matrix for \( G \).
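
As a small worked illustration of this linear combination of \( n \times n \) basis graphs (all dimensions and values below are toy placeholders, not data from the experiments):

```python
import numpy as np

# Sketch: reconstruct one action graph G_i from dictionary atoms A_1..A_p
# given a sparse coefficient vector s_i.  n and p are toy values.
n, p = 4, 6
rng = np.random.default_rng(0)
D = [np.eye(n) + M @ M.T for M in rng.normal(size=(p, n, n))]   # p SPD "basis" graphs
s_i = np.array([0.0, 0.7, 0.0, 0.0, 0.3, 0.0])                  # sparse coefficients

G_hat = sum(s_ij * A_j for s_ij, A_j in zip(s_i, D))            # G_i ≈ Σ_j s_ij A_j
```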

The empirical loss function \( l\left( {G,S} \right) = \sum\nolimits_{i = 1}^{m} {d\left( {G_{i} ,s_{i} D} \right)} \) evaluates the decomposition error of representing \( G \) by \( S \) based on the dictionary \( D \). Here \( d( \cdot , \cdot ) \) measures the distortion of the approximation \( \hat{G}_{i} \) with respect to its original action graph \( G_{i} \), which can be evaluated by a distance between the two matrices. The objective function can then be formulated as

$$ \mathop {\hbox{min} }\limits_{S} \sum\nolimits_{i = 1}^{m} {\left[ {d\left( {G_{i} ,s_{i} D} \right) + \beta \left\| {s_{i} } \right\|_{1} } \right]} $$

where \( \left\| \cdot \right\|_{1} \) denotes the \( L_{1} \) norm, \( \left\| {s_{i} } \right\|_{1} \) is the sparsity term, and \( \beta \) is a parameter that trades off between the empirical loss and sparsity.

2.3.3 Distance Between Action Graphs

To evaluate the empirical loss, different distance metrics between action graphs, \( d( \cdot , \cdot ) \), could be used. Let \( S_{ + + }^{n} \) denote the set of symmetric positive definite (SPD) matrices.

In this paper we use the Logdet divergence [67] as the distortion measure because it results in a tractable convex optimization problem. For \( A,B \in S_{ + + }^{n} \), the Logdet divergence is defined by

$$ D_{ld} \left( {A,B} \right) = tr\left( {AB^{ - 1} } \right) - \log \,\det \left( {AB^{ - 1} } \right) - n $$
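
For reference, a direct sketch of this divergence in code, assuming symmetric positive definite inputs:

```python
import numpy as np

def logdet_divergence(A, B):
    """Sketch: Logdet divergence D_ld(A, B) = tr(A B^{-1}) - log det(A B^{-1}) - n
    for SPD matrices A and B."""
    n = A.shape[0]
    AB_inv = A @ np.linalg.inv(B)
    _, logdet = np.linalg.slogdet(AB_inv)        # det(AB^{-1}) > 0 for SPD A, B
    return np.trace(AB_inv) - logdet - n
```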

The Logdet divergence is convex in \( A \); therefore \( A \) can be \( G_{i} \), the true action graph that we need to estimate through a sparse combination of the basis action graphs in the dictionary \( D \). Following a procedure similar to that in [120], we transform \( D_{ld} \left( {A,B} \right) \) into a known determinant maximization (max-det) problem. The objective function becomes

$$ \mathop {\hbox{min} }\limits_{S} \sum\nolimits_{i = 1}^{m} {\left[ {s_{i}^{T} \varepsilon - \log \,\det \left( {s_{i} \hat{D}} \right)} \right]} ,\quad s.t.\;s_{i} \ge 0,\;s_{i} \hat{D} \succ 0 $$

where \( \hat{D} \) is the transformed dictionary tuned to the action graph to be approximated, \( G_{i} \), i.e., \( \hat{D} = \left[ {\hat{A}_{j} } \right]_{j = 1}^{p} \) with \( \hat{A}_{j} = G_{i}^{ - 1/2} A_{j} G_{i}^{ - 1/2} \), and \( \varepsilon \) is the vector of traces of the dictionary atoms in \( \hat{D} \), i.e., \( \varepsilon_{j} = tr\left( {\hat{A}_{j} } \right) + \beta \).

This is a convex optimization problem on \( \left\{ {s_{i} \left| {s_{i} \hat{D} \succ 0} \right.} \right\} \), known as a max-det problem [17], which admits an efficient interior-point solution. We use the CVX modeling framework to obtain the optimal values of \( S \).

Notice that the resulting coefficients are sparse, as encouraged by the \( L_{1} \) term.
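
A minimal sketch of this optimization using the cvxpy modeling package as a stand-in for CVX is given below; the dictionary transformation and \( \varepsilon \) follow the definitions above, while the function name and the value of beta are placeholders.

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

def sparse_code_action_graph(G, atoms, beta=0.1):
    """Sketch: code one regularized-Laplacian action graph G (SPD, n x n)
    over dictionary atoms A_1..A_p via the max-det formulation."""
    G_inv_sqrt = np.linalg.inv(sqrtm(G)).real
    A_hat = [G_inv_sqrt @ A @ G_inv_sqrt for A in atoms]     # A_hat_j = G^{-1/2} A_j G^{-1/2}
    eps = np.array([np.trace(Aj) + beta for Aj in A_hat])    # eps_j = tr(A_hat_j) + beta

    s = cp.Variable(len(atoms), nonneg=True)                 # s_i >= 0
    G_hat = sum(s[j] * A_hat[j] for j in range(len(atoms)))  # s_i D_hat
    problem = cp.Problem(cp.Minimize(s @ eps - cp.log_det(G_hat)))
    problem.solve()                                          # log_det enforces s_i D_hat > 0
    return s.value
```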

2.4 Experimental Results

We use the KTH dataset to evaluate our approach. We split the videos of each category into training and testing sets and build the dictionary \( D \) from the training set. An action graph is constructed for each video, and we randomly select \( k \) action graphs from each category (\( p = Nk \), where \( N \) is the number of actions) and assemble them to obtain the dictionary \( D \). Therefore, the dictionary has the form

$$ D = \left[ {A_{11} , \ldots ,A_{1k} , \ldots ,A_{N1} , \ldots ,A_{Nk} } \right] $$

For any given video, its action graph is decomposed over the dictionary and represented by the decomposition coefficients \( s_{i \cdot } \). Figure 3 shows two examples of the coefficients for two videos of different actions. For classification, we take the maximum decomposition coefficient of each category \( a \) and label the video with the category having the largest coefficient, as shown in the following equation:

Fig. 3. Plot of the optimal sparse coding solutions

$$ a^{*} = \mathop {\arg \hbox{max} }\limits_{a} \left\{ {\hbox{max} \left\{ {s_{ai} } \right\}} \right\} $$
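
A small sketch of this decision rule, assuming the coefficient vector is ordered category by category with \( k \) atoms per category as in the dictionary above:

```python
import numpy as np

def classify_by_max_coefficient(s, categories, k):
    """Sketch: label a video by the category whose dictionary atoms receive
    the largest sparse coefficient.  s has length p = N*k, grouped as k atoms
    per category in the order given by `categories`."""
    per_category_max = [s[a * k:(a + 1) * k].max() for a in range(len(categories))]
    return categories[int(np.argmax(per_category_max))]
```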

Figure 4 shows the results from the sparse coding optimization. For each testing video of the categories shown on the x-axis, we take the maximum optimized coefficient \( s_{a}^{*} \) for each category \( a \in \left\{ {box,clap,jog,run,walk} \right\} \), i.e. \( s_{a}^{*} = \hbox{max} \left\{ {s_{a \cdot } } \right\} \), and then average it over all the videos in the same category to obtain \( \bar{s}_{a}^{*} \). Each vector \( \left[ {\bar{s}_{box}^{*} , \ldots ,\bar{s}_{walk}^{*} } \right] \) corresponds to one curve in the figure. For each curve, the peak of the coefficients is consistent with the actual type of action. Figure 5 shows the decomposition coefficients of some sample videos; the shaded cells denote videos that do not attain the maximum coefficient in their own category and will thus be misclassified.

Fig. 4. Average sparse coding coefficients \( s_{i \cdot } \) for each category of videos

Fig. 5. The maximum coefficients from the sparse coding

3 Summary

We present a sparse coding approach that decomposes action graphs for analysing activities in videos. The action in a video is characterized by an action graph, which describes the temporal relationships between different types of mid-level trajectory clusters. Using graphs instead of vectors as features preserves more of the structural information about the components of actions, and the variation among graphs is handled by tensor sparse coding.