1 Introduction

Human action recognition is a central problem in computer vision with potential impact in surveillance, human-robot interaction, elderly assistance systems, and gaming, to name a few applications. While there have been significant advancements in this area over the past few years, action recognition in unconstrained settings remains a challenge. There has been research on simplifying the problem by moving from RGB cameras to more sophisticated sensors, such as the Microsoft Kinect, that can localize human body-parts and produce moving 3D skeletons [1]; these skeletons are then used for recognition. Unfortunately, these skeletons are often noisy due to the difficulty in localizing body-parts, self-occlusions, and sensor range errors, thus necessitating higher-order reasoning on these 3D skeletons for action recognition.

Several approaches have been suggested in the recent past to improve recognition performance of actions from such noisy skeletons. These approaches can be mainly divided into two perspectives, namely (i) generative models that assume the skeleton points are produced by a latent dynamic model [2] corrupted by noise and (ii) discriminative approaches that generate compact representations of sequences on which classifiers are trained [3]. Due to the huge configuration space of 3D actions and the unavailability of sufficient training data, discriminative approaches have been the trend in recent years for this problem. In this line of research, the main idea has been to compactly represent the spatio-temporal evolution of 3D skeletons, and later train classifiers on these representations to recognize the actions. Fortunately, there is a definitive structure to motions of 3D joints relative to each other due to the connectivity and length constraints of body-parts. Such constraints have been used to model actions; examples include Lie algebra [4], positive definite matrices [5, 6], a torus manifold [7], Hankelet representations [8], among several others. While modeling actions with explicit manifold assumptions can be useful, it is computationally expensive.

In this paper, we present a novel methodology for action representation from 3D skeleton points that avoids manifold assumptions on the data representation and instead captures the higher-order statistics of how the body-joints relate to each other in a given action sequence. To this end, our scheme combines positive definite kernels and higher-order tensors, with the goal of obtaining rich and compact representations. Our scheme benefits from non-linear kernels such as radial basis functions (RBF), and it can capture higher-order data statistics and the complexity of action dynamics.

We present two such kernel-tensor representations for the task. Our first representation, the sequence compatibility kernel (SCK), captures the spatio-temporal compatibility of body-joints between two sequences. To this end, we present an RBF kernel formulation that jointly captures the spatial and temporal similarity of each body-pose (normalized with respect to the hip position) in a sequence against those in another. We show that tensors generated from third-order outer-products of the linearizations of these kernels form a simple yet powerful representation capturing higher-order co-occurrence statistics of body-parts and yielding high classification confidences.

Our second representation, termed the dynamics compatibility kernel (DCK), aims at representing the spatio-temporal dynamics of each sequence explicitly. We present a novel RBF kernel formulation that captures the similarity between a pair of body-poses in a given sequence explicitly, and then compares it against such body-pose pairs in other sequences. As it might appear, such spatio-temporal modeling could be expensive due to the volumetric nature of space and time. However, we show that an appropriate kernel model can shrink the time-related variable into a small constant-size representation after kernel linearization. With this approach, we can model both spatial and temporal variations in the form of co-occurrences, which would otherwise be prohibitive.

We further show through experiments that the above two representations in fact capture complementary statistics regarding the actions, and combining them leads to significant benefits. We present experiments on three standard datasets for the task, namely (i) UTKinect-Actions [9], (ii) Florence3D-Actions [10], and (iii) MSR-Action3D [11] datasets and demonstrate state-of-the-art accuracy.

To summarize, the main contributions of this paper are (i) the introduction of the sequence and dynamics compatibility kernels for capturing the spatio-temporal evolution of body-joints in 3D skeleton based action sequences, (ii) derivations of the linearizations of these kernels, and (iii) their tensor reformulations. We review the related literature next.

2 Related Work

The problem of skeleton based action recognition has received significant attention over the past decades. Interested readers may refer to useful surveys [3] on the topic. In the sequel, we will review some of the more recent related approaches to the problem.

In this paper, we focus on action recognition datasets that represent a human body as an articulated set of connected body-joints that evolve in time [12]. The temporal evolution of the human skeleton is very informative for action recognition, as shown by Johansson in his seminal experiment involving moving lights displays [13]. At the simplest level, the human body can be represented as a set of 3D points corresponding to body-joints such as the elbow, wrist, knee, and ankle. Action dynamics have been modeled using the motion of such 3D points [14, 15], using joint orientations with respect to a reference axis [16], and even using relative body-joint positions [17, 18]. In contrast, we focus on representing these 3D body-joints by kernels whose linearization results in higher-order tensors capturing complex statistics. Noteworthy are also parts-based approaches that additionally consider the connected body segments [4, 19–21].

Our work also differs from previous works in the way it handles the temporal domain. 3D joint locations are modeled as a temporal hierarchy of coefficients in [14]. Pairwise relative positions of joints were modeled in [17] and combined with a hierarchy of Fourier coefficients to capture the temporal evolution of actions; moreover, this approach uses multiple kernel learning to select discriminative joint combinations. In [18], the relative joint positions and their temporal displacements are modeled with respect to the initial frame. In [4], the displacements and angles between body parts are represented as a collection of matrices belonging to the special Euclidean group SE(3), and the temporal domain is handled by discrete time warping and Fourier temporal pyramid matching on a sequence of such matrices. In contrast, we model the temporal domain with a single RBF kernel providing invariance to local temporal shifts, and avoid expensive techniques such as time warping and multiple kernel learning.

Our scheme also differs from prior works such as kernel descriptors [22] that aggregate orientations of gradients for recognition. Their approach exploits sums over the product of at most two RBF kernels handling two cues, e.g., gradient orientations and spatial locations, which are later linearized by kernel PCA and Nyström techniques. Similarly, convolutional kernel networks [23] consider stacked layers of a variant of kernel descriptors [22]. The kernel trick was utilized for action recognition via kernelized covariances [24], which are obtained in a Nyström-like process. A time series kernel [25] between auto-correlation matrices is proposed to capture spatio-temporal auto-correlations. In contrast, our scheme allows sums over several multiplicative and additive RBF kernels and can thus handle multiple input cues to build a complex representation. We show how to capture higher-order statistics by linearizing a polynomial kernel and, in contrast to the kernel trick, avoid evaluating costly kernels directly.

Third-order tensors have been found useful for several other vision tasks. For example, spatio-temporal third-order tensors on videos are proposed for action analysis in [26], non-negative tensor factorization is used for image denoising in [27], tensor textures are proposed for texture rendering in [28], and higher-order tensors are used for face recognition in [29]. A survey of multi-linear algebraic methods for tensor subspace learning and applications is available in [30]. These applications use a single tensor, while our goal is to use tensors as data descriptors, similar to [31–34] for image recognition tasks. However, in contrast to these methods, we explore the possibility of using third-order representations for 3D action recognition, which poses a different set of challenges.

3 Preliminaries

In this section, we review our notations and the necessary background on shift-invariant kernels and their linearizations, which will be useful for deriving kernels on 3D skeletons for action recognition.

3.1 Tensor Notations

Let \(\varvec{\mathcal {V}}\in \mathbb {R}^{d_1\times d_2\times d_3}\) denote a third-order tensor. Using Matlab style notation, we refer to the p-th slice of this tensor as \(\varvec{\mathcal {V}}_{:,:,p}\), which is a \(d_1\times d_2\) matrix. For a matrix \(\varvec{V}\in \mathbb {R}^{d_1\times d_2}\) and a vector \(\mathbf {v}\in \mathbb {R}^{d_3}\), the notation \(\varvec{\mathcal {V}}=\varvec{V}\uparrow \!\otimes \mathbf {v}\) produces a tensor \(\varvec{\mathcal {V}}\!\in \!\mathbb {R}^{d_1\times d_2\times d_3}\) whose p-th slice is \(\varvec{V}v_p\), \(v_p\) being the p-th dimension of \(\mathbf {v}\). Symmetric third-order tensors of rank one are formed by the outer product of a vector \(\mathbf {v}\in \mathbb {R}^{d}\) in modes two and three; that is, a rank-one \(\varvec{\mathcal {V}}\in \mathbb {R}^{d\times d\times d}\) is obtained from \(\mathbf {v}\) as \(\varvec{\mathcal {V}}={\uparrow \!\otimes }_3\mathbf {v}\!\triangleq \!(\mathbf {v}\mathbf {v}^T)\uparrow \!\otimes \mathbf {v}\). Concatenation of n tensors in mode k is denoted as \(\left[ \varvec{\mathcal {V}}_i\right] _{i\in \mathcal {I}_{n}}^{\oplus _k}\), where \(\mathcal {I}_{n}\) is the index sequence \(1,2,..., n\). The Frobenius norm of a tensor is given by \(\left\| {\varvec{\mathcal {V}}}\right\| _F = \sqrt{\sum _{i,j,k} \mathcal {V}_{ijk}^2}\), where \(\mathcal {V}_{ijk}\) represents the ijk-th element of \(\varvec{\mathcal {V}}\). Similarly, the inner-product between two tensors \(\varvec{\mathcal {X}}\) and \(\varvec{\mathcal {Y}}\) is given by \(\left\langle \varvec{\mathcal {X}},\varvec{\mathcal {Y}}\right\rangle =\sum _{ijk}\mathcal {X}_{ijk}\mathcal {Y}_{ijk}\).
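To make the notation concrete, the \(\uparrow \!\otimes \) and \({\uparrow \!\otimes }_3\) operations can be sketched in a few lines of NumPy (a sketch for illustration; the function names are ours, not from the paper):

```python
import numpy as np

def up_otimes(V, v):
    """V 'up-otimes' v: the p-th slice of the result equals V * v[p]."""
    return V[:, :, None] * v[None, None, :]

def up_otimes_3(v):
    """'up-otimes'_3 v = (v v^T) up-otimes v: a rank-one super-symmetric third-order tensor."""
    return up_otimes(np.outer(v, v), v)

v = np.array([1.0, 2.0, 3.0])
T = up_otimes_3(v)                     # T[i, j, k] = v[i] * v[j] * v[k]
# Super-symmetry: entries are invariant to permutations of the indexes.
assert np.allclose(T, np.transpose(T, (2, 1, 0)))
# For a rank-one tensor, the Frobenius norm equals ||v||^3.
assert np.isclose(np.linalg.norm(T), np.linalg.norm(v) ** 3)
```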

3.2 Kernel Linearization

Let \(G_{\sigma }(\mathbf {u}-\mathbf {\bar{u}})=\exp (-\left\| {\mathbf {u}- \mathbf {\bar{u}}}\right\| _2^2/{2\sigma ^2})\) denote a standard Gaussian RBF kernel centered at \(\mathbf {\bar{u}}\) and having a bandwidth \(\sigma \). Kernel linearization refers to rewriting this \(G_{\sigma }\) as an inner-product of two (infinite-dimensional) feature maps. To obtain these maps, we use a fast approximation method based on probability product kernels [35]. Specifically, we express \(G_{\sigma }\) as the inner product of two \(d'\)-dimensional isotropic Gaussians centered at \(\mathbf {u}\) and \(\mathbf {\bar{u}}\in \mathbb {R}^{d'}\), respectively:

$$\begin{aligned}&G_{\sigma }\!\left( \mathbf {u}-\mathbf {\bar{u}}\right) =\left( \frac{2}{\pi \sigma ^2}\right) ^{\!\!\frac{d'}{2}}\!\!\!\!\!\!\int \limits _{\varvec{\zeta }\in \mathbb {R}^{d'}}G_{\sigma /\sqrt{2}}\left( \mathbf {u}-\varvec{\zeta }\right) G_{\sigma /\sqrt{2}}(\mathbf {\bar{u}}-\varvec{\zeta })\,\mathrm {d}\varvec{\zeta }. \end{aligned}$$
(1)

Equation (1) is then approximated by replacing the integral with the sum over Z pivots \(\varvec{\zeta }_1,...,\varvec{\zeta }_Z\), thus writing a feature map \(\varvec{\phi }\) as:

$$\begin{aligned}&\varvec{\phi }(\mathbf {u})=\left[ {G}_{\sigma /\sqrt{2}}(\mathbf {u}-\varvec{\zeta }_1),...,{G}_{\sigma /\sqrt{2}}(\mathbf {u}-\varvec{\zeta }_Z)\right] ^T,\end{aligned}$$
(2)
$$\begin{aligned} \text { and }&G_{\sigma }(\mathbf {u}-\mathbf {\bar{u}})\approx \left\langle \sqrt{c}\varvec{\phi }(\mathbf {u}), \sqrt{c}\varvec{\phi }(\mathbf {\bar{u}})\right\rangle , \end{aligned}$$
(3)

where c represents a constant. We refer to (3) as the linearization of the RBF kernel.
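Equations (1)–(3) can be checked numerically in one dimension. The sketch below builds the feature map \(\varvec{\phi }\) from Eq. (2) with pivots on a uniform grid; the specific choice of pivots, and taking c as the constant of Eq. (1) times the grid step (a Riemann-sum approximation of the integral), are our assumptions for illustration:

```python
import numpy as np

def gaussian(u, ubar, sigma):
    """Standard Gaussian RBF kernel G_sigma(u - ubar)."""
    return np.exp(-np.sum((u - ubar) ** 2) / (2.0 * sigma ** 2))

def feature_map(u, pivots, sigma):
    """phi(u) of Eq. (2): responses of G_{sigma/sqrt(2)} at the Z pivots."""
    return np.array([gaussian(u, z, sigma / np.sqrt(2.0)) for z in pivots])

sigma, Z = 0.5, 20
pivots = np.linspace(-1.0, 1.0, Z).reshape(-1, 1)  # pivots covering the data range
delta = pivots[1, 0] - pivots[0, 0]
# Constant of Eq. (1) times the grid step (our Riemann-sum choice for c).
c = np.sqrt(2.0 / (np.pi * sigma ** 2)) * delta

u, ubar = np.array([0.3]), np.array([0.45])
approx = c * feature_map(u, pivots, sigma) @ feature_map(ubar, pivots, sigma)
exact = gaussian(u, ubar, sigma)
assert abs(approx - exact) < 0.02  # close when pivots are dense w.r.t. sigma
```

The approximation tightens as the pivots become denser relative to the bandwidth \(\sigma \).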

4 Proposed Approach

In this section, we first formulate the problem of action recognition from 3D skeleton sequences, then present our two kernel formulations for describing the actions, followed by their tensor reformulations through kernel linearization.

4.1 Problem Formulation

Suppose we are given a set of 3D human pose skeleton sequences, each pose consisting of J body-keypoints. Further, to simplify our notations, we assume each sequence consists of N skeletons, one per frame (see footnote 1). Mathematically, we can define such a pose sequence \(\varPi \) as:

$$\begin{aligned} \varPi = \left\{ \mathbf {x}_{is}\in \mathbb {R}^{3},i\in \mathcal {I}_{J}, s\in \mathcal {I}_{N}\right\} . \end{aligned}$$
(4)

Further, let each such sequence \(\varPi \) be associated with one of K action class labels \(\ell \in \mathcal {I}_{K}\). Our goal is to use the skeleton sequence \(\varPi \) and generate an action descriptor for this sequence that can be used in a classifier for recognizing the action class. In the following, we will present two such action descriptors, namely (i) sequence compatibility kernel and (ii) dynamics compatibility kernel, which are formulated using the ideas of kernel linearization and tensor algebra. We present both these kernel formulations next.
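In code, a sequence \(\varPi \) from Eq. (4) is simply a \(J\times N\times 3\) array of joint coordinates with an action label. The sketch below also applies the hip-centering that SCK assumes in Sect. 4.2 (the array shapes and the hip index are our choices for illustration):

```python
import numpy as np

# A sequence Pi as in Eq. (4): J body-joints tracked over N frames, each a 3D point.
J, N, K = 15, 40, 9
rng = np.random.default_rng(0)
Pi = rng.standard_normal((J, N, 3))   # Pi[i, s] = x_{is}
label = int(rng.integers(0, K))       # action class label in I_K

# Centralize each pose with respect to a reference joint (say, the hip at index 0),
# as assumed for SCK.
Pi_centered = Pi - Pi[0:1, :, :]
assert Pi_centered.shape == (J, N, 3)
assert np.allclose(Pi_centered[0], 0.0)  # the reference joint sits at the origin
```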

Fig. 1.

Figures (a) and (b) show how SCK works – kernel \(G_{\sigma _2}\) compares exhaustively e.g. hand-related joint i for every frame in sequence A with every frame in sequence B. Kernel \(G_{\sigma _3}\) compares exhaustively the frame indexes. Figure (c) shows this burden is avoided by linearization – third-order statistics on feature maps \(\varvec{\phi }(\mathbf {x}_{is})\) and \(\mathbf {z}(s)\) for joint i are captured in tensor \(\varvec{\mathcal {X}}_i\) and whitened by EPN to obtain \(\varvec{\mathcal {V}}_i\) which are concatenated over \(i\!=\!1,...,J\) to represent a sequence.

4.2 Sequence Compatibility Kernel

As alluded to earlier, the main idea of this kernel is to measure the compatibility between two action sequences in terms of the similarity between their skeletons and their temporal order. To this end, we assume each skeleton is centralized with respect to one of the body-joints (say, the hip). Suppose we are given two such sequences \(\varPi _A\) and \(\varPi _B\), each with J joints and N frames. Further, let \(\mathbf {x}_{is}\!\in \!\mathbb {R}^{3}\) and \(\mathbf {y}_{jt}\!\in \!\mathbb {R}^{3}\) correspond to the body-joint coordinates of \(\varPi _A\) and \(\varPi _B\), respectively. We define our sequence compatibility kernel (SCK) between \(\varPi _A\) and \(\varPi _B\) as (see footnote 1):

$$\begin{aligned}&K_S(\varPi _A,\varPi _B) = \frac{1}{\varLambda }\!\!\! \sum \limits _{(i,s)\in \mathcal {J}}\sum \limits _{(j,t)\in \mathcal {J}}\!G_{\sigma _1}(i-j)\Big (\beta _1 G_{\sigma _2}\!\left( \mathbf {x}_{is} - \mathbf {y}_{jt}\right) + \beta _2\, G_{\sigma _3}(\frac{s-t}{N})\Big )^r, \end{aligned}$$
(5)

where \(\varLambda \) is a normalization constant and \(\mathcal {J}=\mathcal {I}_{J}\times \mathcal {I}_{N}\). As is clear, this kernel involves three different compatibility subkernels, namely (i) \(G_{\sigma _1}\), which captures the compatibility between joint-types i and j, (ii) \(G_{\sigma _2}\), which captures the compatibility between joint locations \(\mathbf {x}\) and \(\mathbf {y}\), and (iii) \(G_{\sigma _3}\), which measures the temporal alignment of two poses in the sequences. We also introduce weighting factors \(\beta _1,\beta _2\ge 0\) that adjust the importance of the body-joint compatibility against the temporal alignment, where \(\beta _1+\beta _2=1\). Figures 1a and b illustrate how this kernel works. One may wonder why we need the kernel \(G_{\sigma _1}\). Note that our skeletons may be noisy and some of the keypoints may be detected incorrectly (for example, elbows and wrists); this kernel allows incorporating some degree of uncertainty into the alignment of such joints. To simplify our formulations, in this paper, we assume that such errors are absent from our skeletons, and thus \(G_{\sigma _1}(i-j)=\delta (i-j)\). Further, the standard deviations \(\sigma _2\) and \(\sigma _3\) control the joint-coordinate selectivity and the temporal shift-invariance, respectively. That is, for \(\sigma _3\rightarrow 0\), two sequences will have to match perfectly in the temporal sense, while for \(\sigma _3\rightarrow \infty \), the kernel is invariant to any permutation of the frames. As will be clear in the sequel, the parameter r determines the order of the statistics captured by the kernel (we use \(r=3\)).
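A direct (non-linearized) evaluation of Eq. (5) with the simplification \(G_{\sigma _1}(i-j)=\delta (i-j)\) can be sketched as follows; the normalization \(\varLambda =(JN)^2\) is our choice for this sketch, not fixed by the paper:

```python
import numpy as np

def g(x, sigma):
    """G_sigma for a scalar or a vector argument."""
    return np.exp(-np.sum(np.atleast_1d(x) ** 2) / (2.0 * sigma ** 2))

def sck(A, B, sig2, sig3, beta1=0.5, beta2=0.5, r=3):
    """Direct evaluation of Eq. (5) with G_{sigma_1} = delta(i - j),
    so only matching joint types are compared. A, B have shape (J, N, 3)."""
    J, N, _ = A.shape
    K = 0.0
    for i in range(J):
        for s in range(N):
            for t in range(N):
                K += (beta1 * g(A[i, s] - B[i, t], sig2)
                      + beta2 * g((s - t) / N, sig3)) ** r
    return K / (J * N) ** 2   # Lambda = (J*N)^2 is our normalization choice

rng = np.random.default_rng(1)
A, B = rng.standard_normal((5, 8, 3)), rng.standard_normal((5, 8, 3))
assert np.isclose(sck(A, B, 1.0, 0.5), sck(B, A, 1.0, 0.5))  # kernel symmetry
assert sck(A, A, 1.0, 0.5) > 0.0                             # positivity
```

This brute-force form costs \(\mathcal {O}(JN^2)\) per sequence pair; the linearization below removes that pairwise dependence.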

Next, we linearize our kernel using the method of Sect. 3.2 and Eq. (3), so that \(G_{\sigma _2}(\mathbf {x}-\mathbf {y})\approx \phi (\mathbf {x})^T\phi (\mathbf {y})\) (see footnote 2) while \(G_{\sigma _3}(\frac{s-t}{N})\approx \mathbf {z}(s/N)^T\mathbf {z}(t/N)\). With these approximations and the simplification of \(G_{\sigma _1}\!\) described above, we can rewrite our sequence compatibility kernel as:

$$\begin{aligned} K_S(\varPi _A,\varPi _B)&= \frac{1}{\varLambda }\!\!\sum \limits _{i\in \mathcal {I}_{J}}\sum \limits _{s\in \mathcal {I}_{N}}\!\sum \limits _{t\in \mathcal {I}_{N}} \left( \begin{bmatrix} \sqrt{\beta _1}\,\varvec{\phi }(\mathbf {x}_{is})\\ \sqrt{\beta _2}\,\mathbf {z}(s/N)\\[3pt] \end{bmatrix}^T\!\!\!\cdot \begin{bmatrix} \sqrt{\beta _1}\varvec{\phi }(\mathbf {y}_{it})\\ \sqrt{\beta _2}\mathbf {z}(t/N)\\[3pt] \end{bmatrix}\right) ^r\end{aligned}$$
(6)
$$\begin{aligned}&=\frac{1}{\varLambda }\!\!\sum \limits _{i\in \mathcal {I}_{J}}\sum \limits _{s\in \mathcal {I}_{N}}\!\sum \limits _{t\in \mathcal {I}_{N}}\left\langle {\uparrow \!\otimes }_r\!\begin{bmatrix} \sqrt{\beta _1}\,\varvec{\phi }(\mathbf {x}_{is})\\ \sqrt{\beta _2}\,\mathbf {z}(s/N)\\[3pt] \end{bmatrix}\!, {\uparrow \!\otimes }_r\!\begin{bmatrix} \sqrt{\beta _1}\varvec{\phi }(\mathbf {y}_{it})\\ \sqrt{\beta _2}\mathbf {z}(t/N)\\[3pt] \end{bmatrix}\right\rangle \end{aligned}$$
(7)
$$\begin{aligned}&=\sum \limits _{i\in \mathcal {I}_{J}}\left\langle \!\frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{s\in \mathcal {I}_{N}}\!\!{\uparrow \!\otimes }_r\!\begin{bmatrix} \sqrt{\beta _1}\,\varvec{\phi }(\mathbf {x}_{is})\\ \sqrt{\beta _2}\mathbf {z}(s/N)\\[3pt] \end{bmatrix}\!, \frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{t\in \mathcal {I}_{N}}\!\!{\uparrow \!\otimes }_r\!\begin{bmatrix} \sqrt{\beta _1}\varvec{\phi }(\mathbf {y}_{it})\\ \sqrt{\beta _2}\mathbf {z}(t/N)\\[3pt] \end{bmatrix}\right\rangle . \end{aligned}$$
(8)

As is clear, (8) expresses \(K_S(\varPi _A,\varPi _B)\) as a sum of inner-products on third-order tensors (\(r=3\)). This is illustrated by Fig. 1c. While using the dot-product as the inner-product is a possibility, there are much richer alternatives for tensors of order \(r\ge 2\) that can exploit their structure or manipulate the higher-order statistics inherent in them, thus leading to better representations. An example of such a commonly encountered property is the so-called burstiness [36], the property that a given feature appears more often in a sequence than a statistically independent model would predict. A robust sequence representation should be invariant to the length of actions, e.g., a prolonged hand wave represents the same action as a short one; the same is true for short versus repeated head nodding. Eigenvalue Power Normalization (EPN) [32] is known to suppress burstiness by acting on the higher-order statistics illustrated in Fig. 1c. Incorporating EPN, we generalize (8) as:

$$\begin{aligned}&\!K_S^{*}({\varPi _A},{\varPi _B})=\sum \limits _{i\in \mathcal {I}_{J}}\left\langle \!\varvec{\mathcal {G}}\left( \frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{s\in \mathcal {I}_{N}}\!\!{\uparrow \!\otimes }_r\!\!\begin{bmatrix} \!\sqrt{\beta _1}\varvec{\phi }(\mathbf {x}_{is})\\ \sqrt{\beta _2}\mathbf {z}(s/N)\\[3pt] \end{bmatrix}\right) ,\varvec{\mathcal {G}}\left( \frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{t\in \mathcal {I}_{N}}\!\!{\uparrow \!\otimes }_r\!\!\begin{bmatrix} \sqrt{\beta _1}\varvec{\phi }(\mathbf {y}_{it})\\ \sqrt{\beta _2}\mathbf {z}(t/N)\\[3pt] \end{bmatrix}\right) \right\rangle , \end{aligned}$$
(9)

where the operator \(\varvec{\mathcal {G}}\) performs EPN by applying power normalization to the spectrum of the third-order tensor (by taking the higher-order SVD). Note that in general \(K_S^{*}({\varPi _A},{\varPi _B})\!\not \approx \!K_S({\varPi _A},{\varPi _B})\) as \(\varvec{\mathcal {G}}\) is intended to manipulate the spectrum of \(\varvec{\mathcal {X}}\). The final representation, for instance for a sequence \({\varPi _A}\), takes the following form:

$$\begin{aligned}&\varvec{\mathcal {V}}_i=\varvec{\mathcal {G}}\left( \varvec{\mathcal {X}}_i\right) ,\text { where } \varvec{\mathcal {X}}_i=\frac{1}{\sqrt{\varLambda }}\!\!\!\sum \limits _{s\in \mathcal {I}_{N}}\!\!\!{\uparrow \!\otimes }_r \begin{bmatrix} \!\sqrt{\beta _1}\,\varvec{\phi }(\mathbf {x}_{is})\\ \sqrt{\beta _2}\mathbf {z}(s/N)\\[3pt] \end{bmatrix}. \end{aligned}$$
(10)

We can further replace the summation over the body-joint indexes in (9) by concatenating \(\varvec{\mathcal {V}}_i\) in (10) along the fourth tensor mode, thus defining \(\varvec{\mathcal {V}}= \big [\varvec{\mathcal {V}}_i\big ]_{i\in \mathcal {I}_{J}}^{\oplus _4}\). Suppose \(\varvec{\mathcal {V}}_A\) and \(\varvec{\mathcal {V}}_B\) are the corresponding fourth order tensors for \(\varPi _A\) and \(\varPi _B\) respectively. Then, we obtain:

$$\begin{aligned}&K_S^{*}({\varPi _A},{\varPi _B})=\left\langle \varvec{\mathcal {V}}_A, \varvec{\mathcal {V}}_B\right\rangle . \end{aligned}$$
(11)

Note that the tensors \(\varvec{\mathcal {X}}\) have the following properties: (i) super-symmetry, \(\varvec{\mathcal {X}}_{i,j,k}=\varvec{\mathcal {X}}_{\pi (i,j,k)}\) for indexes ijk and any permutation \(\pi \), and (ii) positive semi-definiteness of every slice, that is, \(\varvec{\mathcal {X}}_{:,:,s}\!\in \!\mathcal {S}_{+}^{d}\) for \(s\!\in \!\mathcal {I}_{d}\). Therefore, we need to use only the upper-simplex of the tensor, which consists of \(\left( {\begin{array}{c}d+r-1\\ r\end{array}}\right) \) coefficients (the total size of our final representation) rather than \(d^r\!\), where d is the side-dimension of \(\varvec{\mathcal {X}}\), i.e., \(d=3Z_2\!+\!Z_3\) (see footnote 2), and \(Z_2\) and \(Z_3\) are the numbers of pivots used in the approximation of \(G_{\sigma _2}\) (see footnote 2) and \(G_{\sigma _3}\), respectively. As we want to preserve the above listed properties in the tensors \(\varvec{\mathcal {V}}\), we employ slice-wise EPN, which is induced by the Power-Euclidean distance and involves raising matrices to a power \(\gamma \). Finally, we re-stack these slices along the third mode as:

$$\begin{aligned}&\varvec{\mathcal {G}}\left( \varvec{\mathcal {X}}\right) =[\varvec{\mathcal {X}}_{:,:,s}^{\gamma }]_{s\in \mathcal {I}_{d}}^{\oplus _3}, \text { for } 0\!<\gamma \!\le \!1. \end{aligned}$$
(12)

This \( \varvec{\mathcal {G}}\left( \varvec{\mathcal {X}}\right) \) forms our tensor representation for the action sequence.
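The slice-wise EPN of Eq. (12) reduces to a matrix power on each PSD slice, computed through the eigendecomposition. A minimal NumPy sketch (the toy tensor below is built from nonnegative feature vectors, which is what makes every slice PSD; sizes are our choices):

```python
import numpy as np

def matrix_power_psd(M, gamma):
    """Raise a symmetric PSD matrix to the power gamma via its eigendecomposition."""
    w, U = np.linalg.eigh(M)
    w = np.maximum(w, 0.0)               # guard against tiny negative eigenvalues
    return (U * w ** gamma) @ U.T

def slice_wise_epn(X, gamma=0.5):
    """Eq. (12): apply the matrix power to every slice X[:, :, s] and restack."""
    return np.stack([matrix_power_psd(X[:, :, s], gamma)
                     for s in range(X.shape[2])], axis=2)

# Toy super-symmetric tensor accumulated from nonnegative feature vectors,
# mimicking the sum of rank-one terms in Eq. (10).
rng = np.random.default_rng(2)
d, N = 6, 30
X = np.zeros((d, d, d))
for _ in range(N):
    v = rng.random(d)                    # nonnegative, like RBF feature maps
    X += np.outer(v, v)[:, :, None] * v[None, None, :]
X /= np.sqrt(N)

G = slice_wise_epn(X, gamma=0.5)
for s in range(d):
    assert np.allclose(G[:, :, s], G[:, :, s].T)            # symmetry preserved
    assert np.all(np.linalg.eigvalsh(G[:, :, s]) >= -1e-9)  # PSD preserved
```

With \(\gamma <1\) the power flattens the eigenvalue spectrum of each slice, which is precisely the burstiness suppression discussed above.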

Fig. 2.

Figure (a) shows that kernel \(G_{\sigma '_2}\) in DCK captures spatio-temporal dynamics by measuring displacement vectors from any given body-joint to remaining joints spatially- and temporally-wise (i.e. see dashed lines). Figure (b) shows that comparisons performed by \(G_{\sigma '_2}\) for any selected two joints are performed all-against-all temporally-wise which is computationally expensive. Figure (c) shows the encoding steps in the proposed linearization which overcome this burden.

4.3 Dynamics Compatibility Kernel

The SCK kernel that we described above captures the inter-sequence alignment, while the intra-sequence spatio-temporal dynamics is lost. In order to capture these temporal dynamics, we propose a novel dynamics compatibility kernel (DCK). To this end, we use the absolute coordinates of the joints in our kernel. Using the notations from the earlier section, for two action sequences \(\varPi _A\) and \(\varPi _B\), we define this kernel as:

$$\begin{aligned} K_D({\varPi _A},{\varPi _B})=&\ \frac{1}{\varLambda }\!\!\!\!\!\sum \limits _{\begin{array}{c} (i,s)\in \mathcal {J}\!,\\ (i',s')\in \mathcal {J}\!,\\ i'\!\!\ne \!i\!,s'\!\!\ne \!s \end{array}}\sum \limits _{\begin{array}{c} (\!j,t)\in \mathcal {J}\!,\\ (\!j'\!\!,t'\!)\in \mathcal {J},\\ j'\!\!\ne \!j\!,t'\!\!\ne \!t \end{array}}\!\!\!\!G'_{\sigma '_1}(i-j\!, i'-j'\!)\,G_{\sigma '_2}\left( \left( \mathbf {x}_{is}-\mathbf {x}_{i's'}\!\right) -\left( \mathbf {y}_{jt}-\mathbf {y}_{j't'}\right) \right) \nonumber \\[-16pt]&\cdot G'_{\sigma '_3}(\frac{s-t}{N},\!\frac{s'-t'}{N})\,G'_{\sigma '_4}(s-s'\!,t-t'\!), \end{aligned}$$
(13)

where \(G'_{\sigma }(\varvec{\alpha },\varvec{\beta })=G_{\sigma }(\varvec{\alpha })G_{\sigma }(\varvec{\beta })\). In comparison to the SCK kernel in (5), the DCK kernel uses the intra-sequence joint differences, thus capturing the dynamics; these dynamics are then compared to those of the other sequences. Figures 2a–c depict schematically how this kernel captures co-occurrences. As in SCK, the first kernel \(G'_{\sigma '_1}\) is used to capture sensor uncertainty in body-keypoint detection, and is assumed to be a delta function in this paper. The second kernel \(G_{\sigma '_2}\) models the spatio-temporal co-occurrences of the body-joints. The temporal alignment kernel \(G'_{\sigma '_3}\) encodes the temporal start and end points \((s,s'\!)\) and \((t,t'\!)\). Finally, \(G'_{\sigma '_4}\) limits contributions of dynamics between temporal points that are distant from each other, i.e., if \(s'\!\gg \!s\) or \(t'\!\gg \!t\) and \(\sigma '_4\) is small. Furthermore, similar to SCK, the standard deviations \(\sigma '_2\) and \(\sigma '_3\) control the selectivity over the spatio-temporal dynamics of body-joints and their temporal shift-invariance for the start and end points, respectively. As discussed for SCK, the practical extensions described in footnotes 1 and 2 apply to DCK as well.

As in the previous section, we employ linearization to this kernel. Following the derivations described above, it can be shown that the linearized kernel has the following form (see [37] or supplementary material for details):

$$\begin{aligned} \!K_D(\varPi _A,\varPi _B) =&\sum \limits _{\begin{array}{c} i\in \mathcal {I}_{J}\!,\\ i'\!\in \mathcal {I}_{J}\!:\\ i'\!\ne i \end{array}}\!\! \left\langle \!\frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{\begin{array}{c} s\in \mathcal {I}_{N}\!,\\ s'\!\!\in \mathcal {I}_{N}\!:\\ s'\!\!\ne \!s \end{array}}\!\! G_{\sigma '_4}(s-s'\!)\left( \varvec{\phi }(\mathbf {x}_{is}-\mathbf {x}_{i's'}) \!\cdot \!\mathbf {z}\big (\frac{s}{N}\big )^T\!\right) \!\uparrow \!\otimes \mathbf {z}\big (\frac{s'\!}{N}\big )\!\right. ,\\[-16pt]&\quad \left. \!\frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{\begin{array}{c} t\in \mathcal {I}_{N}\!,\\ t'\!\!\in \mathcal {I}_{N}\!:\\ t'\!\!\ne \!t \end{array}}\!\! G_{\sigma '_4}(t-t'\!)\Big ( \varvec{\phi }(\mathbf {y}_{it}-\mathbf {y}_{i't'}) \!\cdot \!\mathbf {z}\big (\frac{t}{N}\big )^T\!\Big )\!\uparrow \!\otimes \mathbf {z}\big (\frac{t'\!}{N}\big )\!\right\rangle \!.\nonumber \end{aligned}$$
(14)
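The per-joint-pair summand of (14) can be accumulated directly once the feature maps are fixed. Below is a minimal NumPy sketch of that accumulation; the pivot-based feature maps, the bandwidths, and the normalization \(\varLambda =N^2\) are our assumptions for illustration:

```python
import numpy as np

def rbf_map(u, pivots, sigma):
    """Pivot-based feature map of Sect. 3.2; the sigma/sqrt(2) bandwidth of
    Eq. (2) yields the exp(-||.||^2 / sigma^2) form used here."""
    return np.exp(-np.sum((pivots - u) ** 2, axis=1) / sigma ** 2)

def dck_pair_tensor(xi, xip, piv_xyz, piv_t, sig2, sig3, sig4):
    """Accumulate the third-order tensor for one joint pair (i, i'),
    following the left factor of Eq. (14); xi, xip are (N, 3) trajectories."""
    N = xi.shape[0]
    Z2, Z3 = piv_xyz.shape[0], piv_t.shape[0]
    X = np.zeros((Z2, Z3, Z3))
    for s in range(N):
        zs = rbf_map(np.array([s / N]), piv_t, sig3)
        for sp in range(N):
            if sp == s:
                continue
            g4 = np.exp(-((s - sp) ** 2) / (2.0 * sig4 ** 2))
            phi = rbf_map(xi[s] - xip[sp], piv_xyz, sig2)   # phi(x_{is} - x_{i's'})
            zsp = rbf_map(np.array([sp / N]), piv_t, sig3)
            # (phi z(s/N)^T) up-otimes z(s'/N), weighted by G_{sigma'_4}(s - s')
            X += g4 * np.outer(phi, zs)[:, :, None] * zsp[None, None, :]
    return X / N   # 1/sqrt(Lambda) with Lambda = N^2 (our choice)

rng = np.random.default_rng(3)
N, Z2, Z3 = 10, 8, 4
piv_xyz = rng.standard_normal((Z2, 3))
piv_t = np.linspace(0.0, 1.0, Z3).reshape(-1, 1)
X = dck_pair_tensor(rng.standard_normal((N, 3)), rng.standard_normal((N, 3)),
                    piv_xyz, piv_t, sig2=1.0, sig3=0.5, sig4=3.0)
assert X.shape == (Z2, Z3, Z3)   # constant size, independent of N
```

Note that the resulting tensor size \(Z_2\times Z_3\times Z_3\) does not grow with N, which is the point of the linearization.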

Equation (14) expresses \(K_D({\varPi _A},{\varPi _B})\) as a sum of inner-products on non-symmetric third-order tensors (cf. Sect. 4.2, where the proposed kernel results in an inner-product between super-symmetric tensors). However, we can decompose each of these tensors with a variant of EPN, which involves the Higher Order Singular Value Decomposition (HOSVD), into factors stored in the so-called core tensor, and equalize the contributions of these factors. Intuitively, this prevents bursts in the statistically captured spatio-temporal co-occurrence dynamics of actions. For example, a long hand wave and a short one yield different temporal statistics, that is, the prolonged action results in bursts; the representation for action recognition, however, should be invariant to such cases. As in the previous section, we introduce a non-linear operator \(\varvec{\mathcal {G}}\) into Eq. (14) to handle this. Our final representation, for example for sequence \({\varPi _A}\), can be expressed as:

$$\begin{aligned}&\!\!\!\!\!\varvec{\mathcal {V}}_{ii'\!}=\varvec{\mathcal {G}}\left( \varvec{\mathcal {X}}_{ii'\!}\right) \!,\!\text { and }\varvec{\mathcal {X}}_{ii'\!}=\frac{1}{\sqrt{\varLambda }}\!\!\sum \limits _{\begin{array}{c} s\in \mathcal {I}_{N}\!,\\ s'\!\!\in \mathcal {I}_{N}\!:\\ s'\!\!\ne \!s \end{array}}\!\! G_{\sigma '_4}(s-s'\!)\left( \varvec{\phi }(\mathbf {x}_{is}-\mathbf {x}_{i's'}) \!\cdot \!\mathbf {z}\big (\frac{s}{N}\big )^T\!\right) \!\uparrow \!\otimes \mathbf {z}\big (\frac{s'\!}{N}\big ),\!\! \end{aligned}$$
(15)

where the summation over the pairs of body-joint indexes in (14) becomes equivalent to the concatenation of \(\varvec{\mathcal {V}}_{ii'}\!\) from (15) along the fourth mode, such that we obtain tensor representations \(\big [\varvec{\mathcal {V}}_{ii'\!}\big ]_{i>i'\!:\,i,i'\in \mathcal {I}_{J}}^{\oplus _4}\!\) for sequence \({\varPi _A}\) and \(\big [\varvec{\mathcal {\bar{V}}}_{ii'\!}\big ]_{i>i'\!:\,i,i'\in \mathcal {I}_{J}}^{\oplus _4}\!\) for sequence \({\varPi _B}\). The dot-product can now be applied between these representations to compare them. For the operator \(\varvec{\mathcal {G}}\), we choose HOSVD-based tensor whitening as proposed in [32]. However, [32] works with super-symmetric tensors, such as the one we proposed in Sect. 4.2, whereas (15) involves the general non-symmetric case, for which we use the following operator \(\varvec{\mathcal {G}}\):

$$\begin{aligned}&{\left( \varvec{\mathcal {E}}; \varvec{A}_1,...,\varvec{A}_r\right) }=HOSVD(\varvec{\mathcal {X}})\end{aligned}$$
(16)
$$\begin{aligned}&\varvec{\mathcal {\hat{E}}}=Sgn\left( \varvec{\mathcal {E}}\right) \!\,\left| \!\,\varvec{\mathcal {E}}\right| ^{\gamma }\end{aligned}$$
(17)
$$\begin{aligned}&\varvec{\mathcal {\hat{V}}}=((\varvec{\mathcal {\hat{E}}}\otimes _{1}\!\varvec{A}_1)\,...)\otimes _{r}\!\varvec{A}_r\end{aligned}$$
(18)
$$\begin{aligned}&\varvec{\mathcal {G}}(\varvec{\mathcal {X}})=Sgn(\varvec{\mathcal {\hat{V}}})\,|\!\varvec{\mathcal {\hat{V}}}|^{\gamma ^{*}} \end{aligned}$$
(19)

In the above equations, we distinguish the core tensor \(\varvec{\mathcal {E}}\) and its power normalized variant \(\varvec{\mathcal {\hat{E}}}\), whose factors are evened out by raising them to the power \(0<\gamma \!\le \!1\), the eigenvalue matrices \(\varvec{A}_1,...,\varvec{A}_r\), and the operation \(\otimes _r\), which denotes the tensor-matrix product in mode r. We refer the reader to [32] for a detailed description of the above steps.
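The pipeline of Eqs. (16)–(19) can be sketched with a plain NumPy HOSVD, where the factor matrices are the left singular vectors of the mode unfoldings (a sketch under that standard HOSVD construction; function names are ours):

```python
import numpy as np

def unfold(X, mode):
    """Mode-k unfolding of a third-order tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, A, mode):
    """Tensor-matrix product in the given mode."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, mode)), 0, mode)

def hosvd_epn(X, gamma=0.5, gamma_star=0.5):
    """Eqs. (16)-(19): HOSVD, power-normalize the core, rebuild with the
    factors, then apply a final element-wise signed power."""
    A = [np.linalg.svd(unfold(X, k), full_matrices=False)[0] for k in range(3)]
    E = X
    for k in range(3):                        # Eq. (16): core E = X x_k A_k^T
        E = mode_product(E, A[k].T, k)
    E_hat = np.sign(E) * np.abs(E) ** gamma   # Eq. (17)
    V_hat = E_hat
    for k in range(3):                        # Eq. (18): rebuild with the factors
        V_hat = mode_product(V_hat, A[k], k)
    return np.sign(V_hat) * np.abs(V_hat) ** gamma_star  # Eq. (19)

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 5, 6))
G = hosvd_epn(X)
assert G.shape == X.shape
# With gamma = gamma_star = 1 the operator reduces to the identity.
assert np.allclose(hosvd_epn(X, 1.0, 1.0), X)
```

With \(\gamma <1\) the signed power evens out the core coefficients, which is the factor equalization described above.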

5 Computational Complexity

Non-linearized SCK with kernel SVM has complexity \(\mathcal {O}(JN^2T^\rho )\) given J body joints, N frames per sequence, T sequences, and \(2<\rho <3\), the complexity exponent of kernel SVM. Linearized SCK with linear SVM takes \(\mathcal {O}(JNTZ_*^r)\) for a total of \(Z_*\) pivots and tensor order \(r=3\). Note that \(N^2T^\rho \!\gg \!NTZ_*^r\). For \(N=50\) and \(Z_*=20\), this is 3.5\(\times \) (or 32\(\times \)) faster than the exact kernel for \(T=557\) (or \(T=5000\)) used in our experiments. Non-linearized DCK with kernel SVM has complexity \(\mathcal {O}(J^2N^4T^\rho )\), while linearized DCK takes \(\mathcal {O}(J^2N^2TZ^3)\) for Z pivots per kernel, e.g., \(Z=Z_2=Z_3\) given \(G_{\sigma '_2}\) and \(G_{\sigma '_3}\). As \(N^4T^\rho \!\gg \!N^2TZ^3\), the linearization is approximately \(11000\times \) faster than the exact kernel for, say, \(Z=5\). Note that EPN incurs negligible cost (see [37] for details).
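A back-of-the-envelope check of the quoted speedups, assuming \(\rho =2\) (the lower end of the \(2<\rho <3\) range; the exact quoted figures depend on the value of \(\rho \)):

```python
def sck_speedup(N, T, Z, rho=2.0):
    """Ratio of non-linearized O(J N^2 T^rho) to linearized O(J N T Z^3) SCK cost."""
    return (N ** 2 * T ** rho) / (N * T * Z ** 3)

def dck_speedup(N, T, Z, rho=2.0):
    """Ratio of non-linearized O(J^2 N^4 T^rho) to linearized O(J^2 N^2 T Z^3) DCK cost."""
    return (N ** 4 * T ** rho) / (N ** 2 * T * Z ** 3)

assert round(sck_speedup(N=50, T=557, Z=20), 1) == 3.5    # ~3.5x for T = 557
assert round(sck_speedup(N=50, T=5000, Z=20)) == 31       # ~31x; quoted as ~32x
assert round(dck_speedup(N=50, T=557, Z=5), -3) == 11000  # ~11000x for Z = 5
```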

6 Experiments

In this section, we present experiments using our models on three benchmark 3D skeleton-based action recognition datasets, namely (i) UTKinect-Action [9], (ii) Florence3D-Action [10], and (iii) MSR-Action3D [11]. We also present experiments evaluating the influence of various hyper-parameters, such as the number of pivots Z used for linearizing the body-joint and temporal kernels, the impact of Eigenvalue Power Normalization, and factor equalization.

6.1 Datasets

UTKinect-Action [9] dataset consists of 10 actions performed twice by 10 different subjects, and has 199 action sequences. The dataset provides 3D coordinate annotations of 20 body-joints for every frame. The dataset was captured with a stationary Kinect sensor and contains significant viewpoint and intra-class variations.

Florence3D-Action [10] dataset consists of 9 actions performed two to three times by 10 different subjects. It comprises 215 action sequences. 3D coordinate annotations of 15 body-joints are provided for every frame. This dataset was also captured with a Kinect sensor and contains significant intra-class variations, i.e., the same action may be articulated with the left or right hand. Moreover, some actions, such as drinking and making a phone call, can be visually ambiguous.

MSR-Action3D [11] dataset comprises 20 actions performed two to three times by 10 different subjects, for a total of 557 action sequences. 3D coordinates of 20 body-joints are provided for every frame. This dataset was captured using a Kinect-like depth sensor and exhibits strong inter-class similarity.

In all experiments, we follow the standard protocols for these datasets. We use the cross-subject test setting, in which half of the subjects are used for training and the remaining half for testing. Similarly, we divide the training set into two halves for training and validation. Additionally, we use two protocols for MSR-Action3D according to approaches [11, 17], where the latter protocol uses three subsets grouping related actions together.
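The cross-subject split can be sketched as follows (a minimal helper of our own, assuming each sequence is tagged with a subject identifier):

```python
def cross_subject_split(subject_ids, train_subjects):
    # Partition sequence indices by subject: sequences performed by
    # `train_subjects` go to training, all other subjects' go to testing,
    # so no subject appears in both sets.
    train = [i for i, s in enumerate(subject_ids) if s in train_subjects]
    test = [i for i, s in enumerate(subject_ids) if s not in train_subjects]
    return train, test

# e.g., odd-numbered subjects train, even-numbered subjects test
ids = [1, 2, 3, 4, 5, 1, 2]          # subject of each sequence
tr, te = cross_subject_split(ids, {1, 3, 5})
```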

6.2 Experimental Setup

For the sequence compatibility kernel, we first normalize all body-keypoints with respect to the hip joints across frames, as indicated in Sect. 4.2. Moreover, the lengths of all body-parts are normalized with respect to a reference skeleton. This setup follows the pre-processing suggested in [4]. For our dynamics compatibility kernel, we use unnormalized body-joints and assume that the displacements of body-joint coordinates across frames capture their temporal evolution implicitly.
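The hip-centering step can be sketched as follows (our own minimal helper; the hip index is an assumption that depends on the dataset's joint ordering, and the body-part length normalization is omitted):

```python
import numpy as np

def hip_center(seq, hip_idx=0):
    # seq: array of shape (T, J, 3) -- T frames, J body-joints, 3D coords.
    # Subtract the hip joint position in every frame, so each skeleton
    # is expressed in hip-centered coordinates.
    return seq - seq[:, hip_idx:hip_idx + 1, :]
```

After centering, the hip joint sits at the origin in every frame, removing global translation of the subject.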

Sequence Compatibility Kernel. In this section, we first present experiments evaluating the influence of parameters \(\sigma _2\) and \(\sigma _3\) of kernels \(G_{\sigma _2}\) and \(G_{\sigma _3}\) which control the degree of selectivity for the 3D body-joints and temporal shift invariance, respectively. See Sect. 4.2 for a full definition of these parameters.

Fig. 3.

Figure (a) illustrates the classification accuracy on Florence3D-Action for the sequence compatibility kernel when varying radii \(\sigma _2\) (body-joints subkernel) and \(\sigma _3\) (temporal subkernel). Figure (b) evaluates the behavior of SCK w.r.t. the number of pivots \(Z_2\) and \(Z_3\). Figure (c) demonstrates the effectiveness of our slice-wise Eigenvalue Power Normalization in tackling burstiness by varying the parameter \(\gamma \).

Furthermore, recall that the kernels \(G_{\sigma _2}\) and \(G_{\sigma _3}\) are approximated via linearizations according to Eqs. (1) and (3). The quality of these approximations and the size of our final tensor representations depend on the numbers of pivots \(Z_2\) and \(Z_3\) chosen. In our experiments, the pivots \(\varvec{\zeta }\) are spaced uniformly within the intervals \([-1;1]\) and \([0;1]\) for kernels \(G_{\sigma _2}\) and \(G_{\sigma _3}\), respectively.
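The idea behind pivot-based linearization can be illustrated in one dimension: mapping a value to its Gaussian responses at a set of pivots yields a feature vector whose inner products track the original RBF kernel. This is an idealized sketch of ours with densely spaced pivots; the actual scheme of Eqs. (1) and (3) uses only a handful of pivots with appropriate weighting:

```python
import numpy as np

def gauss(r, sigma):
    # 1D Gaussian (RBF) kernel response for displacement r.
    return np.exp(-r**2 / (2 * sigma**2))

def feat(x, pivots, sigma):
    # Feature map: responses of x to Gaussians of width sigma/sqrt(2)
    # centered at the pivots.
    return gauss(x - pivots, sigma / np.sqrt(2))

pivots = np.linspace(-2, 2, 41)   # densely spaced, for illustration only
sigma = 0.5
x, y = 0.3, -0.2
fx, fy = feat(x, pivots, sigma), feat(y, pivots, sigma)
# Normalized inner product of feature maps approximates the exact kernel:
k_norm = (fx @ fy) / np.sqrt((fx @ fx) * (fy @ fy))
```

Here `k_norm` closely tracks `gauss(x - y, sigma)`; with few pivots the approximation coarsens, which is the trade-off studied in Fig. 3b.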

Figures 3a and b present the results of this experiment on the Florence3D-Action dataset; we report results on the test set, as we observed exactly the same trends on the validation set.

Figure 3a shows that the body-joint compatibility subkernel \(G_{\sigma _2}\) requires a choice of \(\sigma _2\) that is not too strict, as specific body-joints (e.g., the elbow) cannot be expected to recur across sequences in exactly the same position. On the one hand, a very small \(\sigma _2\) leads to poor generalization. On the other hand, a very large \(\sigma _2\) allows large displacements of the corresponding body-joints between sequences, which results in poor discriminative power of this kernel. Furthermore, Fig. 3a demonstrates that the temporal subkernel performs very well over a large range of \(\sigma _3\); however, as \(\sigma _3\) becomes very small or very large, extreme temporal selectivity or full temporal invariance, respectively, results in a loss of performance. For instance, \(\sigma _3=4\) yields only \(91\,\%\) accuracy.

In Fig. 3b, we show the performance of our SCK kernel with respect to the number of pivots used for linearization. For the body-joint compatibility subkernel \(G_{\sigma _2}\), we see that \(Z_2=5\) pivots are sufficient to obtain good performance of \(92.98\,\%\) accuracy. We have observed that this is consistent with the results on the validation set. Using more pivots, say \(Z_2=20\), deteriorates the results slightly, suggesting overfitting. We make similar observations for the temporal subkernel \(G_{\sigma _3}\) which demonstrates good performance for as few as \(Z_3=2\) pivots. Such a small number of pivots suggests that linearizing 1D variables and generating higher-order co-occurrences, as described in Sect. 4.2, is a simple, robust, and effective linearization strategy.

Further, Fig. 3c demonstrates the effectiveness of our slice-wise Eigenvalue Power Normalization (EPN) described in Eq. (12). When \(\gamma =1\), the EPN functionality is absent. This results in a drop in performance from \(92.98\,\%\) to \(88.7\,\%\) accuracy, which demonstrates that statistically unpredictable bursts of actions described by the body-joints, such as long versus short hand waving, are indeed undesirable. It is clear that in such cases EPN is very effective, as in practice it handles correlated bursts, e.g., a hand wave co-occurring with the associated elbow and neck motion. For more details on this concept, see [32]. For our further experiments, we choose \(\sigma _2=0.6\), \(\sigma _3=0.5\), \(Z_2=5\), \(Z_3=6\), and \(\gamma =0.36\), as dictated by cross-validation.
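The burst-compressing effect of power normalization can be illustrated in one dimension. This is a sketch of ours, not the paper's slice-wise EPN, which applies the same map to eigenvalues of tensor slices:

```python
import numpy as np

def power_norm(v, gamma=0.36):
    # Sign-preserving power normalization: large ("bursty") responses
    # are compressed while small responses are relatively amplified.
    return np.sign(v) * np.abs(v) ** gamma

v = np.array([100.0, 1.0, -0.01])   # a "bursty" feature vector
w = power_norm(v)
# the 100:1 ratio between the two positive entries shrinks to about 5:1,
# and the sign of the negative entry is preserved
```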

Fig. 4.

Figure (a) enumerates the body-joints in the Florence3D-Action dataset. The table below lists the subsets A-I of body-joints used to build the representations evaluated in Fig. (b), which shows the performance of our dynamics compatibility kernel w.r.t. these subsets. Figure (c) demonstrates the effectiveness of equalizing the factors in the non-symmetric tensor representation by HOSVD Eigenvalue Power Normalization, obtained by varying \(\gamma \).

Dynamics Compatibility Kernel. In this section, we evaluate the influence of choosing parameters for the DCK kernel. Our experiments are based on the Florence3D-Action dataset. We present the scores on the test set as the results on the validation set match these closely. As this kernel considers all spatio-temporal co-occurrences of body-joints, we first evaluate the impact of the joint subsets we select for generating this representation as not all body-joints need to be used for describing actions.

Figure 4a enumerates the body-joints that describe every 3D human skeleton in the Florence3D-Action dataset, whilst the table underneath lists the proposed body-joint subsets A-I which we use for computing DCK. In Fig. 4b, we plot the performance of our DCK kernel for each subset. The plot shows that using just the two body-joints associated with the hands (Configuration-A) in the DCK kernel construction already attains \(88.32\,\%\) accuracy, which highlights the informativeness of temporal dynamics. For Configuration-D, which includes six body-joints such as the knees, elbows and hands, our performance reaches \(93.03\,\%\). This suggests that some of the body-joints excluded from this configuration may be noisy and therefore detrimental to classification.

As Configuration-E includes eight body-joints such as the feet, knees, elbows and hands, we choose it for our further experiments, as it represents a reasonable trade-off between performance and the size of representations. This configuration scores \(92.77\,\%\) accuracy. If we utilize all the body-joints according to Configuration-I, the performance of \(91.65\,\%\) accuracy is still somewhat lower than the \(93.03\,\%\) accuracy of Configuration-D, highlighting again the issue of noisy body-joints.

In Fig. 4c, we show the performance of our DCK kernel when HOSVD factors underlying our non-symmetric tensors are equalized by varying the EPN parameter \(\gamma \). For \(\gamma =1\), HOSVD EPN is absent which leads to \(90.49\,\%\) accuracy only. For the optimal value of \(\gamma =0.85\), the accuracy rises to \(92.77\,\%\). This again demonstrates the presence of the burstiness effect in temporal representations.

Comparison to the State of the Art. In this section, we compare the performance of our representations against the best performing methods on the three datasets. Along with comparing SCK and DCK, we will also explore the complementarity of these representations in capturing the action dynamics by combining them.

On the Florence3D-Action dataset, we present our best results in Table 1a. Note that the model parameters for this evaluation were selected by cross-validation. Linearizing the sequence compatibility kernel with these parameters resulted in a tensor representation of 26,565 dimensions and yielded an accuracy of \(92.98\,\%\). As for the dynamics compatibility kernel (DCK), our model selected Configuration-E (described in Fig. 4a), resulting in a representation of dimensionality 16,920 that achieved a performance of \(92\,\%\). However, somewhat better results were attained by Configuration-D, namely \(92.27\,\%\) accuracy for a size of 9,450. Combining the SCK representation with DCK in Configuration-E results in an accuracy of \(95.23\,\%\). This constitutes a \(4.5\,\%\) improvement over the state of the art on this dataset, as listed in Table 1a, and demonstrates the complementary nature of SCK and DCK. To the best of our knowledge, this is the highest performance attained on this dataset.

Table 1. Evaluations of SCK and DCK and comparisons to the state-of-the-art results on (a) the Florence3D-Action and (b) UTKinect-Action dataset.

Action recognition results on the UTKinect-Action dataset are presented in Table 1b. For our experiments on this dataset, we kept all the parameters the same as those used on the Florence3D dataset (described above). On this dataset, the SCK and DCK representations yield \(96.08\,\%\) and \(97.5\,\%\) accuracy, respectively. Combining SCK and DCK yields \(98.2\,\%\) accuracy, marginally outperforming the more complex approach described in [4], which uses Lie group algebra on SE(3) matrix descriptors and requires practical extensions such as discrete time warping and Fourier temporal pyramids to attain this performance, both of which we avoid completely.

Table 2. Results on SCK and DCK and comparisons to the state of the art on MSR-Action3D.

In Table 2, we present our results on the MSR-Action3D dataset. Again, we kept all the model parameters the same as those used on the Florence3D dataset. Conforming to prior literature, we use two evaluation protocols on this dataset, namely (i) the protocol described in actionlets [17], in which the entire dataset with its 20 classes is used for training and evaluation, and (ii) the approach of [11], in which the data is divided into three subsets and the average classification accuracy over these subsets is reported. The SCK representation yields state-of-the-art accuracies of \(90.72\,\%\) and \(93.52\,\%\) for the two evaluation protocols, respectively. Combining SCK with DCK outperforms the other approaches listed in the table and yields \(91.45\,\%\) and \(93.96\,\%\) accuracy for the two protocols, respectively.

Processing Time. For SCK and DCK, processing a single sequence with unoptimized MATLAB code on a single core of an i5 CPU takes 0.2 s and 1.2 s, respectively. Training on the full MSR-Action3D dataset with SCK and DCK takes about 13 min. In comparison, extracting SE(3) features [4] takes 5.3 s per sequence, processing the full MSR-Action3D dataset takes \(\sim \)50 min., and with post-processing (time warping, Fourier pyramids, etc.) this rises to about 72 min. Therefore, SCK and DCK are about \(5.4\!\times \) faster.

7 Conclusions

We have presented two kernel-based tensor representations for action recognition from 3D skeletons, namely the sequence compatibility kernel (SCK) and dynamics compatibility kernel (DCK). SCK captures the higher-order correlations between 3D coordinates of the body-joints and their temporal variations, and factors out the need for expensive operations such as Fourier temporal pyramid matching or dynamic time warping, commonly used for generating sequence-level action representations. Further, our DCK kernel captures the action dynamics by modeling the spatio-temporal co-occurrences of the body-joints. This tensor representation also factors out the temporal variable, whose length depends on each sequence. Our experiments substantiate the effectiveness of our representations, demonstrating state-of-the-art performance on three challenging action recognition datasets.