
1 Introduction

Unsupervised basis learning [8] is a class of methods that aims to discover the underlying low-dimensional structure of data. The methods serve different purposes depending on the task. For example, methods such as [23, 32, 36] are usually used for dimensionality reduction, while sparse coding [17, 21], non-negative matrix factorization [14] and clustering methods [5, 37] are usually used as data modeling tools.

Fig. 1. Illustrations of D-hulls and P-hulls. (a) The convex combinations of training data form a convex hull called the D-hull. (b) The red dots denote the archetypes. Each archetype is chosen from convex combinations of training data, so it is within the D-hull. Archetypal analysis represents data by convex combinations of archetypes, which form another hull called the P-hull. (c) To minimize the representation error, the P-hull should be large enough to represent all the data. (Color figure online)

Archetypal Analysis (AA) [4, 6] is an unsupervised learning method which represents data by convex combinations of archetypes. One distinctive property of AA is that the archetypes are themselves convex combinations of the training data, which binds the archetypes to the data. The method has a geometric interpretation; see Fig. 1. First, convex combinations of the archetypes form a principal convex hull which we call the P-hull. To minimize the representation error on the dataset, the P-hull should be large enough to cover as many training data as possible. Second, since each archetype is required to be chosen from the convex hull (D-hull) formed by convex combinations of the training data, the P-hull should be within the D-hull. So archetypal analysis seeks the largest P-hull whose vertices are within the D-hull. In general, the learned archetypes approximately correspond to the extremal points in the dataset.

Despite its nice interpretability, the method has not received much attention because it is slow, especially when the dataset is large. To scale to large training datasets, sparse coding uses online dictionary learning [20] to iteratively update the bases, where each iteration is based on a mini-batch of data. However, this learning scheme cannot be applied to AA because of the explicit binding between the data and the archetypes. More specifically, updating the archetypes requires recomputing the coefficients with respect to all the training data, which makes mini-batch learning meaningless.

In this work, we propose a fast variant of archetypal analysis called Decoupled Archetypal Analysis (DAA). The core idea is to decouple the binding between the data and the archetypes, thus making mini-batch based dictionary learning possible. The method first projects the data onto the unit sphere without severely distorting the data structures. Then it learns a set of archetypes on the sphere. Instead of requiring the archetypes to be within the D-hull as in the original AA method, we require them to be within the unit sphere which is a relaxation of the original requirement.

Fig. 2. Comparison of (a) archetypal analysis and (b) our method. The gray and red dots represent the training data and the learned archetypes, respectively. Our method first embeds the two-dimensional data onto the unit sphere in 3D and learns a convex hull whose surfaces are close to the data. (Color figure online)

As shown in Fig. 2(b), the method learns a convex hull (the triangle in the figure) within the sphere and represents data using its surfaces. By minimizing the representation error, DAA drives the surfaces to be close to the data; thus the convex hull vertices are pushed towards the extremal points. The union of the surfaces resembles the P-hull in AA.

The decoupling between data and archetypes enables us to use an online algorithm (similar to [20]) to optimize the problem, which scales gracefully to large datasets. For example, we learn 400 archetypes from 300K human pose data in just a few minutes. We present extensive experiments on realistic data sources including human faces and poses. In particular, we obtain decent performance on 3D human pose estimation and pose-based action recognition, comparable to the state of the art.

2 Related Work

We briefly discuss the related work on unsupervised learning including sparse coding, non-negative matrix factorization and manifold learning.

2.1 Sparse Coding and Non-negative Matrix Factorization

It is natural to compare our approach to sparse coding [9, 24, 25, 33], which activates only a small number of bases when representing a datum. Sparse coding enforces that the \(l_1\)-norm of the coefficients is smaller than a threshold \(\lambda \). In contrast, our method requires that the \(l_1\)-norm of the coefficients equals one. In addition, our representation is also sparse, which we discuss in more detail later. Another related method is non-negative matrix factorization (NMF), which assumes the data are non-negative and learns two non-negative, rank-reduced factor matrices whose product approximates the input data. Some variants of sparse coding, such as [13], combine these two methods and require non-negative coefficients in addition to sparsity in order to learn additive object-part features. Our approach differs from the above approaches in that we require the data and bases to be unit-normalized and enforce joint non-negativity and sum-to-one constraints.

2.2 Manifold Learning

Principal Component Analysis (PCA) is a common method for representing high-dimensional data in terms of bases. It is well suited for data which lie in a linear space, or equivalently on a low-dimensional linear manifold, but it is inappropriate for data which lie in non-linear spaces. The limitations of PCA motivate researchers to develop approaches for detecting low-dimensional manifold structures in high-dimensional data [1, 7, 12, 27, 31, 39]. Typically these methods use projections that preserve local properties of the data (e.g., geodesic distance). These methods are global in that a single low-dimensional structure represents all the data, and they typically assume the manifold is connected. Although our approach also represents data by bases (archetypes), it does not explicitly learn the low-dimensional data manifolds as the above approaches do. Instead it learns a set of representative bases, corresponding to the extremal data, to best represent the dataset.

3 Revisit Archetypal Analysis

We consider a dataset \( \mathbf {X} = \{ \mathbf {x}_1, \cdots , \mathbf {x}_n \} \) having n training data where \( \mathbf {x}_i \in \mathbb {R}^m \) is a datum. Archetypal analysis represents a datum \( \mathbf {x} \) by convex combinations of archetypes \( \mathbf {Z}=\{ \mathbf {z}_1, \cdots , \mathbf {z}_k \} \): \( \mathbf {x} = \mathbf {Z} \varvec{\alpha } \) where \( \varvec{\alpha } \) resides in the simplex \( \mathrm {\Delta }_k \):

$$\begin{aligned} \mathrm {\Delta }_k \triangleq \left\{ \varvec{\alpha } \in \mathbb {R}^k \text {~s.t.~} \varvec{\alpha } \succeq 0 \text { and } \Vert \varvec{\alpha } \Vert _1 = 1 \right\} . \end{aligned}$$
(1)

Meanwhile each archetype \( \mathbf {z} \) is required to be chosen from the convex combinations of the training data: \( \mathbf {z} = \mathbf {X} \varvec{\beta } \) where \( \varvec{\beta } \in \mathrm {\Delta }_n \). The archetypes are learned by minimizing the representation error on the whole training dataset:

$$\begin{aligned} \begin{aligned} \underset{\mathbf {A},\mathbf {B}}{\min }&{ \left\| \mathbf {X}-\mathbf {XBA} \right\| _{\mathrm {F}}^2 }, \quad \mathbf {A}=[\varvec{\alpha }_1, \cdots ,\varvec{\alpha }_n], \quad \mathbf {B}=[\varvec{\beta }_1, \cdots , \varvec{\beta }_k] \\&\varvec{\alpha }_i \in \mathrm {\Delta }_k, \quad \text {for} \quad i=1 \dots n, \quad \varvec{\beta }_j \in \mathrm {\Delta }_n, \quad \text {for} \quad j=1 \dots k, \end{aligned} \end{aligned}$$
(2)

where \( \mathbf {Z} = \mathbf {X} \mathbf {B} \) denotes the learned k archetypes.

It helps to give a geometric interpretation of AA, which inspires our approach. There are two convex hulls, constructed from the dataset (D-hull) and the archetypes (P-hull), respectively. See Fig. 1 for an illustration. Loosely speaking, the P-hull is within the D-hull and is optimized to have the largest overlap with it. Ideally, the P-hull coincides with the D-hull.

4 Decoupled Archetypal Analysis

We learn archetypes without requiring them to be chosen from convex combinations of the data. However, if there is no constraint on the archetypes, the problem becomes ill-posed and has infinitely many solutions. We resolve this by normalizing the data to lie on the unit sphere and constraining the archetypes to lie within the unit sphere, which can be regarded as a relaxation of the constraints in AA.

4.1 Project Data onto the Unit Sphere

We normalize each datum \( \mathbf {x}_i \) such that it lies on the unit sphere: \( \Vert \mathbf {x}_i \Vert _2 = 1, \ \forall i \). This is usually accomplished by directly normalizing the data, \( \mathbf {x}_i \mapsto \frac{\mathbf {x}_i}{\Vert \mathbf {x}_i\Vert _2} \), or by centering and then normalizing, \( \mathbf {x}_i \mapsto \frac{\mathbf {x}_i-\bar{\mathbf {x}}}{\Vert \mathbf {x}_i-\bar{\mathbf {x}}\Vert _2} \), where \( \bar{\mathbf {x}} = \frac{1}{n} \sum _{i=1}^{n}{\mathbf {x}_i} \) is the data mean. For many signals, little information is lost in normalization; indeed, normalization is a common first step in many applications. We directly normalize the data in the experiments in this paper unless stated otherwise.
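For concreteness, the two normalization schemes can be written as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def normalize_direct(X):
    """Project each row of X onto the unit sphere: x -> x / ||x||_2."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)   # guard against zero vectors

def normalize_centered(X):
    """Subtract the data mean, then project each row onto the unit sphere."""
    return normalize_direct(X - X.mean(axis=0, keepdims=True))
```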

Fig. 3. (a) The torus is flattened into a band when normalized. (b) Stereographic projection from the equatorial plane onto the sphere maps P to Q. (c) A triangle in the plane and the corresponding curved triangle on the sphere.

However, some structures may collapse under direct normalization. Consider, for example, the usual two-dimensional torus. Figure 3(a) shows the torus and its normalization on the sphere. The normalization flattens the torus into a band around the equator and the inner cavity is lost. To avoid such collisions, we use stereographic projection to map data in \( \mathbb {R}^d \) onto the d-dimensional sphere \( \mathbb {S}^d \) in \( \mathbb {R}^{d+1} \). Figure 3(b) shows stereographic projection from the equatorial plane onto the sphere, through the north pole N. The point P is mapped to the point Q, the intersection of the line through N and P with the unit sphere. It is worth noting that this is an invertible conformal map, i.e., it preserves angles between curves. The mapping is \( P \mapsto Q = N + \frac{2}{1 + \Vert P \Vert ^2}{(P - N)} \), and the inverse mapping is \( Q \mapsto P = N + \frac{1}{1 - Q_{d + 1}}{(Q - N)} \), where \( Q_{d + 1} \) is the \( (d+1) \)-th coordinate of Q. In some situations, the entire data analysis can be carried out in the normalized space rather than in the original space. But in other situations, e.g., for visualization purposes, it helps to map the learned bases back to the original space. Figure 3(c) shows an example in which the learned bases are back-projected to the original space.
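Both mappings translate directly into code. Below is a minimal NumPy sketch under our conventions that points are stored row-wise and the north pole N is the last standard basis vector:

```python
import numpy as np

def stereographic_forward(P):
    """Map points P in R^d (one per row) onto S^d in R^{d+1} through the north pole."""
    n, d = P.shape
    N = np.zeros(d + 1)
    N[-1] = 1.0                                   # north pole
    P_emb = np.hstack([P, np.zeros((n, 1))])      # embed P in the equatorial plane
    s = 2.0 / (1.0 + np.sum(P * P, axis=1, keepdims=True))
    return N + s * (P_emb - N)                    # Q = N + 2/(1+||P||^2) (P - N)

def stereographic_inverse(Q):
    """Map points Q on S^d (one per row) back to the equatorial plane R^d."""
    N = np.zeros(Q.shape[1])
    N[-1] = 1.0
    s = 1.0 / (1.0 - Q[:, -1:])                   # undefined at the north pole itself
    P_emb = N + s * (Q - N)                       # P = N + 1/(1 - Q_{d+1}) (Q - N)
    return P_emb[:, :-1]
```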

4.2 Learning the Archetypes

We now describe how to learn a set of archetypes \( \mathbf {D} = [\mathbf {d}_1, \cdots , \mathbf {d}_k] \) from the normalized data. We first give the formulation and then explain why the formulation can learn the archetypes. The following is the proposed formulation:

$$\begin{aligned} \begin{aligned} \underset{\mathbf {D, C}}{\min }&{\Vert \mathbf {X}-\mathbf {DC} \Vert _{\mathrm {F}}^2}, \quad \mathbf {C}=[\mathbf {c}_1, \cdots , \mathbf {c}_n], \\&\mathbf {c}_i \in \mathrm {\Delta }_k, \quad \Vert \mathbf {x}_i\Vert _2=1, \quad \mathbf {D} \in \mathcal {C}, \end{aligned} \end{aligned}$$
(3)

where \( \mathcal {C} \triangleq \{ \mathbf {D} \in \mathbb {R}^{m \times k} \text {~s.t.~} \Vert \mathbf {d}_j \Vert _2^2 \le 1, \ \forall j = 1 \dots k \} \).

Fig. 4. Illustration of data reconstructions in 3D and 2D. The black points are the learned archetypes and the blue/cyan points are the data. (Color figure online)

Since the learned archetypes lie inside the unit sphere, their convex combinations form a convex hull inside the unit sphere. See Fig. 4 for an illustration. Since the data are normalized to lie on the unit sphere, they are guaranteed to lie on or outside the convex hull. So in the reconstruction stage, the data are projected onto the surfaces of the convex hull so as to minimize the representation error. In the learning stage, to minimize the overall representation error on the whole dataset, the algorithm pushes the surfaces as close to the data as possible. The union of the convex hull surfaces (onto which data are projected) resembles the P-hull defined in the original archetypal analysis. The convex hull vertices are the learned archetypes. In summary, directly optimizing problem (3) learns archetypes which approximately correspond to the extremal points in the dataset.

4.3 Represent Data by Archetypes

After learning the archetypes, we represent a datum \( \mathbf {x} \) by convex combinations of the archetypes: \( \hat{\mathbf {x}} = \mathbf {Dc}^* \), where

$$\begin{aligned} \mathbf {c}^{*} \triangleq \underset{\mathbf {c} \in \mathrm {\Delta }_k}{{{\mathrm{arg\,min}}}} {\Vert \mathbf {x}-\mathbf {Dc}\Vert _2^2}. \end{aligned}$$
(4)

We would like to note that by solving equation (4), the datum is projected onto the surface of the convex hull. In the ideal case, the difference between \( \mathbf {x} \) and the reconstructed datum \( \hat{\mathbf {x}} \) is small.
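For illustration, problem (4) can be solved with a generic constrained solver. The sketch below uses SciPy's SLSQP method, which we choose only for exposition; the faster active-set solver actually used is discussed in Sect. 5.2:

```python
import numpy as np
from scipy.optimize import minimize

def represent(x, D):
    """Project a datum x onto the convex hull of the archetypes (columns of D)
    by solving Eq. (4): min_{c in simplex} ||x - Dc||_2^2."""
    k = D.shape[1]
    cons = ({'type': 'eq', 'fun': lambda c: np.sum(c) - 1.0},)   # sum(c) = 1
    bounds = [(0.0, 1.0)] * k                                    # c >= 0
    c0 = np.full(k, 1.0 / k)
    res = minimize(lambda c: np.sum((x - D @ c) ** 2), c0,
                   method='SLSQP', bounds=bounds, constraints=cons)
    return D @ res.x, res.x          # reconstruction and coefficients
```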

One interesting property of our method is that only archetypes on the same surface can be activated at the same time (co-activated). For example, in Fig. 4(b), the archetypes \( \mathbf {d}_4 \) and \( \mathbf {d}_6 \) will never be co-activated to represent a datum: if they were, the datum would be projected into the interior of the convex hull, which is impossible, since reaching the interior requires first crossing a surface, and projecting onto that surface achieves a smaller error. So the archetypes learned by our method are automatically clustered into groups, which provides a tight representation of the dataset. This is a desirable property for reconstruction and classification tasks which need strong regularization capabilities.

4.4 Geometric Comparison with Related Work

To highlight their differences from our approach, we provide geometric interpretations of sparse coding and non-negative matrix factorization. Sparse coding represents a datum by a small number of bases by constraining the \(l_1\)-norm of the coefficient vector \(\Vert \varvec{\alpha }\Vert _1\) to be smaller than a constant \(\lambda \). This constraint can be interpreted geometrically: we augment each basis \(\mathbf {d}\) by its negative \(-\mathbf {d}\) and require non-negative coefficients, as shown in Fig. 5(a). Sparse coding then projects a datum onto a scaled convex hull (cyan in Fig. 5(a)) formed by the bases, where the scale \(\Vert \varvec{\alpha }\Vert _1\) varies for different data. This varying scale makes it non-trivial to interpret the properties of the bases. In contrast, the scale of the convex hull in our approach is fixed to one.

Fig. 5. Geometric interpretation of sparse coding and comparison of different methods on a toy dataset. The orange points represent the learned bases, and points within the orange area can be reconstructed without any error. (Color figure online)

We also compare different methods on a toy dataset in Fig. 5(b). Geometrically, data lie in a convex polyhedral cone and NMF aims to find the edges of the cone. Similar to archetypal analysis, NMF also tends to learn a large convex hull for non-convex datasets and does not explore the local structures of the manifold. Instead, our method tends to use a mixture of small surfaces to approximate the local non-convex regions.

5 Optimization

We sketch the optimization algorithm in this section. Although problem (3) is not convex with respect to \( \mathbf {D} \) and \( \mathbf {C} \) jointly, it is convex with respect to either one with the other fixed. We propose an online learning algorithm similar to [20]. At each iteration, only a mini-batch of samples is used to update the dictionary of archetypes, enabling us to handle large-scale datasets. As in [20], the sufficient statistics with respect to the coefficients \( \mathbf {c}_i \) are aggregated and used for the dictionary updates.

5.1 Algorithm Outline

The outline is shown in Algorithm 1. At each iteration t, we process one sample \( \mathbf {x}_t \) (the extension to multiple samples is straightforward). Given the sample for the current iteration, we first compute the coefficient by solving equation (5) based on the dictionary \( \mathbf {D}_{t - 1} \) computed in the last iteration. Then the sufficient statistics of \( \mathbf {c}_i \), namely \( \mathbf {A} \) and \( \mathbf {B} \), are updated. Finally, the dictionary is updated based on the historical observations by minimizing \( \hat{f}_t (\mathbf {D}) \triangleq \frac{1}{t} \sum _{i = 1}^{t} \Vert \mathbf {x}_i - \mathbf {Dc}_i \Vert _2^2 \), where the coefficient \( \mathbf {c}_i \) is the one obtained at iteration i of the algorithm. Reusing these stored coefficients approximates \( \hat{\mathbf {c}}_i (\mathbf {D}_{t - 1}) \), the coefficients that would result from recomputing all previously seen data against the up-to-date dictionary \( \mathbf {D}_{t-1} \). We describe the details of computing the coefficients and updating the dictionary in the following subsections.

Algorithm 1
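The following is a minimal NumPy sketch of the loop, reflecting our reading of Algorithm 1; solve_coeff and update_dictionary correspond to the steps detailed in Sects. 5.2 and 5.3, and the random initialization is our own choice:

```python
import numpy as np

def learn_archetypes(data_stream, m, k, n_iters, seed=0):
    """Online learning of k archetypes from unit-normalized samples in R^m."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, k))
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)  # columns in unit ball
    A = np.zeros((k, k))        # sufficient statistic: sum of c_t c_t^T
    B = np.zeros((m, k))        # sufficient statistic: sum of x_t c_t^T
    for t in range(n_iters):
        x = next(data_stream)            # one unit-normalized sample
        c = solve_coeff(x, D)            # Eq. (5), Sect. 5.2
        A += np.outer(c, c)
        B += np.outer(x, c)
        D = update_dictionary(D, A, B)   # Eq. (6), Sect. 5.3
    return D
```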

5.2 Compute Coefficients

To calculate the coefficient \( \mathbf {c}_t \) given the current dictionary \( \mathbf {D}_{t - 1} \), we solve the following problem:

$$\begin{aligned} \mathbf {c}_t \triangleq \underset{\mathbf {c} \in \mathrm {\Delta }_k}{{{\mathrm{arg\,min}}}} \Vert \mathbf {x}_t - \mathbf {D}_{t - 1} \mathbf {c} \Vert _2^2, \end{aligned}$$
(5)

which is a least-squares problem with a simplicial constraint. Several approaches have been applied to this kind of problem: Cutler et al. [6] proposed a penalty approach, and the fast iterative shrinkage-thresholding algorithm (FISTA) could also be used. Following [4], we use an active-set algorithm, which benefits from the sparsity of the coefficients \( \mathbf {c} \). It is worth noting that the coefficients obtained from equation (5) are sparse because of the interaction between the curved geometry of the sphere and the linear surfaces, which encourages using low-dimensional surfaces to represent the data. The maximum number of activated archetypes is determined by the number of vertices of the surface.
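Since the active-set solver is somewhat involved, the sketch below illustrates the structure of problem (5) with a simpler projected-gradient stand-in (not the solver used in our implementation); project_simplex is the standard Euclidean projection onto the probability simplex:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_coeff(x, D, n_steps=200):
    """Solve Eq. (5) by projected gradient descent with a fixed step size."""
    k = D.shape[1]
    c = np.full(k, 1.0 / k)
    lip = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    for _ in range(n_steps):
        grad = 2.0 * D.T @ (D @ c - x)
        c = project_simplex(c - grad / lip)
    return c
```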

5.3 Update Dictionary

We update the dictionary using the sufficient statistics \( \mathbf {A}_t = \sum _{i = 1}^t \mathbf {c}_i \mathbf {c}_i^T \) and \( \mathbf {B}_t = \sum _{i = 1}^t \mathbf {x}_i \mathbf {c}_i^T \):

$$\begin{aligned} \begin{aligned} \mathbf {D}_t&\triangleq \underset{\mathbf {D} \in \mathcal {C}}{{{\mathrm{arg\,min}}}} \frac{1}{t} \sum _{i = 1}^t \Vert \mathbf {x}_i - \mathbf {D} \mathbf {c}_i \Vert _2^2 \\&= \mathop {{{\mathrm{arg\,min}}}}\limits _{\mathbf {D} \in \mathcal {C}} \frac{1}{t} ({{\mathrm{Tr}}}(\mathbf {D}^T \mathbf {D} \mathbf {A}_t) - 2 {{\mathrm{Tr}}}(\mathbf {D}^T \mathbf {B}_t)). \end{aligned} \end{aligned}$$
(6)

Since this objective has the same form as the one in [20], we use the same dictionary update scheme: block-coordinate descent with warm restart, whose main advantage is that it is parameter-free.

We sequentially update each column \( \mathbf {d}_j \) while fixing the others. The algorithm first computes a new position \( \mathbf {u}_j = \mathbf {d}_j + (1/\mathbf {A}[j, j])(\mathbf {b}_j - \mathbf {D} \mathbf {a}_j) \), where \( \mathbf {a}_j \) and \( \mathbf {b}_j \) are the j-th columns of \( \mathbf {A} \) and \( \mathbf {B} \), respectively. Then \( \mathbf {u}_j \) is projected onto the unit ball by \( \mathbf {d}'_j = \mathbf {u}_j/\max (\Vert \mathbf {u}_j \Vert _2, 1) \) to satisfy the constraint \( \Vert \mathbf {d}_j \Vert _2 \le 1 \). The convergence of this convex subproblem is guaranteed.
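A minimal sketch of one block-coordinate pass over the columns, assuming the accumulated statistics \( \mathbf {A} \) and \( \mathbf {B} \) from Algorithm 1:

```python
import numpy as np

def update_dictionary(D, A, B, eps=1e-10):
    """One warm-started block-coordinate descent pass on Eq. (6)."""
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] < eps:                 # archetype j not activated yet; skip
            continue
        u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project onto the unit ball
    return D
```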

6 Experiments

We evaluate the proposed method from three aspects. First, we evaluate the learning speed on large-scale datasets and compare with the original archetypal analysis method [4]. Second, we experimentally demonstrate that the convex combinations of the learned archetypes accurately approximate the D-hull. In particular, the archetypes correspond to the extremal points in the dataset. Third, we give applications of using the archetypes for data modeling in the context of reconstruction and classification tasks.

Fig. 6. Comparison of the learning speed for our approach and [4] on the H3.6M dataset. The results of our method and [4] are represented by blue and red lines, respectively. The solid line denotes the results on the training set and the dotted line denotes the results on the testing set. For [4], the points are plotted for each iteration. For our method, the points are plotted for every ten batches. (Color figure online)

6.1 Speed

We implement our algorithm based on Eigen [11]. For the original AA method [4], we use the fastest implementation, included in the SPAMS toolbox [20]. Both approaches use Intel MKL and multi-threading. The experiments are run on a single Intel i7-5930K CPU.

We experiment on the H3.6M dataset [16], which contains about three million human poses. The dimension of a pose is 48. Following the protocol in [22], we sample about 300K training poses and 3000 testing poses. We learn 400 archetypes from the dataset. The results are shown in Fig. 6.

First, the reconstruction error of our approach on the training dataset drops to a lower level than [4]. This is expected, as the archetypes in our approach have greater flexibility: they can be anywhere in the sphere, whereas in [4] the archetypes can only be in the D-hull. Second, our approach takes much less time than [4] to reach similar levels of reconstruction error. In particular, by the time the first iteration of archetypal analysis (which has to traverse the whole dataset once) is finished, our method has already updated the dictionary about 1700 times (around 3 epochs) and obtained a small reconstruction error. The gap becomes even larger as the number of training samples increases.

6.2 Interpretability

One distinctive property of archetypal analysis [4, 6] is that some of the learned archetypes correspond to the extremal points in the dataset. In this case, the convex combinations of the archetypes will form a convex hull which tightly surrounds the data hull. In this section, we examine the archetypes learned on two data sources: human faces and poses.

Fig. 7. 64 images from the Yale face dataset, corresponding to the same person under different lighting directions. We arrange the images according to the two axes of the lighting directions. The nearest neighbors of the archetypes identified by our method and by archetypal analysis are annotated with blue and red rectangles, respectively. (Color figure online)

Face Images. It is known that the imaging of an object under variable illumination can be accurately approximated by convex combinations of a set of images captured at extreme illumination conditions [10]. Hence this is a good data source to visually evaluate whether the proposed approach can identify the extremal points as the AA method does.

We experiment on the cropped Yale face dataset [10]. We normalize each image to have unit \(l_2\)-norm for our method; archetypal analysis directly uses the original images. The results are shown in Fig. 7. For visualization purposes, for each archetype, we find its nearest neighbor in the face dataset and highlight that face image in a particular color. The face images with blue rectangles denote the archetypes learned by our method; the archetypes learned by AA are marked with red rectangles. We can see that for both methods, the archetypes mostly correspond to the extremal illumination conditions, and the convex hull of the archetypes largely overlaps with the convex hull of the images, which justifies the effectiveness of our method.

Fig. 8. Visualization of the learned archetypes on human poses. We use [36] to project the data into 2D. The data and the learned archetypes are represented by cyan and red points respectively. The convex hull of the data/archetypes is plotted using the dotted line. (a) The archetypes learned on all poses. (b) The archetypes learned on the poses of a particular action. (Color figure online)

Pose. We also experiment on the large 3D human pose dataset H3.6M. We visualize the poses and the learned archetypes by projecting them to two-dimensional space using PCA [36]. The results are shown in Fig. 8. We can see that some of the learned archetypes correspond to the extremal points in the dataset. In addition, the convex combinations of the archetypes form a convex hull which tightly surrounds the training dataset. The left and right figures show the results learned on all poses and on poses of a particular action respectively.

6.3 Applications

In this section, we use the learned archetypes as a data modeling tool for 3D human pose estimation and pose-based action recognition where strong regularization capabilities would help.

3D Pose Estimation. A 2D pose is represented by its d joint locations, stacked into a vector \(\mathbf {x} \in \mathbb {R}^{2d}\); similarly, a 3D pose is denoted by \(\mathbf {y} \in \mathbb {R}^{3d}\). The 2D and 3D poses are related by the camera parameters \(\mathbf {M}\): \(\mathbf {x}=\mathbf {My}\) [26].

The task of 3D human pose estimation is to estimate the 3D pose \(\mathbf {y}\) and the camera parameters \(\mathbf {M}\) from a 2D pose \(\mathbf {x}\). It is usually achieved by minimizing \(\Vert \mathbf {x}-\mathbf {My} \Vert _2\), which is an ill-posed problem. The authors of [26] propose to reduce the ambiguities by representing a 3D pose with a set of PCA bases \(\mathbf {B}\) and seeking the optimal coefficients for a testing 2D pose: \(\mathbf {M}^*, \varvec{\beta }^*={{\mathrm{arg\,min}}}_{\mathbf {M}, \varvec{\beta }}{\Vert \mathbf {x}-\mathbf {M B} \varvec{\beta }\Vert _2}\). The estimated 3D pose is \(\hat{\mathbf {y}}=\mathbf {B} \varvec{\beta }^*\).

We replace the PCA bases in [26] with the archetypes learned by our approach. In particular, we learn 400 archetypes on the training set. We also compare with a related baseline which uses the bases learned by sparse coding [20]. The sparse bases are learned with the regularization factor set to \( \lambda = 1.2/\sqrt{3d} \) as suggested in [20], where 3d is the data dimension.
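To make the estimation pipeline concrete, the sketch below alternates between a camera step and a coefficient step. It assumes a weak-perspective camera (a single 2×3 matrix applied per joint) and simplex-constrained coefficients; both are our simplifications and may differ from the exact parameterization in [26]. solve_coeff is the solver sketched in Sect. 5.2:

```python
import numpy as np

def estimate_pose(x2d, D, n_outer=20):
    """Estimate a 3D pose and camera from a 2D pose by alternating minimization.

    x2d: (d, 2) array of 2D joints; D: (3*d, k) matrix of 3D pose archetypes.
    """
    d, k = x2d.shape[0], D.shape[1]
    c = np.full(k, 1.0 / k)
    for _ in range(n_outer):
        y = (D @ c).reshape(d, 3)
        # camera step: least-squares fit of a 2x3 projection M, x2d ~ y M^T
        W, *_ = np.linalg.lstsq(y, x2d, rcond=None)
        M = W.T
        # coefficient step: fit simplex coefficients to the projected archetypes
        P = np.stack([(D[:, j].reshape(d, 3) @ M.T).ravel() for j in range(k)],
                     axis=1)
        c = solve_coeff(x2d.ravel(), P)   # simplex-constrained, Sect. 5.2
    return (D @ c).reshape(d, 3), M
```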

We experiment on the H3.6M dataset where the groundtruth 2D poses are assumed known. To reveal the properties of different bases (or archetypes), we calculate three metrics: (1) the projection error in 2D which is computed as \(\Vert \mathbf {x}-\mathbf {M} \mathbf {B} \varvec{\beta }^*\Vert \) when groundtruth camera parameters are known; the unit is pixel; (2) the mean per joint position error (MPJPE) in 3D when groundtruth camera parameters are known; the unit is mm; (3) MPJPE in 3D when camera parameters are jointly estimated.
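For reference, a minimal sketch of MPJPE (metrics (2) and (3)); inputs are assumed to be in mm:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: mean Euclidean distance over all joints.

    pred, gt: (n, d, 3) arrays of n poses with d joints each.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```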

Table 1. Pose estimation results on the H3.6M dataset. Results using estimated 2D pose are listed in parentheses for comparison with state-of-the-arts.

The results are shown in Table 1. First, we can see that the projection error of sparse coding is much smaller than that of archetypal analysis and our method, yet its 3D pose estimation error is much larger. This is because the bases learned by sparse coding have weaker regularization properties; in other words, sparse coding does not provide a tight, bounded representation of the data manifold as our method does. Second, the result of our method is similar to archetypal analysis. Third, the final error of 106.52 mm is achieved when the 2D poses are estimated by [2] and the camera parameters are jointly estimated using our method. The result is comparable to some state-of-the-art methods that rely on deep networks.

Pose-Based Action Recognition. We also evaluate our approach on a classification task of pose-based action recognition. We adopt a simple nearest-neighbor classification framework. For a pose sequence, following [35], we extract a set of snippets, each a short sequence of poses, to represent the original sequence. Then for each action class, we learn a set of archetypes from the snippets. In the testing stage, we reconstruct the snippets of a test sequence using the archetypes of each class, and the class achieving the minimum average reconstruction error over all snippets of the sequence is the predicted action class.
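A minimal sketch of this classification rule, assuming per-class archetype matrices have already been learned and reusing solve_coeff from Sect. 5.2:

```python
import numpy as np

def classify_sequence(snippets, dictionaries):
    """Predict the action class whose archetypes best reconstruct the snippets.

    snippets: (n, m) array of snippet features from one test sequence;
    dictionaries: {class_label: (m, k) archetype matrix} learned per class.
    """
    best_label, best_err = None, np.inf
    for label, D in dictionaries.items():
        errs = [np.sum((x - D @ solve_coeff(x, D)) ** 2) for x in snippets]
        err = float(np.mean(errs))
        if err < best_err:
            best_label, best_err = label, err
    return best_label
```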

We experiment on the NTU RGB+D dataset [28], which is a large benchmark consisting of four million poses. The number of archetypes is set to 1024 by cross-validation. For sparse coding, the regularization factor is set to 0.1 by cross-validation. Following previous works, we report the results using two different train/test split methods: Cross-Subject (CS) and Cross-View (CV).

Table 2. Action recognition accuracy (%) on the NTU RGB+D dataset with Cross-Subject and Cross-View settings.

The experimental results are shown in Table 2. It is not feasible to learn archetypes using [4] on such a large dataset, so we do not list its results. Our method outperforms sparse coding by \( 3.12\% \) and \( 3.32\% \) in accuracy for the CS and CV settings, respectively, as it learns a tighter representation than sparse coding. We also compare with state-of-the-art methods which take advantage of deep neural networks, such as [15, 18, 19, 28, 29, 38]. The recognition accuracy of our method is lower than VA-LSTM, possibly because our method does not model motion dynamics as an LSTM does. However, considering that our method is simple and not tuned for the action recognition task, the results are promising.

7 Conclusion

We present a fast variant of archetypal analysis which not only maintains the favorable interpretability of the original method but also scales gracefully to large datasets. It decouples the binding between data and archetypes, thus enabling an online optimization algorithm. The formulation learns tight representations, which is a favorable property for many tasks. Experiments on pose estimation and pose-based action recognition demonstrate its effectiveness.