
1 Introduction

The Symmetric Positive Definite (SPD) matrix has become a prevalent representation in many visual tasks, such as face recognition [12], action recognition [30], and object detection [25]. It utilizes second-order or higher-order statistics to capture the desirable feature distribution. Several works [16, 27, 28] try to model a more discriminative SPD matrix from local features. Meanwhile, calculating the distance metric on the SPD manifold is a crucial problem that comes along with the SPD matrix representation. Due to the non-Euclidean structure of SPD manifolds, the Euclidean metric cannot be applied to them directly. In this paper, we focus on a robust metric learning method on SPD manifolds.

Many efforts have been devoted to SPD matrix metrics, such as the Affine Invariant Metric (AIM) [19], Log-Euclidean Metric (LEM) [2], Bregman divergence [14], Stein divergence [21], and alpha-beta divergence [3, 4, 22]. Given a concrete metric, metric learning aims at learning proper metric parameters that keep similar pairs close and dissimilar pairs apart. Most existing metric learning methods on the SPD manifold learn a discriminative metric on the tangent Euclidean space [11, 23, 31].

However, how to learn a proper SPD matrix metric is still a challenging problem. The SPD matrix is aggregated from local features and contains different essential intrinsic properties. Existing SPD matrix metric learning methods [11, 23, 31] regard an SPD matrix merely as a global representation and exploit a direct metric on the complex manifold, ignoring the different roles of the intrinsic properties in the SPD matrix. It is unsuitable to treat intrinsic properties equally when they play different roles, e.g., have different distributions or significance. Therefore, we argue that an SPD matrix metric that models the different roles of intrinsic properties will achieve better performance.

In this paper, a novel metric learning method on SPD manifolds is proposed to solve the issues mentioned above. We first discover the intrinsic properties of an SPD matrix and then compute the SPD matrix metric considering their different roles. In particular, our method jointly learns multiple low-dimensional projections and a set-to-set metric. As the property discovery can be seen as feature extraction, we apply multiple low-dimensional manifold projections to the SPD matrix to discover discriminative intrinsic properties. Thus, the distance metric between two original SPD matrices is transformed into the distance metric between the two sets that contain the corresponding projected low-dimensional SPD matrices. The alpha-beta divergence is a learnable SPD matrix metric, so it is applied in our set-to-set metric to be adaptive to the intrinsic properties. We assign multiple alpha-beta divergences on different low-dimensional manifolds as sub-metrics and summarize these sub-metrics discriminatively as the SPD matrix metric. In this way, the different roles of intrinsic properties are involved in the SPD matrix metric. As shown by experiments, the proposed learnable metric is extremely helpful for capturing meaningful nearest neighbors of different original SPD matrices.

In summary, our contributions are three-fold.

(1) We propose a robust SPD matrix metric learning method that discovers discriminative intrinsic properties and models their different roles in the metric computation.

(2) We formulate the metric learning as a two-component joint optimization problem, i.e., multiple low-dimensional manifold projections and a set-to-set metric are learned jointly.

(3) We introduce a manifold optimization method that learns the metric parameters and guarantees the robustness of the proposed metric.

2 The Proposed Method

Throughout this paper, scalars are denoted by lower-case letters, vectors by bold lower-case letters, matrices by upper-case letters, and sets by bold upper-case letters.

Fig. 1. The flowchart of our SPD matrix metric learning method. Left: multiple projections \(f_W^1\), \(f_W^2\), and \(f_W^3\) used to discover intrinsic properties; Right: the computation of the set-to-set distance \(D_s\), which considers the different roles of intrinsic properties.

2.1 Problem Definition

This work aims to discover discriminative intrinsic properties in an SPD matrix and compute the distance of SPD matrices considering different roles of discovered properties. The property discovery can be regarded as a feature extraction process that projects an original SPD matrix to multiple low-dimensional SPD manifolds to form a set of the low-dimensional SPD matrices. We propose a set-to-set metric to consider different roles of intrinsic properties. Individual sub-metrics are assigned on low-dimensional manifolds and summarized discriminatively. Consequently, our metric learning method is composed of two components, multiple low-dimensional manifold projections and a set-to-set metric. Given two SPD matrices \(X_{i}\) and \(X_{j}\), the distance \(D^{\varTheta } (X_{i}, X_{j})\) is

$$\begin{aligned} \begin{aligned} D^{\varTheta } (X_{i}, X_{j})&= D_{s}(\mathbf X _{i},\mathbf X _{j}) \\&= D_{s} \Big ( \{f_W^1(X_{i}), \cdots , f_W^m(X_{i})\}, \{f_W^1(X_{j}), \cdots , f_W^m(X_{j})\} \Big ) \\&= h_{M} \Big ( g_A^1 \big (f_W^1(X_{i} ) ,f_W^1( X_{j} ) \big ), \cdots , g_A^m \big (f_W^m(X_{i} ) ,f_W^m( X_{j} ) \big ) \Big ), \end{aligned} \end{aligned}$$
(1)

where \( f_W^k(\cdot ) \) is the k-th low-dimensional manifold projection, and \(\mathbf X _{i} = \{ f_W^k(X_{i}) \}_{k=1}^m\) is the set containing the low-dimensional SPD matrices. The distance \(D^{\varTheta } (X_{i}, X_{j})\) between the original SPD matrices \(X_{i}\) and \(X_{j}\) is transformed into a set-to-set distance \(D_{s}(\mathbf X _{i},\mathbf X _{j})\), where the sub-metric on the k-th low-dimensional manifold is calculated by \(g_{A}^{k}(\cdot ,\cdot )\) and all sub-metrics of the properties are summarized by \(h_{M}(\cdot )\). W, A, and M are the projection parameter, the sub-metric parameter, and the summarization parameter, respectively, and we collect them in a learnable parameter set \(\varTheta = \{W,A,M \}\). The framework of our metric learning method for the SPD matrix is shown in Fig. 1.

The goal of metric learning is to learn the metric parameter \(\varTheta \) from an SPD matrix similar pair set \(\mathcal {S}\), a dissimilar pair set \(\mathcal {D}\), and their labels Y, where \(y_{ij}=1\) means \(X_{i}\) and \(X_{j}\) are similar and \(y_{ij}=0\) otherwise. The metric parameter \(\varTheta \) can be learned by optimizing the loss function \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\), which penalizes both similar sample pairs that are far apart and dissimilar sample pairs that are close. We define \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) in the following subsection. Moreover, we impose manifold constraints on W and M to obtain a more robust metric.

2.2 Multiple Low-Dimensional Manifold Projections

For an SPD matrix sample \(X_{i} \in \mathbb {R}^{n \times n} \), we project \(X_{i}\) to m low-dimensional manifolds to discover the intrinsic properties,

$$\begin{aligned} \begin{aligned}&X_{i}^{1} = f_W^1(X_{i}) = W_{1}^{\top }X_{i}W_{1}\\&... \\&X_{i}^{m} = f_W^m(X_{i}) = W_{m}^{\top }X_{i}W_{m}, \end{aligned} \end{aligned}$$
(2)

where \(X_{i}^{k} \in \mathbb {R}^{p \times p}\) is the k-th low-dimensional SPD matrix, \(k \in \{1,2,\cdots ,m \}\). An SPD matrix \(X_{i}\) is projected to a set \(\mathbf X _{i}=\{X_{i}^{k}\}_{k=1}^{m}\), which contains several low-dimensional SPD matrices.

We expect that each low-dimensional matrix \(X_{i}^{k}\) is guaranteed to still be an SPD matrix capable of capturing the desirable feature distribution, and that any two low-dimensional SPD manifolds are weakly related, so that the low-dimensional SPD matrix set preserves as much information as possible. The learnable parameter \(W_{k}\) needs to be a column full-rank matrix so that \(X_{i}^{k}\) is an SPD matrix as well. Based on the affine invariance [3, 7] of the alpha-beta divergence, we relax the column full-rank constraint on \(W_{k}\) to the semi-orthogonal constraint, i.e., \(W_{k}^{\top }W_{k}=I_{p}\). In order to preserve more information in the set \(\mathbf X _{i}=\{X_{i}^{k}\}_{k=1}^{m}\), we expect any two low-dimensional manifolds to have low relevance. For any \(k \ne l \), we set \(W_{k}^{\top }W_{l} = \mathbf {0}\), where \(\mathbf {0} \in \mathbb {R}^{p \times p}\) is the all-zero matrix, to reduce the relevance between \(X_{i}^{k}\) and \(X_{i}^{l}\). These low-dimensional SPD manifolds can be seen as analogies of different PCA subspaces. A total projection matrix W is composed of all \(W_{k}\), \(W= [W_{1},W_{2},\cdots ,W_{m}] \in \mathbb {R}^{n \times mp}\), in which \(W_{k}\) is a block of W containing p columns. Note that W is a semi-orthogonal matrix, i.e., \(W^{\top }W =I_{mp}\), and lies on the non-Euclidean Stiefel manifold [1].
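To make the projection step concrete, the following sketch (plain NumPy; the helper names and toy sizes are ours, not from the paper) draws a random Stiefel point W via QR, splits it into m blocks, and applies Eq. (2); each projected block is verified to remain SPD.

```python
import numpy as np

def random_semi_orthogonal(n, m, p, seed=0):
    """Sample W in R^{n x mp} with W^T W = I_{mp} (a point on the Stiefel manifold) via QR."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, m * p)))
    return q  # orthonormal columns, so every p-column block W_k is semi-orthogonal

def project_to_submanifolds(X, W, m, p):
    """Return the set {W_k^T X W_k}_{k=1}^m of low-dimensional SPD matrices, Eq. (2)."""
    return [W[:, k * p:(k + 1) * p].T @ X @ W[:, k * p:(k + 1) * p] for k in range(m)]

# toy usage: a 20x20 SPD matrix projected onto m = 4 manifolds of size 5x5
n, m, p = 20, 4, 5
B = np.random.default_rng(1).standard_normal((n, n))
X = B @ B.T + 1e-3 * np.eye(n)                      # SPD by construction
W = random_semi_orthogonal(n, m, p)
X_set = project_to_submanifolds(X, W, m, p)
assert all(np.all(np.linalg.eigvalsh(S) > 0) for S in X_set)  # each block stays SPD
```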

2.3 The Set-to-Set Metric

Based on multiple manifold projections, the distance \(D^{\varTheta } (X_{i}, X_{j})\) of two SPD matrices is transformed into the set-to-set distance \(D_{s}(\mathbf X _{i},\mathbf X _{j})\). Firstly \(\{ g_A^k(\cdot ,\cdot )\}_{k=1}^{m} \) is exploited to compute sub-metrics on m low-dimensional SPD manifolds, and then \(h_M(\cdot )\) is utilized to summarize the m sub-metrics, where A and M are learnable parameters. We use the flexible alpha-beta divergence [3, 4, 22] as the sub-metric \(g_A^k(\cdot ,\cdot )\). For two SPD sets \(\mathbf X _{i}= \{X_{i}^{k}\}_{k=1}^{m}\), \(\mathbf X _{j}= \{X_{j}^{k}\}_{k=1}^{m}\), the distance \(d_{ij}^{k}\) between \(X_{i}^{k}\) and \(X_{j}^{k}\) is computed by the k-th alpha-beta divergence,

$$\begin{aligned} \begin{aligned} d_{ij}^{k}\! =\! g_A^k (X_{i}^{k},X_{j}^{k})\! =\! D^{ \left( \alpha _{k},\beta _{k} \right) } \left( X_{i}^{k}\Vert X_{j}^{k} \right) \! =\! \frac{1}{\alpha _{k} \beta _{k}} \sum _{u=1}^{p} log\left( \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k} + \beta _{k}} \right) , \end{aligned} \end{aligned}$$
(3)

where \(\lambda _{iju}^{k}\) is the u-th eigenvalue of \( X_{i}^{k}(X_{j}^{k})^{-1}\), and \(\left( \alpha _{k}, \beta _{k} \right) \) is the individual parameter of the k-th alpha-beta divergence. We denote all alpha-beta divergence parameters as a matrix \(A= [(\alpha _{1}, \beta _{1}),(\alpha _{2}, \beta _{2}),..., (\alpha _{m}, \beta _{m}) ] \in \mathbb {R}^{m \times 2} \), and a distance vector between \(\mathbf X _{i}\) and \(\mathbf X _{j}\) as \(\mathbf d _{ij} = [d_{ij}^{1}, d_{ij}^{2},..., d_{ij}^{m}] \in \mathbb {R}^{m \times 1}\). Since \(\left( \alpha _{k}, \beta _{k} \right) \) needs to be adaptive to the k-th low-dimensional manifold, we exploit a learnable strategy to update \(\left( \alpha _{k}, \beta _{k} \right) \), which is detailed in the next subsection. After computing all sub-metrics, the distance metric \(D^{\varTheta }(X_{i},X_{j})\) between two original SPD matrices \(X_{i}\) and \(X_{j}\) is formulated as

$$\begin{aligned} \begin{aligned}&D^{\varTheta }(X_{i},X_{j}) = D_{s}(\mathbf X _{i},\mathbf X _{j}) = h_M (d_{ij}^{1}, d_{ij}^{2},..., d_{ij}^{m}) = \mathbf d _{ij}^{\top }M\mathbf d _{ij} \\ =&\sum _{k=1}^{m} \sum _{l=1}^{m} \Big ( D^{ ( \alpha _{k},\beta _{k} ) } \big ( W_{k}^{\top }X_{i}W_{k} \Vert W_{k}^{\top }X_{j}W_{k} \big ) \cdot M_{kl} \cdot D^{ ( \alpha _{l},\beta _{l} ) } \big ( W_{l}^{\top }X_{i}W_{l} \Vert W_{l}^{\top }X_{j}W_{l} \big ) \Big ), \end{aligned} \end{aligned}$$
(4)

where \(M \in \mathbb {R}^{m \times m}\) is the metric parameter, and \(M_{kl}\) is the element of M in the k-th row and l-th column, reflecting the significance and relationship of the properties. If \(X_{i} = X_{j}\), then \(\mathbf d _{ij}\) is a zero vector and \(D^{\varTheta }(X_{i},X_{j}) = 0 \). If \(X_{i} \ne X_{j}\), then \(\mathbf d _{ij}\) is a non-zero vector and \(D^{\varTheta }(X_{i},X_{j})\) should be larger than 0. This nonnegativity requirement forces M to be an SPD matrix, i.e., \(M \in Sym_{m}^{+}\).
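To make Eqs. (3) and (4) concrete, the sketch below (our own NumPy/SciPy code, not the authors' implementation; it reuses `project_to_submanifolds` from the previous sketch) computes each alpha-beta sub-metric from the generalized eigenvalues of \((X_i^k, X_j^k)\) and summarizes them as \(\mathbf d _{ij}^{\top } M \mathbf d _{ij}\); it assumes \(\alpha _k\), \(\beta _k\), and \(\alpha _k + \beta _k\) are all nonzero.

```python
import numpy as np
from scipy.linalg import eigh

def alpha_beta_divergence(Xi, Xj, alpha, beta):
    """D^(alpha,beta)(Xi || Xj) of Eq. (3); assumes alpha, beta, and alpha + beta are nonzero."""
    # the generalized eigenvalues of (Xi, Xj) equal the eigenvalues of Xi Xj^{-1}
    lam = eigh(Xi, Xj, eigvals_only=True)
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta))) / (alpha * beta)

def set_to_set_distance(Xi, Xj, W, A, M, m, p):
    """D^Theta(Xi, Xj) = d_ij^T M d_ij of Eq. (4); A is an (m, 2) array of (alpha_k, beta_k)."""
    Xi_set = project_to_submanifolds(Xi, W, m, p)
    Xj_set = project_to_submanifolds(Xj, W, m, p)
    d_ij = np.array([alpha_beta_divergence(Xi_set[k], Xj_set[k], A[k, 0], A[k, 1])
                     for k in range(m)])
    return d_ij @ M @ d_ij
```

For \(X_i = X_j\) every generalized eigenvalue equals 1, each logarithm vanishes, and the distance is 0, consistent with the discussion above.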

To learn the parameter \(\varTheta \), we formulate the loss function \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) as

$$\begin{aligned} \begin{aligned} \min _{\varTheta } \mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)&= \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} y_{ij} \cdot max \big ( D^{\varTheta }(X_{i},X_{j}) -\zeta _{s},0 \big )^{2}\\&+ \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D} } (1-y_{ij}) \cdot max \big ( \zeta _{d}-D^{\varTheta }(X_{i},X_{j}) ,0 \big )^{2}\\&+ \xi \cdot \gamma (M,M_{0}). \end{aligned} \end{aligned}$$
(5)

We expect the distance between similar samples to be smaller than a threshold \(\zeta _{s}\), and the distance between dissimilar samples to be larger than a threshold \(\zeta _{d}\). We add the two coefficients \(\frac{1}{|\mathcal {S}|}\) and \(\frac{1}{|\mathcal {D}|}\) to address the imbalance between similar and dissimilar sample pairs, where \(|\mathcal {S}|\) and \(|\mathcal {D}|\) are the numbers of pairs in \(\mathcal {S}\) and \(\mathcal {D}\). In addition, we add a regularization term \(\xi \cdot \gamma (M,M_{0})\) on M in Eq. (5). \(\gamma (M,M_{0})=Tr(MM_{0}^{-1})-log det (MM_{0}^{-1}) - m\) is the Bregman divergence [5, 8, 10], where \(Tr(\cdot )\) is the trace of a matrix, \(M_{0}\) is the prior information, and \(\xi \) is the trade-off coefficient.
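Assuming \(\mathcal {S}\) and \(\mathcal {D}\) are given as index lists of purely similar (\(y_{ij}=1\)) and purely dissimilar (\(y_{ij}=0\)) pairs, Eq. (5) can be transcribed as the following sketch (the names are ours; it reuses `set_to_set_distance` from above).

```python
import numpy as np

def logdet_regularizer(M, M0):
    """gamma(M, M0) = Tr(M M0^{-1}) - logdet(M M0^{-1}) - m, the regularizer in Eq. (5)."""
    P = M @ np.linalg.inv(M0)
    return np.trace(P) - np.linalg.slogdet(P)[1] - M.shape[0]

def loss(theta, S_pairs, D_pairs, X, zeta_s, zeta_d, xi, M0, m, p):
    """The loss of Eq. (5); theta = (W, A, M), X is a list of SPD training samples."""
    W, A, M = theta
    dist = lambda i, j: set_to_set_distance(X[i], X[j], W, A, M, m, p)
    sim = sum(max(dist(i, j) - zeta_s, 0.0) ** 2 for i, j in S_pairs) / len(S_pairs)
    dis = sum(max(zeta_d - dist(i, j), 0.0) ** 2 for i, j in D_pairs) / len(D_pairs)
    return sim + dis + xi * logdet_regularizer(M, M0)
```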

2.4 Optimization

\(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) in Eq. (5) is not a convex function with respect to W, A, and M. Accordingly, we apply gradient descent to learn \(\varTheta \). The gradients are computed as follows.

(1) The gradient of \(\mathcal {L}\) with respect to M

The gradient of \(\mathcal {L}\) with respect to M can be computed by

$$\begin{aligned} \begin{aligned} \nabla _{M}(\mathcal {L}) =&\frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \mathbf d _{ij} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \mathbf d _{ij} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } + \xi \cdot \nabla _{M}(\gamma (M,M_{0})), \end{aligned} \end{aligned}$$
(6)

where \(\nabla _{D^{\varTheta }_{ij}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to \(D^{\varTheta }(X_{i},X_{j})\),

$$\begin{aligned} \begin{aligned} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) =2 \cdot y_{ij} \cdot max(D^{\varTheta }_{ij}-\zeta _{s},0)+2 \cdot (y_{ij}-1) \cdot max(\zeta _{d}-D^{\varTheta }_{ij},0), \end{aligned} \end{aligned}$$
(7)

and \(\nabla _{M}(\gamma (M,M_{0}))\) is the gradient of \(\gamma (M,M_{0})\) with respect to M,

$$\begin{aligned} \begin{aligned} \nabla _{M}(\gamma (M,M_{0}))= M_{0}^{-1}-M^{-1}. \end{aligned} \end{aligned}$$
(8)
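A possible transcription of Eqs. (6)–(8) is given below (a sketch with our own data layout: `pairs` bundles, for every training pair, the distance vector \(\mathbf d _{ij}\), the distance \(D^{\varTheta }_{ij}\), the label \(y_{ij}\), and the weight \(\frac{1}{|\mathcal {S}|}\) or \(\frac{1}{|\mathcal {D}|}\)).

```python
import numpy as np

def grad_D(D_ij, y_ij, zeta_s, zeta_d):
    """dL/dD^Theta_ij for one pair, Eq. (7)."""
    return (2 * y_ij * max(D_ij - zeta_s, 0.0)
            + 2 * (y_ij - 1) * max(zeta_d - D_ij, 0.0))

def grad_M(pairs, M, M0, xi, zeta_s, zeta_d):
    """Euclidean gradient of L w.r.t. M, Eqs. (6) and (8).

    `pairs` iterates over (d_ij, D_ij, y_ij, weight) tuples, where weight is
    1/|S| for similar pairs and 1/|D| for dissimilar ones.
    """
    G = xi * (np.linalg.inv(M0) - np.linalg.inv(M))               # Eq. (8)
    for d_ij, D_ij, y_ij, w in pairs:
        G += w * grad_D(D_ij, y_ij, zeta_s, zeta_d) * np.outer(d_ij, d_ij)   # Eq. (6)
    return G
```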

(2) The gradient of \(\mathcal {L}\) with respect to A

The gradients of \(\mathcal {L}\) with respect to \({\alpha }_k\) and \({\beta }_k\) in A are

$$\begin{aligned} \begin{aligned} \nabla _{\alpha _{k}}(\mathcal {L}) = \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\alpha _{k}}(d^{k}_{ij}) + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\alpha _{k}}(d^{k}_{ij}), \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \nabla _{\beta _{k}}(\mathcal {L}) = \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\beta _{k}}(d^{k}_{ij}) + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\beta _{k}}(d^{k}_{ij}). \end{aligned} \end{aligned}$$
(10)

\(\nabla _{d^{k}_{ij}}(\mathcal {L})\) is the k-th element of \(\nabla _{ \mathbf d _{ij}}(\mathcal {L})\) which is the gradient of \(\mathcal {L}\) with respect to \( \mathbf d _{ij}\),

$$\begin{aligned} \begin{aligned} \nabla _{ \mathbf d _{ij}}(\mathcal {L})= \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \cdot \nabla _{ \mathbf d _{ij}}(D^{\varTheta }_{ij}) =\nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } (M^{\top }+M). \end{aligned} \end{aligned}$$
(11)

\(\nabla _{\alpha _{k}}(d^{k}_{ij})\) and \(\nabla _{\beta _{k}}(d^{k}_{ij})\) are the gradients of \(d^{k}_{ij}\) with respect to \(\alpha _{k}\) and \(\beta _{k}\), respectively,

$$\begin{aligned} \begin{aligned} \nabla _{\alpha _{k}}(d^{k}_{ij}) =&\frac{1}{\alpha _{k}^{2} \beta _{k}} \sum _{u=1}^{p} \bigg ( \frac{ \alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} - \alpha _{k}\beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} log \lambda _{iju}^{k}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}}} \\&- \frac{\alpha _{k}}{\alpha _{k}+ \beta _{k}} - log \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k}+\beta _{k}} \bigg ), \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} \nabla _{\beta _{k}}(d^{k}_{ij}) =&\frac{1}{\alpha _{k} \beta _{k}^{2}} \sum _{u=1}^{p} \bigg ( \frac{ \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} + \alpha _{k}\beta _{k} (\lambda _{iju}^{k})^{\beta _{k}} log \lambda _{iju}^{k}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}}} \\&- \frac{\beta _{k}}{\alpha _{k}+ \beta _{k}} - log \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k}+\beta _{k}} \bigg ). \end{aligned} \end{aligned}$$
(13)
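Equations (12) and (13) transcribe directly into code; the sketch below does so and verifies them against a central finite difference of Eq. (3) (the check and all names are ours).

```python
import numpy as np

def grad_alpha_beta(lam, alpha, beta):
    """(d d_ij^k / d alpha_k, d d_ij^k / d beta_k) of Eqs. (12)-(13); lam holds the lambda_{iju}^k."""
    num = alpha * lam**beta + beta * lam**(-alpha)
    common = np.log(num / (alpha + beta))
    g_a = ((alpha * lam**beta - alpha * beta * lam**(-alpha) * np.log(lam)) / num
           - alpha / (alpha + beta) - common).sum() / (alpha**2 * beta)
    g_b = ((beta * lam**(-alpha) + alpha * beta * lam**beta * np.log(lam)) / num
           - beta / (alpha + beta) - common).sum() / (alpha * beta**2)
    return g_a, g_b

# finite-difference sanity check against Eq. (3)
lam = np.array([0.5, 1.3, 2.0])
f = lambda a, b: np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)
eps, a0, b0 = 1e-6, 0.7, 0.4
g_a, g_b = grad_alpha_beta(lam, a0, b0)
assert abs(g_a - (f(a0 + eps, b0) - f(a0 - eps, b0)) / (2 * eps)) < 1e-5
assert abs(g_b - (f(a0, b0 + eps) - f(a0, b0 - eps)) / (2 * eps)) < 1e-5
```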

(3) The gradient of \(\mathcal {L}\) with respect to W

The gradient of \(\mathcal {L}\) with respect to each \(W_{k}\) is

$$\begin{aligned} \begin{aligned} \nabla _{W_{k}}(\mathcal {L})= \sum _{i}^{N} \big ( (X_{i})^{\top } W_{k} \nabla _{X_{i}^{k}}(\mathcal {L}) + X_{i} W_{k} \nabla _{X_{i}^{k}}(\mathcal {L})^{\top } \big ), \end{aligned} \end{aligned}$$
(14)

where N is the number of training samples and \(N=2 \times (|\mathcal {S}|+|\mathcal {D}|)\). \(\nabla _{X_{i}^{k}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to the low-dimensional SPD matrix \(X_{i}^{k}\). The eigenvalue decomposition of \(X_{i}^{k}(X_{j}^{k})^{-1}\) is \(X_{i}^{k}(X_{j}^{k})^{-1} = U_{ij}^{k}\varSigma _{ij}^{k} (U_{ij}^{k})^{\top }\), where \(\varSigma _{ij}^{k}\) is the diagonal matrix of eigenvalues and \(\lambda _{iju}^{k}\) is the u-th eigenvalue. The gradients \(\nabla _{X_{i}^{k}}(\mathcal {L})\) and \(\nabla _{X_{j}^{k}}(\mathcal {L})\) are

$$\begin{aligned} \begin{aligned} \nabla _{X_{i}^{k}}(\mathcal {L})= U_{ij}^{k} \nabla _{\varSigma _{ij}^{k}}(\mathcal {L}) (U_{ij}^{k})^{\top } (X_{i}^{k})^{-\top }, \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned} \nabla _{X_{j}^{k}}(\mathcal {L})= (-1) \cdot (X_{j}^{k})^{-\top } (X_{i}^{k})^{\top } U_{ij}^{k} \nabla _{\varSigma _{ij}^{k}}(\mathcal {L}) (U_{ij}^{k})^{\top } (X_{j}^{k})^{-\top }, \end{aligned} \end{aligned}$$
(16)

where \(\nabla _{\varSigma _{ij}^{k}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to \(\varSigma _{ij}^{k}\). \(\nabla _{\varSigma _{ij}^{k}}(\mathcal {L})\) is a diagonal matrix, whose u-th diagonal element is

$$\begin{aligned} \begin{aligned} \nabla _{\lambda _{iju}^{k}}(\mathcal {L})&= \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\lambda _{iju}^{k}}(d^{k}_{ij}) \\&= \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \frac{1}{\alpha _{k} \beta _{k}} \frac{\alpha _{k} \beta _{k} (\lambda _{iju}^{k})^{\beta _{k}-1} - \alpha _{k} \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}-1}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}}}. \end{aligned} \end{aligned}$$
(17)

Once the gradients \(\nabla _{W}(\mathcal {L})\), \(\nabla _{M}(\mathcal {L})\), and \(\nabla _{A}(\mathcal {L})\) are obtained, the metric parameter set \(\varTheta \) can be updated. A is optimized by standard gradient descent, \(A:= A-\eta \nabla _{A}(\mathcal {L})\), where \(\eta \) is the learning rate. W and M are updated by the Riemannian optimization algorithm [1, 6, 20]. The computation details are presented below,

$$\begin{aligned} \begin{aligned} \left\{ \begin{aligned}&\nabla _{W_{R}}(\mathcal {L}) = \nabla _{W}(\mathcal {L}) - W \frac{1}{2} ( W^{\top }\nabla _{W}(\mathcal {L}) + \nabla _{W}(\mathcal {L})^{\top } W )\\&W:= q \big ( W-\eta \nabla _{W_{R}}(\mathcal {L}) \big ) \end{aligned} \right. , \end{aligned} \end{aligned}$$
(18)

and

$$\begin{aligned} \begin{aligned} \left\{ \begin{aligned}&\nabla _{M_{R}}(\mathcal {L}) = M \frac{1}{2} \big ( \nabla _{M}(\mathcal {L}) + \nabla _{M}(\mathcal {L}) ^{\top } \big ) M \\&M:= M^{\frac{1}{2}} expm \big ( -\eta M^{-\frac{1}{2}} \nabla _{M_{R}}(\mathcal {L}) M^{-\frac{1}{2}} \big ) M^{\frac{1}{2}} \end{aligned} \right. , \end{aligned} \end{aligned}$$
(19)

where \(\nabla _{W_{R}}(\mathcal {L})\) and \(\nabla _{M_{R}}(\mathcal {L})\) are the Riemannian gradients with respect to W and M. In Eq. (18), \(q\left( \cdot \right) \) is the retraction operation mapping the data back to the Stiefel manifold. \(q\left( W \right) \) denotes the Q matrix of the QR decomposition of a matrix W, i.e., for a matrix \(W \in \mathbb {R}^{n \times p}\), \(W = QR\), where \(Q \in \mathbb {R}^{n \times p}\) is a semi-orthogonal matrix and \(R \in \mathbb {R}^{p \times p}\) is an upper triangular matrix. In Eq. (19), \(expm(\cdot )\) is the matrix exponential function. We summarize the learning process of our method in Algorithm 1.

Algorithm 1. The learning process of our SPD matrix metric learning method.
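As a concrete sketch of the two Riemannian updates in Eqs. (18) and (19) (our own code under the stated assumptions; `expm` is SciPy's matrix exponential, the matrix square root is computed by eigendecomposition, and `eta` is the learning rate):

```python
import numpy as np
from scipy.linalg import expm

def spd_sqrt(M):
    """Square root of an SPD matrix via eigendecomposition (keeps the result real)."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def update_W(W, grad_W, eta):
    """Eq. (18): Riemannian gradient on the Stiefel manifold, a step, then QR retraction."""
    rgrad = grad_W - W @ (0.5 * (W.T @ grad_W + grad_W.T @ W))
    Q, _ = np.linalg.qr(W - eta * rgrad)
    return Q

def update_M(M, grad_M, eta):
    """Eq. (19): Riemannian gradient on the SPD manifold and exponential-map update."""
    rgrad = M @ (0.5 * (grad_M + grad_M.T)) @ M
    M_half = spd_sqrt(M)
    M_inv_half = np.linalg.inv(M_half)
    return M_half @ expm(-eta * M_inv_half @ rgrad @ M_inv_half) @ M_half
```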

3 Experiments

To test the effectiveness of our method, we conduct experiments on object recognition, video-based face recognition, action recognition, and texture classification tasks. Four datasets are utilized: the ETH-80 [15], the MSR-Action3D [17], the YouTube Celebrities (YTC) [13], and the UIUC [18] datasets.

3.1 Datasets and Settings

The ETH-80 is an object image dataset that contains 80 image sets from eight categories. Each category consists of 10 image sets, and each set includes 41 images captured from different views. In our experiments, all images of ETH-80 are resized to \(20 \times 20\) and represented by intensity features. The YTC is a video-based face dataset collecting 1910 videos of 47 persons. Face regions are detected from each frame by a cascaded face detector, resized to \(30 \times 30\), histogram-equalized, and represented by gray values. The MSR-Action3D is a 3D action dataset containing 567 videos of 20 actions in total, with 20 skeleton joints per body. In the experiments, each frame is represented by a 120-dimensional feature, i.e., the 3D coordinate differences of the skeleton joints between this frame and its two neighboring frames. The UIUC material dataset contains 216 samples of 18 categories. We resize each image to \(400 \times 400\). Then 128-dimensional dense SIFT features are extracted from each image with a 4-pixel stride and concatenated with 27-dimensional RGB color features from the \(3 \times 3\) patches centered at the locations of the dense SIFT features.

On the ETH-80, YTC, and UIUC datasets, we compute a covariance matrix C to represent each sample and add a small ridge \(\delta I\) to avoid matrix singularity, where \(\delta = 0.001 \times Tr(C)\). On the MSR-Action3D dataset, we first compute the covariance matrix C of size \(120 \times 120\), then transform it to a \(121 \times 121\) Gaussian distribution SPD matrix, \(C= |C|^{-\frac{1}{121}} \left[ \begin{array}{cc} C+\mathbf m \mathbf m ^{\top } & \mathbf m \\ \mathbf m ^{\top } & 1 \end{array} \right] \), as the sample representation, where \(\mathbf m \) is the mean vector of the 120-dimensional features. Following the standard protocols [7, 11, 24, 29], for each category, we randomly select half of the samples for training and the rest for testing on the ETH-80, MSR-Action3D, and UIUC datasets. On the YTC dataset, for each person, three videos are randomly selected as the gallery and six as the probe. In experiments, we set \(\xi = 0.01\), \(M_0 = I_m\), \(\zeta _{s} = 5\), and \(\zeta _{d}=100\).
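As an illustration of this preprocessing (a sketch with hypothetical helper names; the local feature extraction itself is not shown), the covariance representation with the ridge term and the Gaussian distribution SPD matrix can be built as follows.

```python
import numpy as np

def covariance_spd(F):
    """Covariance representation with ridge delta*I, delta = 0.001 * Tr(C); F: (num_features, dim)."""
    C = np.cov(F, rowvar=False)
    return C + 0.001 * np.trace(C) * np.eye(C.shape[0])

def gaussian_spd(F):
    """(d+1)x(d+1) Gaussian distribution SPD matrix |C|^{-1/(d+1)} [[C + m m^T, m], [m^T, 1]]."""
    m = F.mean(axis=0)
    C = np.cov(F, rowvar=False)
    d = C.shape[0]
    block = np.block([[C + np.outer(m, m), m[:, None]],
                      [m[None, :], np.ones((1, 1))]])
    _, logdet = np.linalg.slogdet(C)       # slogdet avoids under/overflow of |C| for d = 120
    return np.exp(-logdet / (d + 1)) * block
```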

3.2 Evaluation

We exploit the 1-NN classifier to evaluate the performance of all metric learning methods. The following methods are evaluated in our experiments: AIM [19], Stein Divergence [21], LEM [2], SPD-DR [7], CDL [29], RSR-ML [9], LEML [11], and \(\alpha \)-CML [31]. AIM, Stein Divergence, and LEM are basic SPD matrix metrics that measure the geodesic distance between SPD matrices. SPD-DR performs dimensionality reduction on the SPD matrix and then applies the AIM or Stein Divergence between samples. CDL is a Riemannian kernel discriminative learning approach on the SPD manifold. RSR-ML employs a sparse coding and dictionary learning scheme on the SPD manifold. LEML and \(\alpha \)-CML are two LEM-based SPD matrix metric learning methods that project SPD matrices to the tangent space and utilize the LEM to compute the distance between them.

Table 1 shows the comparisons on the four visual tasks. In the object recognition task, we set the dimensionality of the low-dimensional manifolds to \(10 \times 10\) and their number to 20, i.e., \(m=20\). We find that LEM performs better than AIM, 93.0 vs. 85.0, showing that the point on the tangent space is more discriminative. If the manifold point is projected to a low-dimensional discriminative space, i.e., the SPD-DR method, the performance can be improved to 96.0, 0.5 higher than LEML. Compared with SPD-DR, our method achieves 97.5, 1.5 higher, which shows the power of discovering discriminative properties and modeling their roles.

In the video-based face recognition task, the dimensionality of the projected manifolds is \(10 \times 10\), and their number is 40. We achieve 49.2 in this task, 2.5 higher than SPD-DR and approximately 10 points higher than the basic SPD matrix metrics. However, due to the large facial variations caused by posture, illumination, scale, and occlusion, the performance of linear metric learning methods is far below that of the nonlinear kernel method CDL. We believe the reason is that the samples are not separable in the original space, and a higher-dimensional RKHS can relieve this problem.

In the action recognition task, the dimensionality of the low-dimensional manifolds is \(8 \times 8\) and their number is 15. The nonlinear kernel methods CDL and RSR-ML achieve 95.4 and 95.0, respectively, outperforming the existing metric learning methods [7, 11, 31]. In this case, our linear method obtains performance comparable to CDL and RSR-ML, achieving 95.8. Besides, Wang et al. [26] show that the nonlinear kernel matrix representation performs better than the linear SPD representation, while our accuracy is 3.1 higher than that of \(\alpha \)-CML, whose result is based on the kernel matrix [26] rather than the Gaussian distribution SPD matrix.

In the texture classification task, we set the dimensionality of the low-dimensional manifolds to \(8 \times 8\), with 18 low-dimensional manifolds in total. The three basic SPD matrix metrics, i.e., AIM, Stein Divergence, and LEM, achieve comparable performance on the UIUC dataset, 35.6, 35.8, and 36.7, respectively. Meanwhile, metric learning methods bring a remarkable improvement. CDL achieves 54.9, and the accuracy of LEML is 53.9. SPD-DR achieves a better performance of 58.3, showing that there is much noise and information redundancy in the original SPD representation. Our method further improves the result to 60.8, showing that it can not only remove the noise and information redundancy but also benefit from discovering discriminative intrinsic properties and their different roles.

Table 1. Accuracies (%) on the four visual tasks. Our method is shown in bold in the last row.

4 Conclusions

In this paper, we have proposed a novel metric learning method on the SPD manifold, which discovers discriminative intrinsic properties and computes the metric considering their different roles. We formulate the SPD manifold metric learning process as a joint optimization over multiple projections and a set-to-set metric. Moreover, we constrain the projection matrix and the metric matrix to lie on manifolds, obtaining a robust metric. Extensive experiments have shown that our method outperforms existing metric learning methods on the SPD manifold. As the whole process of our method is differentiable, in the future we will combine it with deep learning to obtain the desirable nonlinearity.