
1 Introduction

The Symmetric Positive Definite (SPD) matrix has become a prevalent representation in many visual tasks, such as face recognition [12], action recognition [30], and object detection [25]. It utilizes second-order or higher-order statistics to capture the desirable feature distribution. Several works [16, 27, 28] try to model a more discriminative SPD matrix from local features. Meanwhile, calculating the distance metric on the SPD manifold is a crucial problem that comes along with the SPD matrix representation. Due to the non-Euclidean structure of SPD manifolds, the Euclidean metric cannot be applied to them directly. In this paper, we focus on a robust metric learning method on SPD manifolds.

Many efforts have been devoted to SPD matrix metrics, such as the Affine Invariant Metric (AIM) [19], Log-Euclidean Metric (LEM) [2], Bregman divergence [14], Stein divergence [21], and alpha-beta divergence [3, 4, 22]. Given a concrete metric, metric learning aims at learning proper metric parameters that keep similar pairs close and dissimilar pairs apart. Most existing metric learning methods on the SPD manifold learn a discriminative metric on the tangent Euclidean space [11, 23, 31].

However, how to learn a proper SPD matrix metric is still a challenging problem. The SPD matrix is aggregated from local features and contains different essential intrinsic properties. Existing SPD matrix metric learning methods [11, 23, 31] regard an SPD matrix merely as a global representation and exploit a direct metric on the complex manifold, ignoring the different roles of the intrinsic properties in the SPD matrix. It is unsuitable to treat intrinsic properties equally when they play different roles, e.g., have different distributions or significance. Therefore, we argue that an SPD matrix metric that models the different roles of intrinsic properties will achieve better performance.

In this paper, a novel metric learning method on SPD manifolds is proposed to solve the issues mentioned above. We first discover the intrinsic properties of an SPD matrix and then compute the SPD matrix metric considering their different roles. In particular, our method jointly learns multiple low-dimensional projections and a set-to-set metric. As the property discovery can be seen as feature extraction, we apply multiple low-dimensional manifold projections to the SPD matrix to discover discriminative intrinsic properties. Thus, the distance metric between two original SPD matrices is transformed into the distance metric between the two sets that contain the corresponding projected low-dimensional SPD matrices. The alpha-beta divergence is a learnable SPD matrix metric, so it is applied in our set-to-set metric to be adaptive to the intrinsic properties. We assign multiple alpha-beta divergences on different low-dimensional manifolds as sub-metrics and summarize these sub-metrics discriminatively as the SPD matrix metric. In this way, the different roles of intrinsic properties are involved in the SPD matrix metric. As shown by experiments, the proposed learnable metric is extremely helpful for capturing meaningful nearest neighbors of different original SPD matrices.

In summary, our contributions are three-fold.

(1) We propose a robust SPD matrix metric learning method that discovers discriminative intrinsic properties and models their different roles in the metric computation.

(2) We formulate the metric learning as a two-component joint optimization problem, i.e., multiple low-dimensional manifold projections and a set-to-set metric are learned jointly.

(3) We introduce a manifold optimization method that learns the metric parameters and guarantees the robustness of the proposed metric.

2 The Proposed Method

Throughout this paper, scalars are denoted by lower-case letters, vectors by bold lower-case letters, matrices by upper-case letters, and sets by bold upper-case letters.

Fig. 1. The flowchart of our SPD matrix metric learning method. Left: multiple projections \(f_W^1\), \(f_W^2\), and \(f_W^3\) used to discover intrinsic properties; Right: the computation of the set-to-set distance \(D_s\), which considers the different roles of intrinsic properties.

2.1 Problem Definition

This work aims to discover discriminative intrinsic properties in an SPD matrix and compute the distance of SPD matrices considering different roles of discovered properties. The property discovery can be regarded as a feature extraction process that projects an original SPD matrix to multiple low-dimensional SPD manifolds to form a set of the low-dimensional SPD matrices. We propose a set-to-set metric to consider different roles of intrinsic properties. Individual sub-metrics are assigned on low-dimensional manifolds and summarized discriminatively. Consequently, our metric learning method is composed of two components, multiple low-dimensional manifold projections and a set-to-set metric. Given two SPD matrices \(X_{i}\) and \(X_{j}\), the distance \(D^{\varTheta } (X_{i}, X_{j})\) is

$$\begin{aligned} \begin{aligned} D^{\varTheta } (X_{i}, X_{j})&= D_{s}(\mathbf X _{i},\mathbf X _{j}) \\&= D_{s} \Big ( \{f_W^1(X_{i}), \cdots , f_W^m(X_{i})\}, \{f_W^1(X_{j}), \cdots , f_W^m(X_{j})\} \Big ) \\&= h_{M} \Big ( g_A^1 \big (f_W^1(X_{i} ) ,f_W^1( X_{j} ) \big ), \cdots , g_A^m \big (f_W^m(X_{i} ) ,f_W^m( X_{j} ) \big ) \Big ), \end{aligned} \end{aligned}$$
(1)

where \( f_W^k(\cdot ) \) is the k-th low-dimensional manifold projection, and \(\mathbf X _{i} = \{ f_W^k(X_{i}) \}_{k=1}^m\) is the set containing the low-dimensional SPD matrices. The distance \(D^{\varTheta } (X_{i}, X_{j})\) between the original SPD matrices \(X_{i}\) and \(X_{j}\) is transformed into a set-to-set distance \(D_{s}(\mathbf X _{i},\mathbf X _{j})\), where the sub-metric on the k-th low-dimensional manifold is calculated by \(g_{A}^{k}(\cdot ,\cdot )\) and all sub-metrics of the properties are summarized by \(h_{M}(\cdot )\). W, A, and M are the projection parameter, the sub-metric parameter, and the summarization parameter, respectively, and we collect them in a learnable parameter set \(\varTheta = \{W,A,M \}\). The framework of our metric learning method for the SPD matrix is shown in Fig. 1.

The goal of metric learning is to learn the metric parameter \(\varTheta \) from an SPD matrix similar pair set \(\mathcal {S}\), a dissimilar pair set \(\mathcal {D}\), and their labels Y, where \(y_{ij}=1\) means \(X_{i}\) and \(X_{j}\) are similar and \(y_{ij}=0\) otherwise. The metric parameter \(\varTheta \) can be learned by optimizing the loss function \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\), which penalizes both similar sample pairs that are far apart and dissimilar sample pairs that are close. We define \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) in the following subsection. Moreover, we impose manifold constraints on W and M to obtain a more robust metric.

2.2 Multiple Low-Dimensional Manifold Projections

For an SPD matrix sample \(X_{i} \in \mathbb {R}^{n \times n} \), we project \(X_{i}\) to m low-dimensional manifolds to discover the intrinsic properties,

$$\begin{aligned} \begin{aligned}&X_{i}^{1} = f_W^1(X_{i}) = W_{1}^{\top }X_{i}W_{1}\\&... \\&X_{i}^{m} = f_W^m(X_{i}) = W_{m}^{\top }X_{i}W_{m}, \end{aligned} \end{aligned}$$
(2)

where \(X_{i}^{k} \in \mathbb {R}^{p \times p}\) is the k-th low-dimensional SPD matrix, \(k \in \{1,2,\cdots ,m \}\). An SPD matrix \(X_{i}\) is projected to a set \(\mathbf X _{i}=\{X_{i}^{k}\}_{k=1}^{m}\), which contains several low-dimensional SPD matrices.

We expect that each low-dimensional matrix \(X_{i}^{k}\) is guaranteed to still be an SPD matrix capable of capturing the desirable feature distribution, and that any two low-dimensional SPD manifolds are weakly related, so that the low-dimensional SPD matrix set preserves as much information as possible. The learnable parameter \(W_{k}\) needs to be a column full-rank matrix so that \(X_{i}^{k}\) is an SPD matrix as well. Based on the affine invariance [3, 7] of the alpha-beta divergence, we relax the column full-rank constraint on \(W_{k}\) to the semi-orthogonal constraint, i.e., \(W_{k}^{\top }W_{k}=I_{p}\). In order to preserve more information in the set \(\mathbf X _{i}=\{X_{i}^{k}\}_{k=1}^{m}\), we expect any two low-dimensional manifolds to have low relevance. For any \(k \ne l \), we set \(W_{k}^{\top }W_{l} = \mathbf {0}\), where \(\mathbf {0} \in \mathbb {R}^{p \times p}\) is the all-zero matrix, to reduce the relevance between \(X_{i}^{k}\) and \(X_{i}^{l}\). These low-dimensional SPD manifolds can be seen as analogies of different PCA subspaces. A total projection matrix W is composed of all \(W_{k}\), \(W= [W_{1},W_{2},\cdots ,W_{m}] \in \mathbb {R}^{n \times mp}\), in which \(W_{k}\) is a block of W containing p columns. Note that W is a semi-orthogonal matrix, i.e., \(W^{\top }W =I_{mp}\), and lies on the non-Euclidean Stiefel manifold [1].
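To make the projection step concrete, the following sketch (plain NumPy; the helper names and toy sizes are ours, not from the paper) draws a random Stiefel point W via QR, splits it into m blocks, and applies Eq. (2); each projected block is verified to remain SPD.

```python
import numpy as np

def random_semi_orthogonal(n, m, p, seed=0):
    """Sample W in R^{n x mp} with W^T W = I_{mp} (a point on the Stiefel manifold) via QR."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, m * p)))
    return q  # orthonormal columns, so every p-column block W_k is semi-orthogonal

def project_to_submanifolds(X, W, m, p):
    """Return the set {W_k^T X W_k}_{k=1}^m of low-dimensional SPD matrices, Eq. (2)."""
    return [W[:, k * p:(k + 1) * p].T @ X @ W[:, k * p:(k + 1) * p] for k in range(m)]

# toy usage: a 20x20 SPD matrix projected onto m = 4 manifolds of size 5x5
n, m, p = 20, 4, 5
B = np.random.default_rng(1).standard_normal((n, n))
X = B @ B.T + 1e-3 * np.eye(n)                      # SPD by construction
W = random_semi_orthogonal(n, m, p)
X_set = project_to_submanifolds(X, W, m, p)
assert all(np.all(np.linalg.eigvalsh(S) > 0) for S in X_set)  # each block stays SPD
```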

2.3 The Set-to-Set Metric

Based on multiple manifold projections, the distance \(D^{\varTheta } (X_{i}, X_{j})\) of two SPD matrices is transformed into the set-to-set distance \(D_{s}(\mathbf X _{i},\mathbf X _{j})\). Firstly \(\{ g_A^k(\cdot ,\cdot )\}_{k=1}^{m} \) is exploited to compute sub-metrics on m low-dimensional SPD manifolds, and then \(h_M(\cdot )\) is utilized to summarize the m sub-metrics, where A and M are learnable parameters. We use the flexible alpha-beta divergence [3, 4, 22] as the sub-metric \(g_A^k(\cdot ,\cdot )\). For two SPD sets \(\mathbf X _{i}= \{X_{i}^{k}\}_{k=1}^{m}\), \(\mathbf X _{j}= \{X_{j}^{k}\}_{k=1}^{m}\), the distance \(d_{ij}^{k}\) between \(X_{i}^{k}\) and \(X_{j}^{k}\) is computed by the k-th alpha-beta divergence,

$$\begin{aligned} \begin{aligned} d_{ij}^{k}\! =\! g_A^k (X_{i}^{k},X_{j}^{k})\! =\! D^{ \left( \alpha _{k},\beta _{k} \right) } \left( X_{i}^{k}\Vert X_{j}^{k} \right) \! =\! \frac{1}{\alpha _{k} \beta _{k}} \sum _{u=1}^{p} log\left( \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k} + \beta _{k}} \right) , \end{aligned} \end{aligned}$$
(3)

where \(\lambda _{iju}^{k}\) is the u-th eigenvalue of \( X_{i}^{k}(X_{j}^{k})^{-1}\), and \(\left( \alpha _{k}, \beta _{k} \right) \) is the individual parameter of the k-th alpha-beta divergence. We denote all alpha-beta divergence parameters as a matrix \(A= [(\alpha _{1}, \beta _{1}),(\alpha _{2}, \beta _{2}),..., (\alpha _{m}, \beta _{m}) ] \in \mathbb {R}^{m \times 2} \), and a distance vector between \(\mathbf X _{i}\) and \(\mathbf X _{j}\) as \(\mathbf d _{ij} = [d_{ij}^{1}, d_{ij}^{2},..., d_{ij}^{m}] \in \mathbb {R}^{m \times 1}\). Since \(\left( \alpha _{k}, \beta _{k} \right) \) needs to be adaptive to the k-th low-dimensional manifold, we exploit a learnable strategy to update \(\left( \alpha _{k}, \beta _{k} \right) \), which is detailed in the next subsection. After computing all sub-metrics, the distance metric \(D^{\varTheta }(X_{i},X_{j})\) between two original SPD matrices \(X_{i}\) and \(X_{j}\) is formulated as

$$\begin{aligned} \begin{aligned}&D^{\varTheta }(X_{i},X_{j}) = D_{s}(\mathbf X _{i},\mathbf X _{j}) = h_M (d_{ij}^{1}, d_{ij}^{2},..., d_{ij}^{m}) = \mathbf d _{ij}^{\top }M\mathbf d _{ij} \\ =&\sum _{k=1}^{m} \sum _{l=1}^{m} \Big ( D^{ ( \alpha _{k},\beta _{k} ) } \big ( W_{k}^{\top }X_{i}W_{k} \Vert W_{k}^{\top }X_{j}W_{k} \big ) \cdot M_{kl} \cdot D^{ ( \alpha _{l},\beta _{l} ) } \big ( W_{l}^{\top }X_{i}W_{l} \Vert W_{l}^{\top }X_{j}W_{l} \big ) \Big ), \end{aligned} \end{aligned}$$
(4)

where \(M \in \mathbb {R}^{m \times m}\) is the metric parameter, and \(M_{kl}\) is the element of M in the k-th row and l-th column, reflecting the significance and relationship of the properties. If \(X_{i} = X_{j}\), then \(\mathbf d _{ij}\) is a zero vector and \(D^{\varTheta }(X_{i},X_{j}) = 0 \). If \(X_{i} \ne X_{j}\), then \(\mathbf d _{ij}\) is a non-zero vector and \(D^{\varTheta }(X_{i},X_{j})\) should be larger than 0. This nonnegativity requirement forces M to be an SPD matrix, i.e., \(M \in Sym_{m}^{+}\).
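To make Eqs. (3) and (4) concrete, the sketch below (our own NumPy/SciPy code, not the authors' implementation; it reuses `project_to_submanifolds` from the previous sketch) computes each alpha-beta sub-metric from the generalized eigenvalues of \((X_i^k, X_j^k)\) and summarizes them as \(\mathbf d _{ij}^{\top } M \mathbf d _{ij}\); it assumes \(\alpha _k\), \(\beta _k\), and \(\alpha _k + \beta _k\) are all nonzero.

```python
import numpy as np
from scipy.linalg import eigh

def alpha_beta_divergence(Xi, Xj, alpha, beta):
    """D^(alpha,beta)(Xi || Xj) of Eq. (3); assumes alpha, beta, and alpha + beta are nonzero."""
    # the generalized eigenvalues of (Xi, Xj) equal the eigenvalues of Xi Xj^{-1}
    lam = eigh(Xi, Xj, eigvals_only=True)
    return np.sum(np.log((alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta))) / (alpha * beta)

def set_to_set_distance(Xi, Xj, W, A, M, m, p):
    """D^Theta(Xi, Xj) = d_ij^T M d_ij of Eq. (4); A is an (m, 2) array of (alpha_k, beta_k)."""
    Xi_set = project_to_submanifolds(Xi, W, m, p)
    Xj_set = project_to_submanifolds(Xj, W, m, p)
    d_ij = np.array([alpha_beta_divergence(Xi_set[k], Xj_set[k], A[k, 0], A[k, 1])
                     for k in range(m)])
    return d_ij @ M @ d_ij
```

For \(X_i = X_j\) every generalized eigenvalue equals 1, each logarithm vanishes, and the distance is 0, consistent with the discussion above.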

To learn the parameter \(\varTheta \), we formulate the loss function \(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) as

$$\begin{aligned} \begin{aligned} \min _{\varTheta } \mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)&= \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} y_{ij} \cdot max \big ( D^{\varTheta }(X_{i},X_{j}) -\zeta _{s},0 \big )^{2}\\&+ \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D} } (1-y_{ij}) \cdot max \big ( \zeta _{d}-D^{\varTheta }(X_{i},X_{j}) ,0 \big )^{2}\\&+ \xi \cdot \gamma (M,M_{0}). \end{aligned} \end{aligned}$$
(5)

We expect the distance between similar samples to be smaller than a threshold \(\zeta _{s}\), and the distance between dissimilar samples to be larger than a threshold \(\zeta _{d}\). We add the two coefficients \(\frac{1}{|\mathcal {S}|}\) and \(\frac{1}{|\mathcal {D}|}\) to address the imbalance between similar and dissimilar sample pairs, where \(|\mathcal {S}|\) and \(|\mathcal {D}|\) are the numbers of pairs in \(\mathcal {S}\) and \(\mathcal {D}\). In addition, we add a regularization term \(\xi \cdot \gamma (M,M_{0})\) on M in Eq. (5). \(\gamma (M,M_{0})=Tr(MM_{0}^{-1})-log det (MM_{0}^{-1}) - m\) is the Bregman divergence [5, 8, 10], where \(Tr(\cdot )\) is the trace of a matrix, \(M_{0}\) is the prior information, and \(\xi \) is the trade-off coefficient.
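Assuming \(\mathcal {S}\) and \(\mathcal {D}\) are given as index lists of purely similar (\(y_{ij}=1\)) and purely dissimilar (\(y_{ij}=0\)) pairs, Eq. (5) can be transcribed as the following sketch (the names are ours; it reuses `set_to_set_distance` from above).

```python
import numpy as np

def logdet_regularizer(M, M0):
    """gamma(M, M0) = Tr(M M0^{-1}) - logdet(M M0^{-1}) - m, the regularizer in Eq. (5)."""
    P = M @ np.linalg.inv(M0)
    return np.trace(P) - np.linalg.slogdet(P)[1] - M.shape[0]

def loss(theta, S_pairs, D_pairs, X, zeta_s, zeta_d, xi, M0, m, p):
    """The loss of Eq. (5); theta = (W, A, M), X is a list of SPD training samples."""
    W, A, M = theta
    dist = lambda i, j: set_to_set_distance(X[i], X[j], W, A, M, m, p)
    sim = sum(max(dist(i, j) - zeta_s, 0.0) ** 2 for i, j in S_pairs) / len(S_pairs)
    dis = sum(max(zeta_d - dist(i, j), 0.0) ** 2 for i, j in D_pairs) / len(D_pairs)
    return sim + dis + xi * logdet_regularizer(M, M0)
```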

2.4 Optimization

\(\mathcal {L}(\varTheta ,\mathcal {S},\mathcal {D},Y)\) in Eq. (5) is not a convex function with respect to W, A, and M. Accordingly, we apply gradient descent to learn \(\varTheta \). The gradients are computed as follows.

(1) The gradient of \(\mathcal {L}\) with respect to M

The gradient of \(\mathcal {L}\) with respect to M can be computed by

$$\begin{aligned} \begin{aligned} \nabla _{M}(\mathcal {L}) =&\frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \mathbf d _{ij} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \mathbf d _{ij} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } + \xi \cdot \nabla _{M}(\gamma (M,M_{0})), \end{aligned} \end{aligned}$$
(6)

where \(\nabla _{D^{\varTheta }_{ij}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to \(D^{\varTheta }(X_{i},X_{j})\),

$$\begin{aligned} \begin{aligned} \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) =2 \cdot y_{ij} \cdot max(D^{\varTheta }_{ij}-\zeta _{s},0)+2 \cdot (y_{ij}-1) \cdot max(\zeta _{d}-D^{\varTheta }_{ij},0), \end{aligned} \end{aligned}$$
(7)

and \(\nabla _{M}(\gamma (M,M_{0}))\) is the gradient of \(\gamma (M,M_{0})\) with respect to M,

$$\begin{aligned} \begin{aligned} \nabla _{M}(\gamma (M,M_{0}))= M_{0}^{-1}-M^{-1}. \end{aligned} \end{aligned}$$
(8)
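A possible transcription of Eqs. (6)–(8) is given below (a sketch with our own data layout: `pairs` bundles, for every training pair, the distance vector \(\mathbf d _{ij}\), the distance \(D^{\varTheta }_{ij}\), the label \(y_{ij}\), and the weight \(\frac{1}{|\mathcal {S}|}\) or \(\frac{1}{|\mathcal {D}|}\)).

```python
import numpy as np

def grad_D(D_ij, y_ij, zeta_s, zeta_d):
    """dL/dD^Theta_ij for one pair, Eq. (7)."""
    return (2 * y_ij * max(D_ij - zeta_s, 0.0)
            + 2 * (y_ij - 1) * max(zeta_d - D_ij, 0.0))

def grad_M(pairs, M, M0, xi, zeta_s, zeta_d):
    """Euclidean gradient of L w.r.t. M, Eqs. (6) and (8).

    `pairs` iterates over (d_ij, D_ij, y_ij, weight) tuples, where weight is
    1/|S| for similar pairs and 1/|D| for dissimilar ones.
    """
    G = xi * (np.linalg.inv(M0) - np.linalg.inv(M))               # Eq. (8)
    for d_ij, D_ij, y_ij, w in pairs:
        G += w * grad_D(D_ij, y_ij, zeta_s, zeta_d) * np.outer(d_ij, d_ij)   # Eq. (6)
    return G
```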

(2) The gradient of \(\mathcal {L}\) with respect to A

The gradients of \(\mathcal {L}\) with respect to \({\alpha }_k\) and \({\beta }_k\) in A are

$$\begin{aligned} \begin{aligned} \nabla _{\alpha _{k}}(\mathcal {L}) = \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\alpha _{k}}(d^{k}_{ij}) + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\alpha _{k}}(d^{k}_{ij}), \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} \nabla _{\beta _{k}}(\mathcal {L}) = \frac{1}{|\mathcal {S}|}\sum _{{i,j}\in \mathcal {S}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\beta _{k}}(d^{k}_{ij}) + \frac{1}{|\mathcal {D}|}\sum _{{i,j}\in \mathcal {D}} \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\beta _{k}}(d^{k}_{ij}). \end{aligned} \end{aligned}$$
(10)

\(\nabla _{d^{k}_{ij}}(\mathcal {L})\) is the k-th element of \(\nabla _{ \mathbf d _{ij}}(\mathcal {L})\) which is the gradient of \(\mathcal {L}\) with respect to \( \mathbf d _{ij}\),

$$\begin{aligned} \begin{aligned} \nabla _{ \mathbf d _{ij}}(\mathcal {L})= \nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \cdot \nabla _{ \mathbf d _{ij}}(D^{\varTheta }_{ij}) =\nabla _{D^{\varTheta }_{ij}}(\mathcal {L}) \mathbf d _{ij}^{\top } (M^{\top }+M). \end{aligned} \end{aligned}$$
(11)

\(\nabla _{\alpha _{k}}(d^{k}_{ij})\) and \(\nabla _{\beta _{k}}(d^{k}_{ij})\) are the gradients of \(d^{k}_{ij}\) with respect to \(\alpha _{k}\) and \(\beta _{k}\), respectively,

$$\begin{aligned} \begin{aligned} \nabla _{\alpha _{k}}(d^{k}_{ij}) =&\frac{1}{\alpha _{k}^{2} \beta _{k}} \sum _{u=1}^{p} \bigg ( \frac{ \alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} - \alpha _{k}\beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} log \lambda _{iju}^{k}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}}} \\&- \frac{\alpha _{k}}{\alpha _{k}+ \beta _{k}} - log \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k}+\beta _{k}} \bigg ), \end{aligned} \end{aligned}$$
(12)
$$\begin{aligned} \begin{aligned} \nabla _{\beta _{k}}(d^{k}_{ij}) =&\frac{1}{\alpha _{k} \beta _{k}^{2}} \sum _{u=1}^{p} \bigg ( \frac{ \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}} + \alpha _{k}\beta _{k} (\lambda _{iju}^{k})^{\beta _{k}} log \lambda _{iju}^{k}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}}} \\&- \frac{\beta _{k}}{\alpha _{k}+ \beta _{k}} - log \frac{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k}(\lambda _{iju}^{k})^{-\alpha _{k}} }{\alpha _{k}+\beta _{k}} \bigg ). \end{aligned} \end{aligned}$$
(13)
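Equations (12) and (13) transcribe directly into code; the sketch below does so and verifies them against a central finite difference of Eq. (3) (the check and all names are ours).

```python
import numpy as np

def grad_alpha_beta(lam, alpha, beta):
    """(d d_ij^k / d alpha_k, d d_ij^k / d beta_k) of Eqs. (12)-(13); lam holds the lambda_{iju}^k."""
    num = alpha * lam**beta + beta * lam**(-alpha)
    common = np.log(num / (alpha + beta))
    g_a = ((alpha * lam**beta - alpha * beta * lam**(-alpha) * np.log(lam)) / num
           - alpha / (alpha + beta) - common).sum() / (alpha**2 * beta)
    g_b = ((beta * lam**(-alpha) + alpha * beta * lam**beta * np.log(lam)) / num
           - beta / (alpha + beta) - common).sum() / (alpha * beta**2)
    return g_a, g_b

# finite-difference sanity check against Eq. (3)
lam = np.array([0.5, 1.3, 2.0])
f = lambda a, b: np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)
eps, a0, b0 = 1e-6, 0.7, 0.4
g_a, g_b = grad_alpha_beta(lam, a0, b0)
assert abs(g_a - (f(a0 + eps, b0) - f(a0 - eps, b0)) / (2 * eps)) < 1e-5
assert abs(g_b - (f(a0, b0 + eps) - f(a0, b0 - eps)) / (2 * eps)) < 1e-5
```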

(3) The gradient of \(\mathcal {L}\) with respect to W

The gradient of \(\mathcal {L}\) with respect to each \(W_{k}\) is

$$\begin{aligned} \begin{aligned} \nabla _{W_{k}}(\mathcal {L})= \sum _{i}^{N} \big ( (X_{i})^{\top } W_{k} \nabla _{X_{i}^{k}}(\mathcal {L}) + X_{i} W_{k} \nabla _{X_{i}^{k}}(\mathcal {L})^{\top } \big ), \end{aligned} \end{aligned}$$
(14)

where N is the number of training samples and \(N=2 \times (|\mathcal {S}|+|\mathcal {D}|)\). \(\nabla _{X_{i}^{k}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to the low-dimensional SPD matrix \(X_{i}^{k}\). The eigenvalue decomposition of \(X_{i}^{k}(X_{j}^{k})^{-1}\) is \(X_{i}^{k}(X_{j}^{k})^{-1} = U_{ij}^{k}\varSigma _{ij}^{k} (U_{ij}^{k})^{\top }\), where \(\varSigma _{ij}^{k}\) is the diagonal matrix of eigenvalues and \(\lambda _{iju}^{k}\) is the u-th eigenvalue. The gradients \(\nabla _{X_{i}^{k}}(\mathcal {L})\) and \(\nabla _{X_{j}^{k}}(\mathcal {L})\) are

$$\begin{aligned} \begin{aligned} \nabla _{X_{i}^{k}}(\mathcal {L})= U_{ij}^{k} \nabla _{\varSigma _{ij}^{k}}(\mathcal {L}) (U_{ij}^{k})^{\top } (X_{i}^{k})^{-\top }, \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned} \nabla _{X_{j}^{k}}(\mathcal {L})= (-1) \cdot (X_{j}^{k})^{-\top } (X_{i}^{k})^{\top } U_{ij}^{k} \nabla _{\varSigma _{ij}^{k}}(\mathcal {L}) (U_{ij}^{k})^{\top } (X_{j}^{k})^{-\top }, \end{aligned} \end{aligned}$$
(16)

where \(\nabla _{\varSigma _{ij}^{k}}(\mathcal {L})\) is the gradient of \(\mathcal {L}\) with respect to \(\varSigma _{ij}^{k}\). \(\nabla _{\varSigma _{ij}^{k}}(\mathcal {L})\) is a diagonal matrix, whose u-th diagonal element is

$$\begin{aligned} \begin{aligned} \nabla _{\lambda _{iju}^{k}}(\mathcal {L})&= \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \nabla _{\lambda _{iju}^{k}}(d^{k}_{ij}) \\&= \nabla _{d^{k}_{ij}}(\mathcal {L}) \cdot \frac{1}{\alpha _{k} \beta _{k}} \frac{\alpha _{k} \beta _{k} (\lambda _{iju}^{k})^{\beta _{k}-1} - \alpha _{k} \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}-1}}{\alpha _{k} (\lambda _{iju}^{k})^{\beta _{k}} + \beta _{k} (\lambda _{iju}^{k})^{-\alpha _{k}}}. \end{aligned} \end{aligned}$$
(17)

Once the gradients \(\nabla _{W}(\mathcal {L})\), \(\nabla _{M}(\mathcal {L})\), and \(\nabla _{A}(\mathcal {L})\) are obtained, the metric parameter set \(\varTheta \) can be updated. A is optimized by standard gradient descent, \(A:= A-\eta \nabla _{A}(\mathcal {L})\), where \(\eta \) is the learning rate. W and M are updated by the Riemannian optimization algorithm [1, 6, 20]. The computation details are presented below,

$$\begin{aligned} \begin{aligned} \left\{ \begin{aligned}&\nabla _{W_{R}}(\mathcal {L}) = \nabla _{W}(\mathcal {L}) - W \frac{1}{2} ( W^{\top }\nabla _{W}(\mathcal {L}) + \nabla _{W}(\mathcal {L})^{\top } W )\\&W:= q \big ( W-\eta \nabla _{W_{R}}(\mathcal {L}) \big ) \end{aligned} \right. , \end{aligned} \end{aligned}$$
(18)

and

$$\begin{aligned} \begin{aligned} \left\{ \begin{aligned}&\nabla _{M_{R}}(\mathcal {L}) = M \frac{1}{2} \big ( \nabla _{M}(\mathcal {L}) + \nabla _{M}(\mathcal {L}) ^{\top } \big ) M \\&M:= M^{\frac{1}{2}} expm \big ( -\eta M^{-\frac{1}{2}} \nabla _{M_{R}}(\mathcal {L}) M^{-\frac{1}{2}} \big ) M^{\frac{1}{2}} \end{aligned} \right. , \end{aligned} \end{aligned}$$
(19)

where \(\nabla _{W_{R}}(\mathcal {L})\) and \(\nabla _{M_{R}}(\mathcal {L})\) are the Riemannian gradients with respect to W and M. In Eq. (18), \(q\left( \cdot \right) \) is the retraction operation mapping the data back to the Stiefel manifold. \(q\left( W \right) \) denotes the Q matrix of the QR decomposition of a matrix W, i.e., for a matrix \(W \in \mathbb {R}^{n \times p}\), \(W = QR\), where \(Q \in \mathbb {R}^{n \times p}\) is a semi-orthogonal matrix and \(R \in \mathbb {R}^{p \times p}\) is an upper triangular matrix. In Eq. (19), \(expm(\cdot )\) is the matrix exponential function. We summarize the learning process of our method in Algorithm 1.

Algorithm 1. The learning process of our SPD matrix metric learning method.
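As a concrete sketch of the two Riemannian updates in Eqs. (18) and (19) (our own code under the stated assumptions; `expm` is SciPy's matrix exponential, the matrix square root is computed by eigendecomposition, and `eta` is the learning rate):

```python
import numpy as np
from scipy.linalg import expm

def spd_sqrt(M):
    """Square root of an SPD matrix via eigendecomposition (keeps the result real)."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def update_W(W, grad_W, eta):
    """Eq. (18): Riemannian gradient on the Stiefel manifold, a step, then QR retraction."""
    rgrad = grad_W - W @ (0.5 * (W.T @ grad_W + grad_W.T @ W))
    Q, _ = np.linalg.qr(W - eta * rgrad)
    return Q

def update_M(M, grad_M, eta):
    """Eq. (19): Riemannian gradient on the SPD manifold and exponential-map update."""
    rgrad = M @ (0.5 * (grad_M + grad_M.T)) @ M
    M_half = spd_sqrt(M)
    M_inv_half = np.linalg.inv(M_half)
    return M_half @ expm(-eta * M_inv_half @ rgrad @ M_inv_half) @ M_half
```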

3 Experiments

To test the effectiveness of our method, we conduct experiments on object recognition, video-based face recognition, action recognition, and texture classification tasks. Four datasets are utilized: the ETH-80 [15], the MSR-Action3D [17], the YouTube Celebrities (YTC) [13], and the UIUC [18] datasets.

3.1 Datasets and Settings

The ETH-80 is an object image dataset that contains 80 image sets from eight categories. Each category consists of 10 image sets, and each set includes 41 images captured from different views. In our experiments, all images of ETH-80 are resized to \(20 \times 20\) and represented by intensity features. The YTC is a video-based face dataset collecting 1910 videos of 47 persons. Face regions are detected from each frame by a cascaded face detector, resized to \(30 \times 30\), histogram-equalized, and represented by gray values. The MSR-Action3D is a 3D action dataset containing 567 videos of 20 actions in total, with 20 skeleton joints per body. In the experiments, each frame is represented by a 120-dimensional feature, i.e., the 3D coordinate differences of the skeleton joints between this frame and its two neighboring frames. The UIUC material dataset contains 216 samples of 18 categories. We resize each image to \(400 \times 400\). Then 128-dimensional dense SIFT features are extracted from each image with a 4-pixel stride and concatenated with 27-dimensional RGB color features from the \(3 \times 3\) patches centered at the locations of the dense SIFT features.

On the ETH-80, YTC, and UIUC datasets, we compute a covariance matrix C to represent each sample and add a small ridge \(\delta I\) to avoid matrix singularity, where \(\delta = 0.001 \times Tr(C)\). On the MSR-Action3D dataset, we first compute the covariance matrix C of size \(120 \times 120\), then transform it to a \(121 \times 121\) Gaussian distribution SPD matrix, \(C= |C|^{-\frac{1}{121}} \left[ \begin{array}{cc} C+\mathbf m \mathbf m ^{\top } & \mathbf m \\ \mathbf m ^{\top } & 1 \end{array} \right] \), as the sample representation, where \(\mathbf m \) is the mean vector of the 120-dimensional features. Following the standard protocols [7, 11, 24, 29], for each category, we randomly select half of the samples for training and the rest for testing on the ETH-80, MSR-Action3D, and UIUC datasets. On the YTC dataset, for each person, three videos are randomly selected as the gallery and six as the probe. In experiments, we set \(\xi = 0.01\), \(M_0 = I_m\), \(\zeta _{s} = 5\), and \(\zeta _{d}=100\).
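As an illustration of this preprocessing (a sketch with hypothetical helper names; the local feature extraction itself is not shown), the covariance representation with the ridge term and the Gaussian distribution SPD matrix can be built as follows.

```python
import numpy as np

def covariance_spd(F):
    """Covariance representation with ridge delta*I, delta = 0.001 * Tr(C); F: (num_features, dim)."""
    C = np.cov(F, rowvar=False)
    return C + 0.001 * np.trace(C) * np.eye(C.shape[0])

def gaussian_spd(F):
    """(d+1)x(d+1) Gaussian distribution SPD matrix |C|^{-1/(d+1)} [[C + m m^T, m], [m^T, 1]]."""
    m = F.mean(axis=0)
    C = np.cov(F, rowvar=False)
    d = C.shape[0]
    block = np.block([[C + np.outer(m, m), m[:, None]],
                      [m[None, :], np.ones((1, 1))]])
    _, logdet = np.linalg.slogdet(C)       # slogdet avoids under/overflow of |C| for d = 120
    return np.exp(-logdet / (d + 1)) * block
```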

3.2 Evaluation

We exploit the 1-NN classifier to evaluate the performance of all metric learning methods. The following methods are evaluated in our experiments: AIM [19], Stein Divergence [21], LEM [2], SPD-DR [7], CDL [29], RSR-ML [9], LEML [11], and \(\alpha \)-CML [31]. AIM, Stein Divergence, and LEM are basic SPD matrix metrics that measure the geodesic distance between SPD matrices. SPD-DR performs dimensionality reduction on the SPD matrix and then applies the AIM or Stein Divergence between samples. CDL is a Riemannian kernel discriminative learning approach on the SPD manifold. RSR-ML employs a sparse coding and dictionary learning scheme on the SPD manifold. LEML and \(\alpha \)-CML are two LEM-based SPD matrix metric learning methods that project SPD matrices to the tangent space and utilize the LEM to compute the distance between them.

Table 1 shows the comparisons on the four visual tasks. In the object recognition task, we set the dimensionality of the low-dimensional manifolds to \(10 \times 10\) and their number to 20, i.e., \(m=20\). We find that LEM performs better than AIM, 93.0 vs. 85.0, showing that the point on the tangent space is more discriminative. If the manifold point is projected to a low-dimensional discriminative space, i.e., the SPD-DR method, the performance can be improved to 96.0, 0.5 higher than LEML. Compared with SPD-DR, our method achieves 97.5, 1.5 higher, which shows the power of discovering discriminative properties and modeling their roles.

In the video-based face recognition task, the dimensionality of the projected manifolds is \(10 \times 10\), and their number is 40. We achieve 49.2 in this task, 2.5 higher than SPD-DR and approximately 10 points higher than the basic SPD matrix metrics. However, due to the large facial variations caused by posture, illumination, scale, and occlusion, the performance of linear metric learning methods is far below that of the nonlinear kernel method CDL. We believe the reason is that the samples are not separable in the original space, and a higher-dimensional RKHS can relieve this problem.

In the action recognition task, the dimensionality of the low-dimensional manifolds is \(8 \times 8\) and their number is 15. The nonlinear kernel methods CDL and RSR-ML achieve 95.4 and 95.0, respectively, outperforming the existing metric learning methods [7, 11, 31]. In this case, our linear method obtains performance comparable to CDL and RSR-ML, achieving 95.8. Besides, Wang et al. [26] show that the nonlinear kernel matrix representation performs better than the linear SPD representation, while our accuracy is 3.1 higher than that of \(\alpha \)-CML, whose result is based on the kernel matrix [26] rather than the Gaussian distribution SPD matrix.

In the texture classification task, we set the dimensionality of the low-dimensional manifolds to \(8 \times 8\), with 18 low-dimensional manifolds in total. The three basic SPD matrix metrics, i.e., AIM, Stein Divergence, and LEM, achieve comparable performance on the UIUC dataset, 35.6, 35.8, and 36.7, respectively. Meanwhile, metric learning methods bring a remarkable improvement. CDL achieves 54.9, and the accuracy of LEML is 53.9. SPD-DR achieves a better performance of 58.3, showing that there is much noise and information redundancy in the original SPD representation. Our method further improves the result to 60.8, showing that it can not only remove the noise and information redundancy but also benefit from discovering discriminative intrinsic properties and their different roles.

Table 1. Accuracies (%) on the four visual tasks. Our method is shown in bold in the last row.

4 Conclusions

In this paper, we have proposed a novel metric learning method on the SPD manifold, which discovers discriminative intrinsic properties and computes the metric considering their different roles. We formulate the SPD manifold metric learning process as a joint optimization over multiple projections and a set-to-set metric. Moreover, we constrain the projection matrix and the metric matrix to lie on manifolds, obtaining a robust metric. Extensive experiments have shown that our method outperforms existing metric learning methods on the SPD manifold. As the whole process of our method is differentiable, in the future we will combine it with deep learning to obtain the desirable nonlinearity.