1 Introduction

Typical problems in pattern recognition and machine learning deal with predicting a single discrete label or continuous value, depending on whether we face classification or regression. The most natural (and most common) way to formulate the problem of predicting multiple labels/values is to consider it as an appropriate group of independent predictors. However, this approach tends to overlook correlations among output values, which may be of great importance in many challenging and recent application domains. Methods that take these correlations into account have been given different names, such as multi-target, multi-variate or multi-response regression [2]. When the output values are organized in more complex structures such as strings or trees, we talk about Structured Prediction [1, 15]. Application domains include ecological modelling [11], gas tank control [8], remote sensing [22] and signal processing [6].

Particular methods for multi-target regression can be categorized either as problem transformation methods (the original problem is transformed into one or several independent single-output problems) or as algorithm adaptation methods (a particular learning strategy is adapted to deal with multiple interdependent outputs). The latter are usually considered more challenging, as an appropriate and interpretable model is usually obtained as a by-product of solving the prediction problem [2].

The purpose of this work is to improve previous approaches to multi-target regression by introducing metric learning [12] in the context of nearest neighbor methods [15]. In particular, an input-output homogeneity criterion is introduced to learn a distance that consistently leads to improvements according to the empirical validation carried out. In the next section, the proposed methodology is put in the context of distance-based multi-target regression, while Sect. 3 contains the proposal itself. The experimental section follows with details and results, and a final section with conclusions and further work closes the paper.

2 General Notation and State of the Art

Let \({\varvec{x}}=\left[ x_1,\ldots ,x_p\right] \in \mathbb {R}^p\) and \({\varvec{y}}=\left[ y_1,\ldots ,y_q\right] \in \mathbb {R}^q\) be the random input and output vectors, respectively. Each training instance is written as \(\left( {\varvec{x}}^j,{\varvec{y}}^j\right) \in \mathbb {R}^p\times \mathbb {R}^q\), and the corresponding multi-target regression problem consists of estimating a single predictor \(h: \mathbb {R}^p\rightarrow \mathbb {R}^q\) such that the expected deviation between true and predicted outputs is minimized over all possible inputs.

The most straightforward approach consists of obtaining a univariate predictor for each output variable independently, using any of the available methods for single-target prediction [2]. This constitutes the simplest of the so-called problem transformation (also known as local) methods, which transform the given multi-target prediction problem into one or more single-target ones [16, 21].
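As an illustration only (not taken from [2, 16, 21]), the simplest problem-transformation baseline can be sketched in a few lines of numpy by fitting an independent linear least-squares model per target; the function names are ours and linear regression is just a placeholder for any single-target learner.

```python
import numpy as np

def fit_single_target_baseline(X, Y):
    """Fit one independent least-squares model per output dimension
    (the simplest problem-transformation / 'local' approach)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
    # lstsq solves all q right-hand sides at once, but each output column
    # is fitted independently: no inter-target information is shared.
    coefs, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return coefs                                     # shape (p + 1, q)

def predict_single_target_baseline(coefs, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ coefs                                # shape (n, q)
```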

The alternative approach to multi-target prediction is through algorithm adaptation (also known as global) methods [2], which adapt a previous learning strategy to deal directly with multiple targets. Global methods are interesting because they focus on explicitly capturing interdependencies and internal relationships among targets. According to [2], these methods can be categorized as statistical, support-vector, kernel-based, tree-based or rule-based. Apart from these, other strategies can be used. This is the case of one of the best known and most widely used nonparametric families in classification and estimation: the Nearest Neighbor (NN) rules [3]. Using NN for classification and estimation brings interesting benefits, as these methods behave quite smoothly across a wide range of applications. They are known to approach optimal behavior regardless of the distance used as the number of samples grows; nevertheless, the distance becomes crucially important in the finite-sample case.

The \(K\)-NN for Structured Prediction (KNN-SP) method [15] has been proposed for different kinds of prediction problems and for multi-target regression in particular. Using the size of the neighborhood, \(K\), as a parameter, KNN-SP starts by selecting the \(K\) nearest neighbors of a given query point according to a fixed distance (usually a weighted version of the Euclidean distance). The final prediction is constructed as the (weighted) average of the corresponding \(K\) target values, with weights set according to the (Euclidean) distance in the target space [15]. Even though KNN-SP is very straightforward compared to other approaches, the empirical results show that it is very competitive with respect to the state of the art. Moreover, the neighborhood size is the only parameter to tune.
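A minimal sketch of the KNN-SP prediction step is given below (our own simplification, not the reference implementation of [15]): it uses a plain average of the neighbors' targets, as done in the experiments of Sect. 4, whereas [15] also allows weighting the neighbors; function and parameter names are ours.

```python
import numpy as np

def knn_sp_predict(X_train, Y_train, x_query, k_p, feature_weights=None):
    """Simplified KNN-SP: average the targets of the k_p nearest training
    points under a (possibly feature-weighted) Euclidean distance."""
    p = X_train.shape[1]
    w = np.ones(p) if feature_weights is None else feature_weights
    diffs = X_train - x_query                 # (n, p) differences to the query
    d2 = (diffs ** 2) @ w                     # weighted squared distances
    nn_idx = np.argsort(d2)[:k_p]             # indices of the k_p nearest neighbors
    return Y_train[nn_idx].mean(axis=0)       # plain average of their targets
```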

3 Distance Metric Learning for Multi-target Prediction (DMLMTP)

Nearest Neighbor methods have been very widely used, especially for classification. Even though it was introduced very early [18], Distance Metric Learning (DML) has recently been studied in depth as a very convenient way to improve the behavior of distance-based methods [12]. Many powerful methods have been proposed to look for the best distance (in the input space) for a particular problem.

A possible way to improve the results obtained by KNN-SP is to adapt the input space distance to the particular problem according to the final prediction goal, in the same way as has been done for classification.

Many different criteria and approaches have been proposed to learn distances for classification, but all of them share the same rationale: a distance is good if it keeps same-class points close and points from other classes far away. Many recent approaches implement this rationale through constraints relating pairs or triplets of training points. In the case of pairs, one must select pairs of points that need to be kept close (similar points) or far away (dissimilar points). In the case of triplets, one must select triplets \(({\varvec{x}}^{\varvec{i}},{\varvec{x}}^{\varvec{j}},{\varvec{x}}^{\varvec{\ell }})\), where \({\varvec{x}}^{\varvec{i}}\) and \({\varvec{x}}^{\varvec{j}}\) are similar and should be kept close, while \({\varvec{x}}^{\varvec{i}}\) and \({\varvec{x}}^{\varvec{\ell }}\) are dissimilar and should be pushed farther apart.

In contrast to classification problems, it is far from obvious that similar ideas will be useful in regression without introducing more information about both the input and the output spaces. In the present work, a first attempt to learn an input distance for multi-target regression is made by introducing a homogeneity criterion between input and output spaces using triplets. In particular, we propose to select the same kind of triplets as in classification problems but with a different criterion for similarity. Instead of using labels, similarity between points is established according to their outputs, in such a way that the relative orderings induced by distances in the input and output spaces are preserved.

We formulate an optimization problem to learn an input distance for multi-target regression by following an approach similar to the one in [17] and also in [13, 26]. The goal is to obtain a Mahalanobis-like distance, parametrized by a matrix, W, which maximizes a margin criterion. As usual, this problem is converted into minimizing a regularizer for W (its Frobenius norm) subject to several (soft) constraints using triplets. In our particular case we have

$$\begin{aligned} \min _{W,\rho ,\xi _{ij\ell }}&\quad \frac{1}{2} \Vert W \Vert _F - \rho + \frac{1}{\nu |{\mathcal{T}_K}|} \sum _{i,j,\ell \in \mathcal{T}_K} \xi _{ij\ell }\\ s.t.&\quad d^2_{W}({\varvec{x}}^{\varvec{i}},{\varvec{x}}^{\varvec{\ell }})-d^2_{W}({\varvec{x}}^{\varvec{i}},{\varvec{x}}^{\varvec{j}})\ge \rho -\xi _{ij\ell },\\&\quad \xi _{ij\ell }\ge 0, \qquad \forall ~i,j,\ell \in \mathcal{T}_K\end{aligned}$$

where \(d^2_W({\varvec{x}}^{\varvec{i}},{\varvec{x}}^{\varvec{j}}) = ({\varvec{x}}^{\varvec{i}}-{\varvec{x}}^{\varvec{j}})^T W ({\varvec{x}}^{\varvec{i}}-{\varvec{x}}^{\varvec{j}})\) is the (squared) distance in the input space and the set of triplets is defined as

$$\begin{aligned} {\mathcal{T}_K} = \{ ({\varvec{x}}^{\varvec{i}}, {\varvec{x}}^{\varvec{j}}, {\varvec{x}}^{\varvec{\ell }}) ~:~ {\varvec{x}}^{\varvec{j}},{\varvec{x}}^{\varvec{\ell }} \in \mathcal{N}_K({\varvec{x}}^{\varvec{i}}) \quad \text{ and }\quad d({\varvec{y}}^{\varvec{i}},{\varvec{y}}^{\varvec{\ell }})-d({\varvec{y}}^{\varvec{i}},{\varvec{y}}^{\varvec{j}})\ge 0 \} \end{aligned}$$

where \(\mathcal{N}_K({\varvec{x}})\) is the neighborhood formed by the \(K\) nearest neighbors of \({\varvec{x}}\) in the input space.
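For illustration, a direct numpy construction of the triplet index set \(\mathcal{T}_K\) could look as follows; this is a sketch based on the definition above, not the authors' code, and it keeps ties in the output distances, matching the \(\ge 0\) condition.

```python
import numpy as np
from itertools import permutations

def build_triplets(X, Y, K):
    """Index triplets (i, j, l) of T_K: j and l lie in the K-neighborhood of i
    in input space, and y^j is at least as close to y^i as y^l is."""
    n = X.shape[0]
    triplets = []
    for i in range(n):
        d2_in = np.sum((X - X[i]) ** 2, axis=1)
        d2_in[i] = np.inf                         # exclude the point itself
        neigh = np.argsort(d2_in)[:K]             # N_K(x^i)
        for j, l in permutations(neigh, 2):       # ordered pairs, j != l
            if np.linalg.norm(Y[i] - Y[l]) - np.linalg.norm(Y[i] - Y[j]) >= 0:
                triplets.append((i, j, l))
    return triplets
```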

Note that the formulation of the optimization problem is the same as in other metric learning and support vector learning approaches; the main change lies in the way the particular constraints have been selected.

In the formulation above, an extra constraint must be introduced to make the matrix W positive semi-definite. This makes the problem considerably more difficult, but there are a number of ways in which it can be tackled [13, 26]. Nevertheless, in this preliminary work we simplify the above formulation further. On one hand, we consider only a diagonal matrix, \(W=\mathrm{diag}({\varvec{w}})\) with \({\varvec{w}}=[{w_1},\ldots ,{w_p}]\), and on the other hand, we introduce the corresponding constraints, \({w_i} \ge 0\), \(i=1,\ldots , p\), into the above optimization. The corresponding dual problem can be written in terms of two new sets of variables as

$$\begin{aligned} {\displaystyle \min _{\varvec{\alpha },\varvec{\lambda }}}&\quad \frac{1}{2}\left( \varvec{\alpha }^TH\varvec{\alpha } + 2\varvec{\alpha }^T\phi \varvec{\lambda } + \varvec{\lambda }^T\varvec{\lambda }\right) \\ \text {s.t.}&\quad {\displaystyle \sum _{i=1}^{|\mathcal{T}_K|}}\alpha _i = 1 \\&\quad 0 \le \alpha _i\le \frac{1}{\nu |\mathcal{T}_K|}\quad i=1,\ldots ,|\mathcal{T}_K|\\&\quad \lambda _j\ge 0 \quad j=1,\ldots ,p \end{aligned}$$

\(\phi \in \mathbb {R}^{|\mathcal{T}_K|\times p}\) is a matrix with one row, \(({\varvec{x}}^i-{\varvec{x}}^\ell )\circ ({\varvec{x}}^i-{\varvec{x}}^\ell )-({\varvec{x}}^i-{\varvec{x}}^j)\circ ({\varvec{x}}^i-{\varvec{x}}^j)\), for each considered triplet, where \(\circ \) denotes the Hadamard (entrywise) vector product. The kernel matrix is \(H=\phi \phi ^T\) and the weight vector is obtained as \({\varvec{w}}=\varvec{\alpha }^T\phi +\varvec{\lambda }\).
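The following sketch (not the authors' implementation) shows how \(\phi \), \(H\) and \({\varvec{w}}\) could be assembled with numpy, assuming the triplet indices from the previous sketch and dual solutions \(\varvec{\alpha }\) and \(\varvec{\lambda }\) coming from any solver of the dual.

```python
import numpy as np

def triplet_feature_matrix(X, triplets):
    """Row per triplet: (x^i - x^l)o(x^i - x^l) - (x^i - x^j)o(x^i - x^j),
    so each margin constraint becomes linear in the diagonal weights w."""
    rows = []
    for i, j, l in triplets:
        dil = X[i] - X[l]
        dij = X[i] - X[j]
        rows.append(dil * dil - dij * dij)    # Hadamard (entrywise) products
    return np.asarray(rows)                   # shape (|T_K|, p)

# Once the dual variables alpha (|T_K|,) and lam (p,) are available:
# phi = triplet_feature_matrix(X, triplets)
# H   = phi @ phi.T                           # kernel matrix of the dual
# w   = alpha @ phi + lam                     # learned diagonal of W
```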

An ad hoc solver based on an adapted SMO approach [10, 14] has been implemented specifically for this work. This solver is able to arrive at relatively good results in reasonable time for all the datasets considered in the empirical study, as will be shown in the next section.
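For small problems, the same dual can also be handed to a generic constrained solver instead of SMO; the sketch below uses SciPy's SLSQP purely as a reference point (our assumption, not the paper's solver), and will not scale as well as the adapted SMO.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_reference(phi, nu):
    """Reference (non-SMO) solution of the dual for small |T_K|."""
    T, p = phi.shape
    H = phi @ phi.T
    C = 1.0 / (nu * T)                         # upper bound on each alpha_i

    def objective(z):
        a, lam = z[:T], z[T:]
        return 0.5 * (a @ H @ a + 2 * a @ phi @ lam + lam @ lam)

    cons = ({'type': 'eq', 'fun': lambda z: np.sum(z[:T]) - 1.0},)
    bounds = [(0.0, C)] * T + [(0.0, None)] * p  # box on alpha, lambda >= 0
    z0 = np.concatenate([np.full(T, 1.0 / T), np.zeros(p)])
    res = minimize(objective, z0, method='SLSQP', bounds=bounds, constraints=cons)
    a, lam = res.x[:T], res.x[T:]
    return a @ phi + lam                       # learned diagonal weights w
```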

4 Experiments

In this section, we describe the experimental setup and discuss the main results of the proposed DMLMTP algorithm. First, we present technical details related to the datasets, parameter setup and implementation. Next, we present comparative results of the learned distance against the Euclidean one when predicting multivariate outputs with the KNN-SP approach over fifteen publicly available multi-target prediction datasets.

In the experiments we distinguish between the number of neighbors used to learn the distance with DMLMTP, \(K\), and the number of neighbors used to obtain the final prediction with KNN-SP, \(k_p\). The value of \(K\) should be small to keep the number of triplets small for efficiency reasons. For all experiments reported in this paper, the number of nearest neighbors for training DMLMTP was set to \(K= 2,\ldots ,6\), while the neighborhood sizes for prediction were taken as odd values from 3 to 35. The final prediction is computed as the average of the target values of the \(k_p\) nearest neighbors.
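In practice, applying the learned diagonal distance inside KNN-SP amounts to rescaling the input features, since \(d_{\varvec{w}}({\varvec{x}},{\varvec{x}}')^2=\sum _i w_i(x_i-x'_i)^2\). The hypothetical glue code below ties together the earlier sketches; it is not the authors' pipeline.

```python
import numpy as np

def rescale_with_weights(X, w):
    """Multiply each attribute by sqrt(w_i); attributes with w_i = 0 are
    effectively ignored by the Euclidean distance on the rescaled data."""
    return X * np.sqrt(np.maximum(w, 0.0))    # clip tiny negatives from the solver

# Hypothetical end-to-end usage of the previous sketches:
# triplets = build_triplets(X_train, Y_train, K=3)
# phi      = triplet_feature_matrix(X_train, triplets)
# w        = solve_dual_reference(phi, nu=0.1)
# y_hat    = knn_sp_predict(rescale_with_weights(X_train, w), Y_train,
#                           rescale_with_weights(x_query[None, :], w)[0], k_p=11)
```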

In the experiments, 5-fold cross-validation has been used on each dataset, except for 4 of them that have been split into train and test subsets for efficiency and compatibility reasons. Table 1 summarizes the main details [2, 21]. The cross-validation procedure has been integrated into the MULAN software package [20].

Table 1. Datasets used in the experiments and corresponding details. Datasets partitioned into train and test subsets are indicated by the corresponding two sizes in the second column.

As in other similar works, we use the average Relative Root Mean Squared Error (aRRMSE), which, given a test set \(D_{test}\) and a predictor \(h\), is defined as

$$\begin{aligned} aRRMSE(h;D_{test})=\frac{1}{q}\sum _{i=1}^{q}\sqrt{\frac{ \sum _{\left( {\varvec{x}},{\varvec{y}}\right) \in D_{test}} \left( \hat{y_i} -{y_i}\right) ^{2}}{ \sum _{\left( {\varvec{x}},{\varvec{y}}\right) \in D_{test}} \left( {\overline{y_i} - y_i}\right) ^{2}}} \end{aligned}$$
(1)

where \(\overline{y_i}\) is the mean value of the target variable \(y_i\), and \(\varvec{\hat{y}}= h({\varvec{x}})\). We use the Wilcoxon signed-rank test and the Friedman procedure with different post-hoc tests to compare algorithms over multiple datasets [4, 7].
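A direct numpy transcription of Eq. (1) is sketched below; note that the definition leaves open whether \(\overline{y_i}\) is computed on the training or on the test targets, so the function takes it as an optional argument (falling back to the test mean, which is an assumption on our part).

```python
import numpy as np

def arrmse(Y_true, Y_pred, y_bar=None):
    """Average Relative RMSE of Eq. (1): per-target error of the predictor
    relative to the error of always predicting the mean, averaged over targets."""
    if y_bar is None:
        y_bar = Y_true.mean(axis=0)                # assumed fallback: test-set means
    num = np.sum((Y_pred - Y_true) ** 2, axis=0)   # predictor's squared errors
    den = np.sum((y_bar - Y_true) ** 2, axis=0)    # mean predictor's squared errors
    return float(np.mean(np.sqrt(num / den)))
```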

Fig. 1. aRRMSE values corresponding to different neighborhood sizes, \(k_p\), using the Euclidean (KNN-SP) and the learned (DMLMTP) distances. The best neighborhood size used by DMLMTP is indicated along with the name of each dataset.

Figure 1 shows the aRRMSE versus the prediction neighborhood size, \(k_p\), for 9 of the 16 datasets considered. Only the best neighborhood size used for training, \(K\), is shown. Moreover, the best results on each curve are marked with a circle and a diamond, respectively. These best results are shown for all datasets in Table 2.

Contrary to our expectations, the best performance for DMLMTP over large datasets is obtained with small \(K\). This could be strongly related to the growth in the number of triplets that violate the considered constraints.

The last columns of Table 2 contain the absolute difference between the aRRMSE of KNN-SP and DMLMTP, its sign, and the average ranking with regard to absolute differences. DMLMTP is better with a significance level of 5% according to the Wilcoxon test, which yields a p-value of 6.1035e-5. For all datasets, DMLMTP performs at least as well as (Euclidean) KNN-SP, and the difference increases for datasets of higher dimensionality. This could be related to the learned input transformation, which sets some weights to zero and thus ignores irrelevant attributes. In fact, if we compute the sparsity index of the corresponding transformation vector as the relative number of zeros with respect to the dimensionality, our algorithm obtains values below 0.5 except for the datasets osales, rf1 and scpf.

Table 2. aRRMSE obtained by the DMLMTP and KNN-SP algorithms on each dataset, along with comparison details.

5 Concluding Remarks and Further Work

An attempt to improve nearest-neighbor-based multi-target prediction has been made by introducing a specific distance metric learning algorithm. Combining these strategies has led to very competitive results in the preliminary experiments carried out. In a wide range of situations and under large variations of the corresponding parameters, the proposal behaves smoothly over the datasets considered, paving the way for more specialized algorithms. Future work is planned in several directions. On one hand, different optimization schemes can be adopted to improve both efficiency and performance. On the other hand, different formulations can be adopted by establishing more accurate constraints able to properly capture all kinds of dependencies among input and output vectors in challenging multi-output regression problems.