1 Introduction

Multi-task learning (MTL) (Caruana 1997) is a methodology in which multiple related tasks are learned simultaneously so that each task can use information from the other tasks. In the field of statistics, MTL is formulated as an estimation problem for multiple statistical models corresponding to multiple datasets with prior information about relationships among the models. Because related tasks in real problems tend to share some common information, MTL is expected to achieve better estimation and prediction accuracy than independently estimated models (Zhang and Yang 2021). The key part of MTL is the assumption about how tasks are related to each other and how the common information is transferred across models. When tasks are sufficiently related as assumed, MTL can leverage the shared information to improve the efficiency of estimation and the generalization performance. However, when tasks are unrelated or only weakly related, forcing information sharing among them can be detrimental.

A reasonable assumption to address this issue is that tasks are classified into some groups sharing common characteristics. MTL methods based on this assumption cluster the parameters of the models. For instance, Argyriou et al. (2007) introduced the k-means algorithm for task clustering, while several other studies (Zhong and Kwok 2012; Yamada et al. 2017; He et al. 2019; Dondelinger et al. 2020; Okazaki and Kawano 2024) utilized fused regularization techniques such as the fused lasso (Tibshirani et al. 2005) and the network lasso (Hallac et al. 2015). However, these methods have not considered the existence of outlier tasks that have large task-specific uniqueness or do not share any common characteristics with other tasks. The clustering techniques employed in MTL methods intend to classify all tasks into some clusters and estimate each model to be close to the center of its cluster. Thus, the existence of outlier tasks would deteriorate the performance of clustering and lead to a false interpretation of the estimated clusters. On the other hand, some robust MTL methods (Chen et al. 2011; Gong et al. 2012) have been proposed to detect outlier tasks and reduce their influence on the common structure. However, these methods also have certain limitations. First, they typically classify tasks into only one shared structure and other outlier tasks. If multiple common characteristics exist among tasks, it would be difficult to extract the true task structures. Second, these methods often rely on the group lasso regularization (Yuan and Lin 2006) for the outlier parameters to identify outlier tasks, which restricts the magnitude of the outlier parameters. Limiting the magnitude of the outlier parameters may ignore the nature of outlier tasks.

To address these limitations of existing MTL methods, we propose a novel robust MTL method called Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC aims to simultaneously cluster tasks and detect outlier tasks by integrating loss functions for the tasks with robust regularization terms. The regularization term is inspired by robust convex clustering (Quan and Chen 2020), which we further extend to incorporate non-convex and group-sparse penalties. This extension allows MTLRRC to effectively identify outlier tasks and achieve robust task clustering. We further establish a connection between the extended robust clustering and the multivariate M-estimator. This connection provides an intuitive interpretation of the robustness of MTLRRC against outlier tasks. The parameters included in MTLRRC are estimated by a modified version of the alternating direction method of multipliers (ADMM; Boyd et al. 2011). Theoretical and empirical analyses of the convergence are provided.

The remainder of this paper is structured as follows. In Sect. 2, we first illustrate the problem set-up of the MTL method. After that, we review MTL methods based on convex clustering and introduce robust convex clustering along with its non-convex and group-sparse extensions. We also explore the connection between the extended robust clustering problem and the multivariate M-estimator. In Sect. 3, we present the MTLRRC and discuss its interpretation concerning the multivariate M-estimator. In Sect. 4, the modified ADMM algorithm for estimating the parameters of MTLRRC is established. Simulation studies are presented in Sect. 5 to demonstrate the effectiveness of MTLRRC and the application to real-world datasets is illustrated in Sect. 6. The conclusions are given in Sect. 7.

2 Motivation and methodology

2.1 Problem set-up

Suppose that we have T datasets. For each dataset \(m \ (m=1,\ldots ,T)\), we observed \(n_{m}\) pairs of data points \(\{({\varvec{x}}_{mi},y_{mi}); i=1,\ldots ,n_m\}\), where \({\varvec{x}}_{mi}\) is a p-dimensional vector of explanatory variables and \(y_{mi}\) is the corresponding response variable following a distribution in the exponential family with mean \(\mu _{mi}={\mathbb {E}}[y_{mi}|{\varvec{x}}_{mi}]\). Let \(X_{m}=({\varvec{x}}_{m1},\ldots , {\varvec{x}}_{mn_{m}})^{\top }\in {\mathbb {R}}^{n_{m}\times p}\) be the design matrix, \({\varvec{y}}_{m}=(y_{m1},\ldots ,y_{mn_{m}})^{\top }\in {\mathbb {R}}^{n_{m}}\) be the response vector for dataset m, and \(n=\sum _{m=1}^{T}n_{m}\) be the total number of samples. We assume that each feature vector \({\varvec{x}}_{mi}\) is standardized to have zero mean and unit variance and that the response vector \({\varvec{y}}_{m}\) is centered to have zero mean when it is continuous.

Our goal is to estimate T generalized linear models (GLMs) simultaneously, which take the form:

$$\begin{aligned} & \eta _{mi}=g(\mu _{mi}) = w_{m0} + {\varvec{x}}_{mi}^{\top }{\varvec{w}}_{m}, \nonumber \\ & i=1,\ldots ,n_{m},\;m=1,\ldots ,T, \end{aligned}$$
(1)

where \(w_{m0}\) is an intercept for m-th task, \({\varvec{w}}_{m}=(w_{m1}, \ldots ,w_{mp})^\top \) is a p-dimensional regression coefficient vector for m-th task, \(\eta _{mi}\) is a linear predictor, and \(g(\cdot )\) is a canonical link function. The density function of the response given the explanatory variables is expressed as

$$\begin{aligned} & f(y_{mi} |{\varvec{x}}_{mi};\theta ({\varvec{x}}_{mi}),\phi ) \\ & \quad = \exp \left\{ \frac{y_{mi}\theta ({\varvec{x}}_{mi})-b(\theta ({\varvec{x}}_{mi}))}{a(\phi )}+c(y_{mi},\phi )\right\} , \end{aligned}$$

where \(a(\cdot ),b(\cdot )\), and \(c(\cdot )\) are known functions that vary according to the distributions, \(\phi \) is a known dispersion parameter, and \(\theta (\cdot )\) is the natural parameter, which is expressed as \(g(\mu _{mi})=\theta ({\varvec{x}}_{mi})\). In this paper, we assume that all \({\varvec{y}}_m\) follow only the same type of distribution, regardless of the tasks.

Let \(W=({\varvec{w}}_{1},\ldots ,{\varvec{w}}_{T})^{\top }\in {\mathbb {R}}^{T\times p}\) be the regression coefficient matrix, and \({\varvec{w}}_{0}=(w_{10},\ldots ,w_{T0})^{\top }\in {\mathbb {R}}^{T}\) be the intercept vector. To estimate T GLMs in (1) simultaneously, we formulate the regularization-based multi-task learning (MTL) method as

$$\begin{aligned} \min _{\begin{array}{c} {\varvec{w}}_{0}, W \end{array}} \left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})+ \lambda \Omega (W) \right\} , \end{aligned}$$
(2)

where \(L(w_{m0},{\varvec{w}}_{m})\) is a loss function derived from the negative log-likelihood of the m-th GLM task, \(\lambda \) is a regularization parameter with a non-negative value, and \(\Omega (\cdot )\) is a regularization term that encourages sharing the information among tasks. When continuous response vectors \({\varvec{y}}_{m}\in {\mathbb {R}}^{n_{m}}\) are considered, the linear regression is given by \(a(\phi )=\phi , \phi =1\), and \(b(\theta ) =\frac{\theta ^{2}}{2}\). The regression loss function is

$$\begin{aligned} L(w_{m0},{\varvec{w}}_{m}) = \frac{1}{2}\Vert {\varvec{y}}_{m} - {\varvec{X}}_{m}{\varvec{w}}_{m} \Vert _{2}^{2}. \end{aligned}$$

Note that the intercepts are excluded from the model without loss of generality. When binary response vectors \({\varvec{y}}_{m}\in \{0,1\}^{n_{m}}\) are considered, the logistic regression is given by \(a(\phi )=\phi , \phi =1\), and \(b(\theta ) =\log (1+e^{\theta })\). The loss function of logistic regression is

$$\begin{aligned} \begin{aligned} L(w_{m0},{\varvec{w}}_{m})=&- \sum _{i=1}^{n_{m}}\Big \{y_{mi}(w_{m0}+{\varvec{w}}_{m}^{\top }{\varvec{x}}_{mi})\\ &-\log \{1+{\exp {(w_{m0}+{\varvec{w}}_{m}^{\top }{\varvec{x}}_{mi})}}\} \Big \}. \end{aligned} \end{aligned}$$
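For concreteness, the two loss functions above can be written compactly in code. The following is a minimal sketch in Python/NumPy; the function and array names are illustrative, not from the paper.

```python
import numpy as np

def squared_error_loss(w, X, y):
    """Linear-regression loss: 0.5 * ||y - X w||_2^2 (intercept omitted)."""
    r = y - X @ w
    return 0.5 * np.dot(r, r)

def logistic_loss(w0, w, X, y):
    """Negative log-likelihood of logistic regression with intercept w0."""
    eta = w0 + X @ w                        # linear predictor
    # -sum_i { y_i * eta_i - log(1 + exp(eta_i)) }, computed stably
    return -np.sum(y * eta - np.logaddexp(0.0, eta))
```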

2.2 Multi-task learning via convex clustering

To classify tasks into some clusters that share common characteristics, Okazaki and Kawano (2024) proposed an MTL method called multi-task learning via convex clustering (MTLCVX) as follows:

$$\begin{aligned} & \min _{{\varvec{w}}_{0}, W,U} \left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})+\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}\Vert _{2}^{2} \right. \\ & \qquad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} , \\ \end{aligned}$$

where \({\varvec{u}}_{m}=(u_{m1},\ldots ,u_{mp})^{\top }\in {\mathbb {R}}^{p}\) is a centroid parameter for the m-th task, \(U=({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{T})^{\top }\) is a \(T\times p\) matrix, \(r_{m_{1},m_{2}}\) is a weight between the \(m_{1}\)-th and \(m_{2}\)-th tasks, \({\mathcal {E}}\) is a set of task pairs \((m_{1},m_{2})\), and \(\lambda _{1}\) and \(\lambda _{2}\) are non-negative regularization parameters. In this problem, the regularization terms are based on convex clustering (Hocking et al. 2011), which performs the clustering of tasks. The second term encourages the centroids \({\varvec{u}}_m\) to be close to their corresponding regression coefficients \({\varvec{w}}_m\). The third term induces \({\varvec{u}}_{m_{1}}\simeq {\varvec{u}}_{m_{2}}\). If \({\varvec{u}}_{m_{1}}\) and \({\varvec{u}}_{m_{2}}\) are estimated to take the same value, the \(m_{1}\)-th and \(m_{2}\)-th tasks are interpreted as belonging to the same cluster and sharing the same characteristics. Therefore, \({\varvec{u}}_{m}\) can be viewed as a cluster center.

This minimization problem is solved by the block coordinate descent (BCD) algorithm. The algorithm alternately minimizes the objective with respect to \({\varvec{w}}_{m}\) and \({\varvec{u}}_{m}\): we alternately solve a regression model with a ridge-type penalty centered at \({\varvec{u}}_{m}\) and a convex clustering problem in which each \({\varvec{w}}_{m} \;(m=1,\ldots , T)\) is regarded as a sample.

The weights \(r_{m_{1},m_{2}}\) are set by the k-nearest neighbor as follows:

$$\begin{aligned} R = \frac{S^{\top }+S}{2}, \quad (S)_{m_{1}m_{2}}= {\left\{ \begin{array}{ll} 1 & \widehat{{\varvec{w}}}^{\textrm{STL}}_{m_{1}}\;\text {is a }k\text {-nearest neighbor of}\;\widehat{{\varvec{w}}}_{m_{2}}^{\textrm{STL}}, \\ 0 & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(3)

where R is a \(T \times T\) matrix whose \((m_{1},m_{2})\) component is \((R)_{m_{1}m_{2}}= r_{m_{1},m_{2}}\) and \(\widehat{{\varvec{w}}}^{\textrm{STL}}_{m}\) is a regression coefficient vector for the m-th task estimated by single-task learning such as OLS, ridge, or lasso. This setting is also used in Yamada et al. (2017). This construction of the weights follows that of convex clustering (Hocking et al. 2011), where setting some \(r_{m_{1},m_{2}}\) to zero reduces the computational cost and improves the estimation results of clustering.
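A minimal sketch of the weight construction in Eq. (3), assuming the single-task estimates are stacked row-wise in a \(T\times p\) array `W_stl` and that nearest neighbors are measured by Euclidean distance (both assumptions for illustration):

```python
import numpy as np

def knn_weights(W_stl, k=5):
    """Build R = (S^T + S)/2, where S[m1, m2] = 1 if task m1's STL estimate
    is one of the k nearest neighbors of task m2's STL estimate."""
    T = W_stl.shape[0]
    D = np.linalg.norm(W_stl[:, None, :] - W_stl[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a task is not its own neighbor
    S = np.zeros((T, T))
    for m2 in range(T):
        nn = np.argsort(D[:, m2])[:k]  # indices of the k nearest tasks to m2
        S[nn, m2] = 1.0
    return (S.T + S) / 2.0
```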

2.3 Robust convex clustering

Suppose that we have n observed p-dimensional data \(\left\{ {\varvec{x}}_{i};i=1,\ldots ,n\right\} \). To classify these data into distinct clusters, convex clustering (Pelckmans et al. 2005; Hocking et al. 2011; Lindsten et al. 2011) has been proposed.

Quan and Chen (2020) pointed out that convex clustering is sensitive to just a few outliers, and they proposed robust convex clustering (RCC) as follows:

$$\begin{aligned} \begin{aligned}&\min _{U,O} \left\{ \sum _{i=1}^{n} \frac{1}{2}\Vert {\varvec{x}}_{i} - {\varvec{u}}_{i} -{\varvec{o}}_{i}\Vert _{2}^{2}\right. \\ &\quad \left. +\lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}} -{\varvec{u}}_{i_{2}} \Vert _{2} +\lambda _{2}\sum _{i=1}^{n}\Vert {\varvec{o}}_{i}\Vert _{1} \right\} , \end{aligned} \end{aligned}$$
(4)

where \({\varvec{u}}_{i}=(u_{i1},\ldots ,u_{ip})^{\top }\in {\mathbb {R}}^{p}\) is a vector of centroid parameters for the i-th sample, \({\varvec{o}}_{i}=(o_{i1},\ldots ,o_{ip})^{\top }\in {\mathbb {R}}^{p}\) is a vector of outlier parameters for the i-th sample, and \(U=({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{n})^{\top }\) and \(O=({\varvec{o}}_{1},\ldots ,{\varvec{o}}_{n})^{\top }\) are \(n\times p\) matrices. The third term selects the outlier parameters by shrinking each element of \({\varvec{o}}_{i}\) toward exactly zero. If \(o_{ij}\) is estimated to be a non-zero value, the j-th feature of the i-th sample is considered an outlier.

From the relationship between the least squares loss function and the \(L_{1}\) penalty for the outlier parameters (Antoniadis 2007; Gannaz 2007), the minimization problem (4) is equivalent to the following minimization problem:

$$\begin{aligned} & \min _{U} \left\{ \sum _{i=1}^{n}\sum _{j=1}^{p} h_{\lambda _{2}}(x_{ij} - u_{ij})\right. \nonumber \\ & +\left. \lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} \right\} , \end{aligned}$$
(5)

where \(h_{\lambda }(\cdot )\) is Huber's loss function defined as

$$\begin{aligned} h_{\lambda }(z)={\left\{ \begin{array}{ll} \frac{1}{2}z^{2} & |z| \le \lambda , \\ \lambda |z|-\frac{\lambda ^{2}}{2} & |z|> \lambda . \end{array}\right. } \end{aligned}$$

This would indicate that RCC is a robust version of convex clustering derived from replacing the loss function with the component-wise robust loss function.
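The equivalence between (4) and (5) can be checked numerically in one dimension: minimizing \(\frac{1}{2}(z-o)^{2}+\lambda |o|\) over \(o\) yields Huber's loss evaluated at \(z\). A small sketch (values are illustrative):

```python
import numpy as np

def huber(z, lam):
    """Huber's loss h_lambda(z)."""
    return np.where(np.abs(z) <= lam, 0.5 * z**2, lam * np.abs(z) - 0.5 * lam**2)

def soft(z, lam):
    """Soft-thresholding: argmin_o 0.5*(z - o)^2 + lam*|o|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam = 1.5
z = np.linspace(-5, 5, 101)
o = soft(z, lam)                                    # optimal outlier parameter
profiled = 0.5 * (z - o) ** 2 + lam * np.abs(o)     # objective minimized over o
assert np.allclose(profiled, huber(z, lam))         # matches h_lambda(z)
```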

2.4 Non-convex extensions of robust convex clustering

Although Quan and Chen (2020) only considered the \(\ell _{1}\) penalty for selecting outlier parameters, other types of shrinkage penalties, such as non-convex penalties and group penalties, can also be considered. For example, if the group lasso (Yuan and Lin 2006) is employed for the third term in (4), we can detect sample-wise outliers. Since our purpose in this paper is to detect task-wise outliers, we first consider the generalized robust clustering problem for detecting sample-wise outliers as follows:

$$\begin{aligned} & \min _{U,O} \left\{ \sum _{i=1}^{n} \frac{1}{2}\Vert {\varvec{x}}_{i} - {\varvec{u}}_{i} -{\varvec{o}}_{i}\Vert _{2}^{2}\right. \nonumber \\ & \left. +\lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} + \sum _{i=1}^{n}P({\varvec{o}}_{i};\lambda _{2},\gamma ) \right\} , \end{aligned}$$
(6)

where \(P(\cdot ;\lambda ,\gamma )\) is a penalty function that induces group sparsity and \(\gamma \) is a tuning parameter that adapts the shape of the penalty function. By estimating \({\varvec{o}}_{i}\) as a zero vector through group penalties, this problem aims to detect sample-wise outliers. For the i-th sample \({\varvec{x}}_{i}\), even if one feature \(x_{ij}\) takes an extreme value compared with its cluster center \({\widehat{u}}_{ij}\), \(\widehat{{\varvec{o}}}_{i}\) would not necessarily be a non-zero vector. Only when the \(L_{2}\)-distance \(\Vert {\varvec{x}}_{i}-\widehat{{\varvec{u}}}_{i}\Vert _{2}\) takes an extreme value is \(\widehat{{\varvec{o}}}_{i}\) estimated to be a non-zero vector and \({\varvec{x}}_{i}\) interpreted as an outlier sample.

If a non-convex penalty such as the group SCAD or group MCP (Huang et al. 2012) is employed for the third term, the minimization problem is no longer a convex optimization problem. Therefore, we refer to the minimization problem as Robust Regularized Clustering (RRC).

Algorithm 1
Block coordinate descent algorithm for Problem (6)

The minimization problem (6) is solved by the BCD algorithm shown in Algorithm 1. Here, \(X= ({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n})^{\top }\in {\mathbb {R}}^{n \times p}\), \(L(U, O)\) is the objective function in (6), and the superscripts in parentheses indicate the number of updates. The problem (7) can be solved by algorithms for convex clustering. Since the problem (8) is separable in terms of \({\varvec{o}}_{i}\), it is expressed as

$$\begin{aligned} {\varvec{o}}_{i}^{(t+1)}= & \mathop {\mathrm{arg~min}}\limits _{{\varvec{o}}_{i}} \left\{ \frac{1}{2}\Vert {\varvec{x}}_{i}-{\varvec{u}}^{(t+1)}_{i}-{\varvec{o}}_{i} \Vert _{2}^{2} + P({\varvec{o}}_{i};\lambda _{2},\gamma )\right\} ,\nonumber \\ & i=1,\ldots ,n. \end{aligned}$$
(9)

Therefore, the update can be obtained by

$$\begin{aligned} {\varvec{o}}_{i}^{(t+1)} = {\varvec{\Theta }}({\varvec{x}}_{i}-{\varvec{u}}^{(t+1)}_{i};\lambda _{2},\gamma ), \quad i=1,\ldots ,n, \end{aligned}$$

where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is a group-thresholding function defined for the corresponding penalty function \(P(\cdot ;\lambda ,\gamma )\). For example, \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) for the group lasso is given by

$$\begin{aligned} {\varvec{\Theta }}^{\textrm{glasso}}({\varvec{z}};\lambda ,\gamma )= S({\varvec{z}};\lambda ), \end{aligned}$$

where \(S(\cdot ;\lambda )\) is a group soft-thresholding function defined as

$$\begin{aligned} S({\varvec{z}};\lambda ) = \max \left( 0,1-\frac{\lambda }{\Vert {\varvec{z}}\Vert _{2}}\right) {\varvec{z}}. \end{aligned}$$
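A sketch of the group soft-thresholding operator \(S({\varvec{z}};\lambda )\), with the zero-vector case handled explicitly; the final comment illustrates how it would be used in the update (9) for the group lasso penalty.

```python
import numpy as np

def group_soft_threshold(z, lam):
    """S(z; lam) = max(0, 1 - lam/||z||_2) * z, with S(0; lam) = 0."""
    norm = np.linalg.norm(z)
    if norm <= lam:                  # includes z = 0
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

# Update (9) with the group lasso penalty (schematic):
# o_i_new = group_soft_threshold(x_i - u_i_new, lam2)
```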

The solution of (6) obtained by Algorithm 1 is related to M-estimators, in a manner similar to the connection between the minimization problems (4) and (5). Let \(A_{r}\) be a \(|{\mathcal {E}}|\times n\) matrix whose rows are \(r_{i_{1},i_{2}}{\varvec{a}}_{(i_{1},i_{2})}^{\top }\) for \((i_{1},i_{2})\in {\mathcal {E}}\), and we set

$$\begin{aligned} D = A_{r} \otimes I_{p}, \end{aligned}$$

where \(\otimes \) denotes the Kronecker product, \(I_{p}\) is a \(p\times p\) identity matrix, and \({\varvec{a}}_{(i_{1},i_{2})}\) is defined as

$$\begin{aligned} ({\varvec{a}}_{(i_{1},i_{2})})_{i}= {\left\{ \begin{array}{ll} 1 \quad & i=i_{1},\\ -1 \quad & i=i_{2},\\ 0 \quad & \textrm{otherwise}, \end{array}\right. } i=1,\ldots ,n. \end{aligned}$$
(10)

We also define the mixed (2, 1)-norm (Lounici et al. 2011) for a \(|{\mathcal {E}}|p\)-dimensional vector \({\varvec{z}}\) as

$$\begin{aligned} \Vert {\varvec{z}} \Vert _{2,1} = \sum _{k=1}^{|{\mathcal {E}}|}\left( \sum _{j= (k-1)p + 1}^{kp}z_{j}^{2}\right) ^{1/2}. \end{aligned}$$

Using these definitions, the second term in (6) can be written as

$$\begin{aligned} \sum _{(i_{1},i_{2})\in {\mathcal {E}}} r_{i_{1},i_{2}} \Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} = \Vert D\textrm{vec}(U) \Vert _{2,1}. \end{aligned}$$
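This identity can be verified numerically on a small example, assuming that \(\textrm{vec}(U)\) stacks the rows \({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{n}\) (row-major flattening); the edge set, weights, and array names below are illustrative.

```python
import numpy as np

n, p = 4, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n, p))
edges = [(0, 1), (1, 2), (2, 3)]                 # the set E (0-indexed)
r = {e: 1.0 for e in edges}                      # weights r_{i1,i2}

# A_r: |E| x n matrix; row for (i1, i2) is r_{i1,i2} * a_{(i1,i2)}^T
A_r = np.zeros((len(edges), n))
for k, (i1, i2) in enumerate(edges):
    A_r[k, i1], A_r[k, i2] = r[(i1, i2)], -r[(i1, i2)]

D = np.kron(A_r, np.eye(p))                      # D = A_r (Kronecker) I_p
vecU = U.reshape(-1)                             # row-major vec(U)

lhs = sum(r[e] * np.linalg.norm(U[e[0]] - U[e[1]]) for e in edges)
rhs = sum(np.linalg.norm((D @ vecU)[k * p:(k + 1) * p]) for k in range(len(edges)))
assert np.isclose(lhs, rhs)                      # mixed (2,1)-norm identity holds
```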

Based on the above definitions, we summarize the relationship between the solution of the RRC and M-estimator in the following proposition.

Proposition 1

Suppose that \({\widehat{U}}\) is a convergence point in Algorithm 1 and \({\varvec{\psi }}({\varvec{o}};\lambda ,\gamma )={\varvec{o}} - {\varvec{\Theta }}({\varvec{o}};\lambda ,\gamma )\). Then, the \({\widehat{U}}\) satisfies

$$\begin{aligned} -\Psi (X-{\widehat{U}};\lambda _{2},\gamma )+\lambda _{1} \partial _{\textrm{vec}(U)}(\Vert D\textrm{vec}(U) \Vert _{2,1})|_{U={\widehat{U}}} \ni {\varvec{0}},\nonumber \!\!\!\!\!\\ \end{aligned}$$
(11)

where \(\partial _{\textrm{vec}(U)}\) is the subdifferential with respect to \(\textrm{vec}(U)\), and \(\Psi (X-{\widehat{U}};\lambda _{2},\gamma )\) is an np-dimensional vector defined as

$$\begin{aligned} \Psi (X-{\widehat{U}};\lambda _{2},\gamma ) = \begin{pmatrix} {\varvec{\psi }} ({\varvec{x}}_{1}-{\varvec{{\widehat{u}}}}_{1};\lambda _{2},\gamma ) \\ {\varvec{\psi }} ({\varvec{x}}_{2}-{\varvec{{\widehat{u}}}}_{2};\lambda _{2},\gamma ) \\ \vdots \\ {\varvec{\psi }} ({\varvec{x}}_{n}-{\varvec{{\widehat{u}}}}_{n};\lambda _{2},\gamma ) \end{pmatrix}. \end{aligned}$$

The proof is given in Appendix B.1. This proposition gives the relationship between the \({\widehat{U}}\) calculated by Algorithm 1 and the following minimization problem:

$$\begin{aligned} \min _{U} \left\{ \sum _{i=1}^{n} \rho _{\lambda _{2},\gamma }({\varvec{x}}_{i}-{\varvec{u}}_{i})+ \lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}} r_{i_{1},i_{2}} \Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2}\right\} ,\nonumber \!\!\!\!\\ \end{aligned}$$
(12)

where \(\rho _{\lambda ,\gamma }(\cdot )\) is a multivariate loss function that satisfies

$$\begin{aligned} \frac{\partial }{\partial {\varvec{z}}} \rho _{\lambda ,\gamma }({\varvec{z}}) = {\varvec{\psi }}({\varvec{z}};\lambda ,\gamma ). \end{aligned}$$

If we choose a regularization term \(P({\varvec{z}};\lambda ,\gamma )\) so that \(\rho _{\lambda ,\gamma }({\varvec{z}})\) is a weakly convex function (see e.g. Davis et al. 2018), the sum rule of the subdifferential (Ngai et al. 2000; Corollary 3.9) can be applied to the objective function of Problem (12). Then, the optimality condition regarding the stationary points coincides with the inclusion relationship (11), which means that any output \({\widehat{U}}\) is one of the stationary points of Problem (12). The loss functions induced from the group SCAD and group MCP are weakly convex, while the skipped mean loss induced from group hard thresholding is not weakly convex. The proofs of the weak convexity of these functions are given in Appendix B.4.

The sketch of some multivariate loss functions and corresponding group-thresholding functions is shown in Fig. 1. Those formulations are summarized in Appendix A.

Proposition 1 is inspired by similar propositions in She and Owen (2011) and Katayama and Fujisawa (2017). While they considered the linear regression case and only the situation where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is defined as a component-wise thresholding function, we consider the clustering problem and the situation where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is a group-thresholding function. This enables us to solve clustering problems with multivariate robust loss functions by instead optimizing problems with group penalties on the outlier parameters.
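As an illustration, the following is a sketch of one commonly used form of the group MCP thresholding operator and the induced \({\varvec{\psi }}\) function; the exact operators adopted in the paper are summarized in Appendix A, so this particular firm-thresholding form is an assumption here.

```python
import numpy as np

def group_soft(z, lam):
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

def group_mcp_threshold(z, lam, gamma):
    """Firm (group MCP) thresholding, assuming gamma > 1:
    (gamma/(gamma-1)) * group_soft(z, lam) if ||z||_2 <= gamma*lam, else z."""
    if np.linalg.norm(z) <= gamma * lam:
        return gamma / (gamma - 1.0) * group_soft(z, lam)
    return z

def psi(z, lam, gamma):
    """psi(z) = z - Theta(z): the 'score' of the induced multivariate loss."""
    return z - group_mcp_threshold(z, lam, gamma)
```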

Fig. 1
The multivariate loss functions (top row) and the group-thresholding functions (bottom row). The x-axis and y-axis represent the values of the input \({\varvec{z}}\in {\mathbb {R}}^{2}\). The z-axis shows the output \(\rho _{\lambda ,\gamma }({\varvec{z}})\) in the top row, and the first component of \({\varvec{\Theta }}({\varvec{z}};\lambda ,\gamma )\) in the bottom row. The values of \(\lambda \) and \(\gamma \) are fixed at three

3 Proposed method

3.1 Multi-task learning via robust regularized clustering

Almost all existing MTL methods based on clustering have not considered the presence of outlier tasks. For instance, MTLCVX shrinks the differences of centroids including those of outlier tasks, which contaminates the estimation of the centroids. As a result, the estimation of the regression coefficients for tasks corresponding to contaminated centroids also worsens. This motivates us to separate the estimation of the parameters concerning task clusters from the outlier tasks. To the best of our knowledge, only Yao et al. (2019) consider the presence of outlier tasks and the clustering of tasks simultaneously. However, how their method achieves robustness is not clearly described. On the other hand, some robust MTL methods (Chen et al. 2011; Gong et al. 2012) have attempted to address the issue of outlier tasks by introducing outlier parameters and selecting them using the group lasso regularization. However, the group lasso (Yuan and Lin 2006) limits the magnitude of the outlier parameters, which may not adequately represent their nature. To overcome these problems, we propose Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC is formulated as follows:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{w}_{0},W,U,O} \left\{ \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) +\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}-{\varvec{o}}_{m}\Vert _{2}^{2} \right. \\ &\qquad \quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2}\right. \\ &\qquad \quad \left. + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \right\} , \end{aligned} \end{aligned}$$
(13)

where \({\varvec{o}}_{m}=(o_{m1},\ldots ,o_{mp})^{\top }\in {\mathbb {R}}^{p}\) is a vector of outlier parameters for the m-th task and \(\lambda _{3}\) is a non-negative regularization parameter. The second through fourth terms are based on the minimization problem (6). Then, if \({\varvec{o}}_{m}\) is estimated to be a non-zero vector, the m-th task is considered to be an outlier task that does not share a common structure with the other tasks. We set the weights \(r_{m_{1},m_{2}}\) based on Eq. (3).

In this problem, the centroid parameters \({\varvec{u}}_{m}\), whose differences are shrunk, and the outlier parameters \({\varvec{o}}_{m}\), which are estimated to be zero vectors or not, are penalized separately. Thus, the cluster center component and the potential outlier component included in a task are estimated separately. Furthermore, if we employ non-convex regularization terms such as the group SCAD, we can allow \({\varvec{o}}_{m}\) to take a large value, which also leads to a large difference between \({\varvec{w}}_{m}\) and \({\varvec{u}}_{m}\). Then, the contamination from outlier tasks with significant unique characteristics is expected to be reduced. Next, we introduce another interpretation of the proposed method.

3.2 Interpretation through the BCD algorithm

3.2.1 Convex case

First, we consider the case where the group lasso is employed for \(P(\cdot ;\lambda ,\gamma )\) in (13). Then, since MTLRRC is a convex optimization problem, we can obtain another representation of (13) by minimizing with respect to O as follows:

$$\begin{aligned} & \min _{{\varvec{w}}_{0},W,U }\left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. +\lambda _{1}\sum _{m=1}^{T}h_{\lambda _{3}/\lambda _{1}}^{\textrm{M}}({\varvec{w}}_{m}-{\varvec{u}}_{m}) \nonumber \\ & \qquad \quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} , \end{aligned}$$
(14)

where \(h_{\lambda }^{\textrm{M}}(\cdot )\) is a multivariate Huber’s loss function (Hampel et al. 1986) defined as

$$\begin{aligned} h_{\lambda }^{\textrm{M}}({\varvec{z}})= {\left\{ \begin{array}{ll} \frac{1}{2}\Vert {\varvec{z}}\Vert _{2}^{2} & \Vert {\varvec{z}}\Vert _{2}\le \lambda ,\\ \lambda \Vert {\varvec{z}}\Vert _{2}-\frac{\lambda ^{2}}{2} & \Vert {\varvec{z}}\Vert _{2}> \lambda . \end{array}\right. } \end{aligned}$$

This representation helps us interpret the proposed method (13).
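The profiling out of \({\varvec{o}}_{m}\) that leads from (13) to (14) can be checked numerically: for a fixed \({\varvec{z}}={\varvec{w}}_{m}-{\varvec{u}}_{m}\), minimizing \(\frac{\lambda _{1}}{2}\Vert {\varvec{z}}-{\varvec{o}}\Vert _{2}^{2}+\lambda _{3}\Vert {\varvec{o}}\Vert _{2}\) over \({\varvec{o}}\) gives \(\lambda _{1}h^{\textrm{M}}_{\lambda _{3}/\lambda _{1}}({\varvec{z}})\). A small sketch with illustrative values:

```python
import numpy as np

def group_soft(z, lam):
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

def multivariate_huber(z, lam):
    norm = np.linalg.norm(z)
    return 0.5 * norm**2 if norm <= lam else lam * norm - 0.5 * lam**2

lam1, lam3 = 2.0, 1.0
rng = np.random.default_rng(1)
for _ in range(5):
    z = rng.normal(size=4)
    o = group_soft(z, lam3 / lam1)               # minimizer over o
    profiled = 0.5 * lam1 * np.linalg.norm(z - o) ** 2 + lam3 * np.linalg.norm(o)
    assert np.isclose(profiled, lam1 * multivariate_huber(z, lam3 / lam1))
```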

Algorithm 2
Block coordinate descent algorithm for Problem (14)

Let the objective function of the minimization problem (14) be \(L^{\textrm{MH}}({\varvec{w}}_{0},W,U)\). We consider solving the minimization problem (14) by Algorithm 2 based on the BCD algorithm. Since the minimization in terms of \({\varvec{w}}_{m}\) is separable, the update (15) is expressed as

$$\begin{aligned} \begin{aligned}&(w_{m0}^{(t+1)},{\varvec{w}}_{m}^{(t+1)})\\ &\quad = \mathop {\mathrm {arg~min}}\limits _{w_{m0},{\varvec{w}}_{m}}\left\{ \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) + \lambda _{1} h_{\lambda _{3}/\lambda _{1}}^{\text {M}}({\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}) \right\} ,\\ &\qquad m=1,\ldots ,T. \end{aligned} \end{aligned}$$

These updates estimate the regression coefficients for the m-th task to be close to the corresponding centroid. However, when \(\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}\Vert _{2}\) tends to take a value larger than \(\lambda _{3}/\lambda _{1}\), the shrinkage toward the centroid is reduced by the \(\ell _{2}\)-norm part of the multivariate Huber's function. Thus, if the m-th task is an outlier task, the estimated \(\widehat{{\varvec{w}}}_{m}\) is expected to be less affected by the common structure \(\widehat{{\varvec{u}}}_{m}\).

The minimization problem in terms of the update (16) falls within the framework of the minimization problem (12). This can be seen by replacing \(({\varvec{x}}_{i};i=1,\ldots ,n)\) in (12) with \(({\varvec{w}}_{m}^{(t+1)};m=1,\ldots ,T)\) and \(\rho _{\lambda _{2},\gamma }(\cdot )\) with \(h_{\lambda _{3}/\lambda _{1}}^{\text {M}}(\cdot )\). Consequently, the update of \(U^{(t+1)}\) is performed under the robust clustering of tasks. Based on the discussion of Algorithm 2, the estimated values \({\widehat{W}}\) and \({\widehat{U}}\) in MTLRRC can be regarded as a convergence point of an alternating estimation, which consists of a regression step that reduces the shrinkage of outlier tasks toward their cluster centers and a robust clustering step for the tasks. Therefore, MTLRRC is expected to be robust against outlier tasks.

3.2.2 Non-convex case

Next, we consider the case where non-convex group penalties are employed for \(P(\cdot ;\lambda ,\gamma )\) in (13). The estimates of the parameters can be calculated by Algorithm 3. Here, \(L^{\textrm{MR}}({\varvec{w}}_{0},W,U, O)\) is the objective function of the minimization problem (13). Then, the following proposition, similar to Proposition 1, holds.

Proposition 2

Let \({\varvec{w}}_{m}^{\prime }=(w_{m0},{\varvec{w}}_{m}^{\top })^{\top }\). Suppose that \(\left( \widehat{{\varvec{w}}}_{0},{\widehat{W}},\right. \) \(\left. {\widehat{U}}\right) \) is a pair of convergence point in Algorithm 3 and \({\varvec{\psi }}({\varvec{o}};\lambda ,\gamma )={\varvec{o}} - {\varvec{\Theta }}({\varvec{o}};\lambda ,\gamma )\). Then, \((\widehat{{\varvec{w}}}_{0},{\widehat{W}},{\widehat{U}})\) satisfies

$$\begin{aligned}&\frac{\partial }{\partial {\varvec{w}}_{m}^{\prime }}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})|_{{\varvec{w}}_{m}^{\prime }=\widehat{{\varvec{w}}}_{m}^{\prime }}\nonumber \\&\quad + \lambda _{1} \begin{pmatrix} 0 \\ {\varvec{\psi }}(\widehat{{\varvec{w}}}_{m}-\widehat{{\varvec{u}}}_{m};\lambda _{3}/\lambda _{1},\gamma ) \end{pmatrix} = {\varvec{0}},\quad m=1,\ldots ,T, \end{aligned}$$
(17)
$$\begin{aligned}&-\lambda _{1}\Psi ({\widehat{W}}-{\widehat{U}};\lambda _{3}/\lambda _{1},\gamma )\nonumber \\&\quad +\lambda _{2}\partial _{\textrm{vec}(U)}(\Vert D\textrm{vec}(U) \Vert _{2,1})|_{U={\widehat{U}}} \ni {\varvec{0}}. \end{aligned}$$
(18)

The proof is given in Appendix B.2. When \(P({\varvec{z}};\lambda ,\gamma )\) is chosen such that the corresponding \(\rho _{\lambda ,\gamma }({\varvec{z}})\) is a weakly convex function, Eqs. (17) and (18) are the first-order conditions for the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}_{0},W,U}&\left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. \\ &\quad +\lambda _{1}\sum _{m=1}^{T}\rho _{\lambda _{3}/\lambda _{1},\gamma }({\varvec{w}}_{m}-{\varvec{u}}_{m}) \\ &\quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} . \end{aligned}\nonumber \\ \end{aligned}$$
(19)

Therefore, we may expect that the solution in the case of non-convex penalties has an interpretation similar to that of the minimization problem (14).

Algorithm 3
Block coordinate descent algorithm for MTLRRC

4 Estimation algorithm via modified ADMM

MTLRRC can be estimated by Algorithm 3. However, this estimation procedure is computationally expensive, because the update of \(U^{(t+1)}\) involves solving a convex clustering problem, which is itself computationally demanding. To avoid this computation, we consider estimating the parameters included in MTLRRC by the alternating direction method of multipliers (ADMM; Boyd et al. 2011).

We consider the following minimization problem equivalent to Problem (13):

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}_{0},W,U,O}&\left\{ \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) +\frac{\lambda _{1}}{2}\Vert W-U-O\Vert _{F}^{2} \right. \\ &\left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2}\right. \\ &\left. + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \right\} , \\ &\mathrm {s.t.} \quad {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}}= {\varvec{b}}_{(m_{1},m_{2})},\quad (m_{1},m_{2})\in {\mathcal {E}}. \end{aligned} \end{aligned}$$

For this minimization problem, we consider the following augmented Lagrangian:

$$\begin{aligned} L_{\nu } ({\varvec{w}}_{0},W,U,O,B,S)= & \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})\nonumber \\ & \quad +\frac{\lambda _{1}}{2}\Vert W-U-O\Vert _{F}^{2} \nonumber \\ & \quad +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2} \!\!\!\!\nonumber \\ & \quad + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \nonumber \\ & \quad + \textrm{tr}(S^{\top }(B-A_{{\mathcal {E}}}U)) \nonumber \\ & \quad + \frac{\nu }{2} \Vert B-A_{{\mathcal {E}}}U\Vert _{F}^{2}, \end{aligned}$$
(20)

where \(\textrm{tr}(\cdot )\) denotes the trace operator, \(A_{{\mathcal {E}}}\) is a \(|{\mathcal {E}}|\times T\) matrix whose each row is \({\varvec{a}}_{m_{1},m_{2}}^{\top }\) defined with the same manner of Eq. (10), S is a \(|{\mathcal {E}}|\times p\) Lagrangian multipliers matrix, and \(\nu \) is a tuning parameter with non-negative value.

For this augmented Lagrangian, we consider the following updates of the modified ADMM:

$$\begin{aligned} \begin{aligned}&({\varvec{w}}_{0}^{(t+1)},W^{(t+1)}) = \mathop {\mathrm{arg~min}}\limits _{{\varvec{w}}_{0},W} L_{\nu } ({\varvec{w}}_{0},W,U^{(t)},O^{(t)},B,S^{(t)}), \\&U^{(t+1)} = \mathop {\mathrm{arg~min}}\limits _{U} \left( \min _{B}L_{\nu }({\varvec{w}}_{0}^{(t+1)},W^{(t+1)},U,O^{(t)},B,S^{(t)})\right) , \\&O^{(t+1)} = \mathop {\mathrm{arg~min}}\limits _{O} L_{\nu } ({\varvec{w}}_{0}^{(t+1)},W^{(t+1)},U^{(t+1)},O,B,S^{(t)}), \\&S^{(t+1)}_{(m_{1},m_{2})} = \textrm{prox}((S^{(t)}\\&\qquad \qquad \quad +\nu A_{{\mathcal {E}}} U^{(t+1)})_{(m_{1},m_{2})}, \lambda _{2}r_{m_{1},m_{2}}),\quad (m_{1},m_{2})\in {\mathcal {E}}, \\ \end{aligned} \end{aligned}$$

where \(\textrm{prox}(\cdot ,\lambda )\) is defined as

$$\begin{aligned} \textrm{prox}({\varvec{z}},\lambda )=\textrm{min}(\Vert {\varvec{z}}\Vert _{2},\lambda )\frac{{\varvec{z}}}{\Vert {\varvec{z}}\Vert _{2}}. \end{aligned}$$
(21)

The minimization with respect to U and B is performed jointly. The minimization with respect to B can be written explicitly, and we only need to solve the minimization with respect to U using the gradient method. These modifications of the updates of U, B, and S are based on the idea of Shimmura and Suzuki (2022). Their details and derivations are provided in Appendix C.
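A sketch of the operator in (21), which projects a vector onto the \(\ell _{2}\)-ball of radius \(\lambda \) (the zero vector is handled explicitly); the schematic comment indicates how it enters the S update, while the full derivation of the U, B, and S updates is in Appendix C.

```python
import numpy as np

def prox_ball(z, lam):
    """prox(z, lam) = min(||z||_2, lam) * z / ||z||_2 (projection onto the lam-ball)."""
    norm = np.linalg.norm(z)
    if norm <= lam:              # includes z = 0
        return z.copy()
    return lam * z / norm

# Dual update for each edge (m1, m2) in E (schematic):
# row of S_new for (m1, m2) = prox_ball(row of (S_old + nu * A_E @ U_new), lam2 * r[m1, m2])
```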

The update for \({\varvec{w}}_{0}\) and W is given by solving the independent regularized GLMs for each task, which are expressed as

$$\begin{aligned} \begin{aligned}&(w_{m0}^{(t+1)},{\varvec{w}}_{m}^{(t+1)}) = \mathop {\mathrm {arg~min}}\limits _{w_{m0},{\varvec{w}}_{m}}\left\{ \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. \\ &\quad \left. + \frac{\lambda _{1}}{2}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}-{\varvec{o}}_{m}^{(t)} \Vert _{2}^{2} \right\} ,\;m=1,\ldots ,T. \end{aligned} \end{aligned}$$

These minimization problems are solved by the Newton–Raphson method provided in Algorithm 4. The update of O is given in the same manner as (9). As a result, we obtain the estimation algorithm for MTLRRC as Algorithm 5. Here, \(\textrm{STL}(\cdot ,\cdot )\) is a function returning a regression coefficient vector estimated by an arbitrary single-task learning method.
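For the linear regression loss (with the intercept omitted, as noted in Sect. 2.1), this update has a closed form, which can serve as a reference for the special case; the general GLM case requires the Newton–Raphson routine of Algorithm 4. The following is a minimal sketch under this assumption, with an illustrative function name.

```python
import numpy as np

def update_w_linear(X, y, u, o, lam1):
    """argmin_w (1/n_m) * 0.5 * ||y - X w||^2 + (lam1/2) * ||w - u - o||^2."""
    n, p = X.shape
    A = X.T @ X / n + lam1 * np.eye(p)
    b = X.T @ y / n + lam1 * (u + o)
    return np.linalg.solve(A, b)
```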

Although the convergence of ADMM for non-convex functions has been shown in some studies (e.g. Wang et al. 2019 and Fan and Yin 2024), the convergence of the proposed method is non-trivial because of the modification of the ADMM algorithm based on Shimmura and Suzuki (2022). We provide a theoretical guarantee regarding convergence to a limit point, which is summarized as the following theorem.

Theorem 1

Let \(\lambda _{++}(\cdot )\) be the strictly positive minimum eigenvalue of a matrix and assume that \(L(\cdot ,\cdot )\) is a convex function and \(P(\cdot ;\lambda ,\gamma )\) is a weakly convex function with modulus \(C_{p}\). If \(\lambda _{1}>C_{p}\) and \(\nu >\frac{2\lambda _{1}^{2}}{\lambda _{++}(A_{{\mathcal {E}}}^{\top }A_{{\mathcal {E}}})(\lambda _{1}-C_{p})}\) are satisfied, the sequence \(\{{\varvec{w}}_{0}^{(t)},W^{(t)}, U^{(t)},O^{(t)},B^{(t)},S^{(t)}\}\) generated by Algorithm 5 converges to a limit point \(\{\widehat{{\varvec{w}}}_{0}^{(*)},{\widehat{W}}^{(*)},{\widehat{U}}^{(*)},{\widehat{O}}^{(*)},{\widehat{B}}^{(*)},{\widehat{S}}^{(*)}\}\). In addition, any limit point is a stationary point of the augmented Lagrangian (20).

This theorem ensures that our modified ADMM algorithm converges to a stationary point under mild conditions. The conditions provide practical guidelines for parameter selection in the algorithm. Specifically, we can ensure convergence by setting \(\lambda _1\) larger than the weak convexity modulus \(C_p\) and choosing an appropriate step size \(\nu \). However, as we will see in later simulation studies, even if these conditions are not satisfied, the algorithm will converge empirically in many situations. The proof of Theorem 1 is given in the Supplementary Material.

Furthermore, we consider the following augmented Lagrangian derived from Problem (19):

$$\begin{aligned} \begin{aligned} L_{\nu } ({\varvec{w}}_{0},W,U,B,S)&= \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \\ &\quad +\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\rho _{\lambda _{3}/\lambda _{1},\gamma }({\varvec{w}}_{m}-{\varvec{u}}_{m}) \\ &\quad +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2}\!\!\!\!\! \\ &\quad + \text {tr}(S^{\top }(B-A_{{\mathcal {E}}}U)) \\ &\quad + \frac{\nu }{2} \Vert B-A_{{\mathcal {E}}}U\Vert _{F}^{2}. \end{aligned} \end{aligned}$$
(22)

Then, the following proposition holds.

Proposition 3

Suppose that the assumptions of Theorem 1 hold with \(\lambda _{1}>C_{p}\) and \(\nu >\frac{2\lambda _{1}^{2}}{\lambda _{++}(A_{{\mathcal {E}}}^{\top }A_{{\mathcal {E}}})(\lambda _{1}-C_{p})}\). Then, any limit point \(\{\widehat{{\varvec{w}}}_{0}^{(*)},{\widehat{W}}^{(*)},{\widehat{U}}^{(*)},{\widehat{B}}^{(*)},{\widehat{S}}^{(*)}\}\) is a stationary point of the minimization problem concerning (22).

This proposition is the ADMM version of Proposition 2, which would justify minimizing MTLRRC by using the modified ADMM algorithm instead of the BCD algorithm. Note that due to the non-convexity of the objective function except when group lasso is used, there may be a duality gap between the original MTLRRC and the minimization problem concerning its augmented Lagrangian.

We analyze the computational complexity of our proposed method. The initialization of \(\iota \) requires \({\mathcal {O}}(|{\mathcal {E}}|T^{2})\) operations, which is computed only once before the ADMM iterations. Within each ADMM iteration, updating \(({\varvec{w}}_{0}^{(t+1)},W^{(t+1)})\) via the Newton–Raphson method requires \({\mathcal {O}}(N_{\textrm{NR}} p^{2}(n+Tp))\) operations, where \(N_{\textrm{NR}}\) denotes the number of Newton-Raphson iterations. The update of \(U^{(t+1)}\) via accelerated gradient method requires \({\mathcal {O}}(N_{\text {AG}}|{\mathcal {E}}|Tp)\) operations, where \(N_{\text {AG}}\) represents the number of accelerated gradient method iterations. The update of \(O^{(t+1)}\) requires \({\mathcal {O}}(Tp)\) operations. The update of \(S^{(t+1)}\) requires \({\mathcal {O}}(|{\mathcal {E}}|Tp)\) operations. Consequently, each iteration of Algorithm 5 has an overall computational complexity of \({\mathcal {O}}(N_{\text {NR}}p^{2}(n+Tp)+N_{\text {AG}}|{\mathcal {E}}|Tp)\). In contrast, Algorithm 3 requires \({\mathcal {O}}(N_{\text {NR}}p^{2}(n+Tp)+N_{\text {A}}N_{\text {AG}}|{\mathcal {E}}|Tp)\) operations, where \(N_{\textrm{A}}\) denotes the number of ADMM iterations required for solving the convex clustering problem. Thus, the computational complexity of the BCD algorithm regarding the second term is \(N_{\textrm{A}}\) times larger than that of the modified ADMM algorithm, which can be substantial when \(N_{\textrm{A}}\) is large.

We further compare our method with RCMTL (Yao et al. 2019), which is the only existing method that addresses both robustness against outlier tasks and task clustering. RCMTL employs a BCD algorithm incorporating two inner ADMM algorithms. Their algorithm requires \({\mathcal {O}}(N_{\mathrm {A_{1}}}N_{\textrm{G}}Tp(n+T)+N_{\textrm{A}_{2}}T^{2}p)\) operations in each iteration, where \(N_{\mathrm {A_{1}}}\), \(N_{\textrm{A}_{2}}\), and \(N_{\textrm{G}}\) denote the iteration counts for the first inner ADMM, second inner ADMM, and gradient method within the first inner ADMM, respectively. In comparing MTLRRC and RCMTL, the first term in both complexity expressions primarily reflects the computational cost of updating regression coefficients, while the second term corresponds to updating the task relationship components. For the first term, direct comparison is difficult as RCMTL employs backtracking to avoid matrix inversion computations. Regarding the second term in our modified ADMM, when the weights \(r_{m_{1},m_{2}}\) are constructed according to Eq. (3), the computational complexity becomes \({\mathcal {O}}(N_{\text {AG}}kT^{2}p)\) since \(|{\mathcal {E}}|\le kT\). Given that k is typically small (e.g., \(k = 5\)) in the context of convex clustering and considering accelerated gradient method’s quadratic convergence rate, our modified ADMM algorithm achieves computational efficiency comparable to or better than RCMTL in the task relationship part.

Algorithm 4
Newton–Raphson method for updating \(w_{m0}\) and \({\varvec{w}}_{m}\)

Algorithm 5
Estimation algorithm of MTLRRC via modified ADMM

5 Simulation studies

In this section, we report simulation studies in the linear regression setting. We generated data by the true model:

$$\begin{aligned} \begin{aligned} {\varvec{y}}_{m} = X_{m}{\varvec{w}}_{m}^{*}+{\varvec{\epsilon }}_{m}, \quad m=1,\ldots ,T, \end{aligned} \end{aligned}$$

where \({\varvec{\epsilon }}_{m}\) is an error term whose components are independently distributed as \(N(0,\sigma ^{2})\), \(X_{m}\) is a design matrix generated independently from \(N_{p}({\varvec{0}},I_{p})\), and \({\varvec{w}}^{*}_{m}\) is a true regression coefficient vector for the m-th task. For this true model, the T tasks consist of C true clusters and some outlier tasks. First, all tasks were assigned to the C clusters with the same number of tasks, T/C, in each cluster. Then, some of them were randomly reassigned as outlier tasks.

For the true structure of regression coefficient vectors, we considered the following two cases:

$$\begin{aligned} \begin{aligned} \text{ Case } \text{1: }&\quad {\varvec{w}}_{m}^{*}={\varvec{u}}_{c}^{*} + {\varvec{v}}_{m}^{c*}+I(\tau _{m}=1){\varvec{o}}_{m}^{c*}, \\ \text{ Case } \text{2: }&\quad {\varvec{w}}_{m}^{*} = {\left\{ \begin{array}{ll} {\varvec{u}}_{c}^{*} + {\varvec{v}}_{m}^{c*} & \textrm{if}\;\tau _{m} =0, \\ {\varvec{o}}_{m}^{*} & \textrm{if}\; \tau _{m} = 1, \end{array}\right. } \end{aligned} \end{aligned}$$

where \({\varvec{u}}_{c}^{*}\) is a true cluster center for the c-th cluster, \({\varvec{v}}_{m}^{c*}\) is a true task-specific parameter for the m-th task belonging to the c-th cluster, \({\varvec{o}}_{m}^{*}\) is a true outlier parameter for the m-th task, and \(\tau _{m}\) is a random variable distributed as \( P(\tau _{m}=1)=\kappa \) and \(P(\tau _{m}=0)=1-\kappa \). \(\tau _{m}=1\) means that the m-th task is assigned as an outlier task. Case 1 considers a situation where outlier tasks share the same cluster center with other tasks, but an outlier parameter is added. On the other hand, Case 2 considers a situation where outlier tasks do not have any common structure with other tasks.

The parameters \({\varvec{u}}_{c}^{*},{\varvec{v}}_{m}^{c*}\), and \({\varvec{o}}_{m}^{*}\) were generated as follows. First, each explanatory variable \(\{j=1,\ldots ,p\}\) was randomly assigned to the c-th clusters \(\{c=1,\ldots ,C \}\) with the same probability. Then, we generated a true centroid parameter for c-th cluster \({\varvec{u}}^{*}_{c} = (u^{*}_{c1},\ldots ,u^{*}_{cp})^{\top }\) by

$$\begin{aligned} u_{cj}^{*} {\left\{ \begin{array}{ll} \sim N(0,10) & \text {if }j\text {-th variable is assigned to } c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p. \end{aligned}$$

Next, we generated a true task-specific parameter for m-th task that belongs to c-th cluster \({\varvec{v}}^{c*}_{m} = (v^{c*}_{m1},\ldots ,v^{c*}_{mp})^{\top }\) by

$$\begin{aligned} v_{mj}^{c*} {\left\{ \begin{array}{ll} \sim N(0,1) & \text {if }j\text {-th variable is assigned to }c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p. \end{aligned}$$

For Case 1, we generated a true outlier parameter for m-th task belonging to c-th cluster but assigned to an outlier task \({\varvec{o}}_{m}^{c*}=(o_{m1}^{c*},\ldots ,o_{mp}^{c*})^{\top }\) by

$$\begin{aligned} o_{mj}^{c*} {\left\{ \begin{array}{ll} \sim f^{\textrm{MTN}}(o) & \text {if }j\text {-th variable is assigned to }c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p, \end{aligned}$$

where \(f^{\textrm{MTN}}(o)\) is a mixture of truncated normal distribution given by

$$\begin{aligned} f^{\textrm{MTN}}(o)= & 0.5f(o,-\infty ,-3,-3,\sigma _{o}^{2})\\ & + 0.5f(o,3,\infty ,3,\sigma _{o}^{2}), \end{aligned}$$

where \(f(o,a,b,\mu ,\sigma ^{2})\) is a truncated normal distribution on \(a\le o \le b\) whose original normal distribution has mean \(\mu \) and variance \(\sigma ^{2}\). The reason for generating \(o_{mj}^{c*}\) from \(f^{\textrm{MTN}}(o)\) is to keep the absolute value of \(o_{mj}^{c*}\) away from zero so that the outlier task is located away from the cluster. For Case 2, we generated a true outlier parameter for the m-th task \({\varvec{o}}_{m}^{*}=(o_{m1}^{*},\ldots ,o_{mp}^{*})^{\top }\) by

$$\begin{aligned} o_{mj}^{*} \sim U(-10,10), \quad j=1,\ldots ,p, \end{aligned}$$

where U(ab) is the continuous uniform distribution.

From these generating procedures for \({\varvec{u}}_{c}^{*},{\varvec{v}}_{m}^{c*}\), and \({\varvec{o}}_{m}^{*}\), true regression coefficient vectors for non-outlier tasks belonging to different clusters have different non-zero variables. In other words, tasks belonging to different clusters are orthogonal to each other. This orthogonal setting has been used in some studies of MTL (Jacob et al. 2008; Zhou and Zhao 2016).

For our true model, we set \(n_{m}=200\), \(p=100\), \(T=150\), and \(\sigma ^{2}=5\). The 200 samples in each task were split into 50 samples for training, 100 samples for validation, and the remaining 50 samples for testing. We considered the settings \(\kappa \in \{0,0.1,0.2,0.3,0.4\}\).
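To make the data-generating process concrete, the following is a compact sketch of how the true coefficients for Case 1 might be generated. The number of clusters C, the standard deviation `sigma_o`, the interpretation of \(N(0,10)\) as having variance 10, the seed, and the helper names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
T, p, C, kappa, sigma_o = 150, 100, 3, 0.2, 1.0   # C and sigma_o assumed for illustration

cluster_of_task = np.repeat(np.arange(C), T // C)  # tasks split evenly into C clusters
cluster_of_var = rng.integers(0, C, size=p)        # each variable assigned to one cluster
tau = rng.binomial(1, kappa, size=T)               # tau_m = 1 marks an outlier task

def mixture_truncnorm(size):
    """0.5 * TN(-inf, -3; mean -3) + 0.5 * TN(3, inf; mean 3), sd sigma_o."""
    left = truncnorm.rvs(-np.inf, 0.0, loc=-3.0, scale=sigma_o, size=size, random_state=rng)
    right = truncnorm.rvs(0.0, np.inf, loc=3.0, scale=sigma_o, size=size, random_state=rng)
    return np.where(rng.random(size) < 0.5, left, right)

U_true = np.zeros((C, p))
for c in range(C):
    mask = cluster_of_var == c
    U_true[c, mask] = rng.normal(0.0, np.sqrt(10.0), size=mask.sum())  # N(0, 10): variance 10 assumed

W_true = np.zeros((T, p))
for m in range(T):
    c = cluster_of_task[m]
    mask = cluster_of_var == c
    W_true[m, mask] = U_true[c, mask] + rng.normal(0.0, 1.0, size=mask.sum())
    if tau[m] == 1:                                # Case 1: add an outlier component
        W_true[m, mask] += mixture_truncnorm(mask.sum())
```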

We compared MTLRRC with several methods for the evaluation. For MTLRRC, we consider the three cases where the group lasso (GL), group SCAD (GS), and group MCP (GM) are used for the fourth term in (13). Here, we set \(\gamma =3.7\) for the group SCAD and \(\gamma =3\) for the group MCP. As other competing methods, we employed MTLCVX, MTLK (multi-task learning via k-means; Argyriou et al. 2007), RCMTL (Yao et al. 2019), and Hotelling-like outlier task detection with MTLK (HMTLK). Here, HMTLK was performed as follows. First, \(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}\; (m=1,\ldots ,T)\) were estimated, and their sample mean \(\bar{{\varvec{w}}}^{\textrm{STL}}\) and covariance matrix \(\bar{\Sigma }^{\textrm{STL}}\) were calculated. From these values, we calculated the statistic \(h_{m}=(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}-\bar{{\varvec{w}}}^{\textrm{STL}})^{\top }(\bar{\Sigma }^{\textrm{STL}})^{-1}(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}-\bar{{\varvec{w}}}^{\textrm{STL}})\). Then, we detected the tasks that satisfied \(h_{m}\ge \chi ^{(95)}_{p}\), where \(\chi ^{(95)}_{p}\) is the 95th percentile point of the \(\chi ^{2}\) distribution with p degrees of freedom. Finally, we estimated the regression coefficients by MTLK excluding the detected outlier tasks. Note that the detection based on Hotelling's \(T^{2}\) is not theoretically justified for this simulation setting. However, as we will see later, it is able to detect some outlier tasks.
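A sketch of the HMTLK screening step described above, assuming the single-task estimates are stacked in a \(T\times p\) array `W_stl`; the use of a pseudo-inverse here is an implementation choice for illustration rather than part of the original procedure.

```python
import numpy as np
from scipy.stats import chi2

def hotelling_outlier_tasks(W_stl, level=0.95):
    """Flag task m if h_m = (w_m - w_bar)^T Sigma^{-1} (w_m - w_bar) exceeds
    the `level` quantile of the chi-square distribution with p degrees of freedom."""
    T, p = W_stl.shape
    w_bar = W_stl.mean(axis=0)
    Sigma = np.cov(W_stl, rowvar=False)
    Sigma_inv = np.linalg.pinv(Sigma)        # pseudo-inverse for numerical stability
    diffs = W_stl - w_bar
    h = np.einsum("mj,jk,mk->m", diffs, Sigma_inv, diffs)
    return np.where(h >= chi2.ppf(level, df=p))[0]
```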

The weights \(r_{m_{1},m_{2}}\) for both MTLCVX and MTLRRC were calculated by Eq. (3), with k set to five. The estimation of \(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}\) in Eq. (3) and in HMTLK was performed by the lasso in the R package “glmnet”. In addition, the initialization of RCMTL was done by singular value decomposition of the initial regression coefficient matrix, which was also calculated by “glmnet”. The tuning parameter \(\nu \) included in Algorithm 5 was set to one. The regularization parameters were determined by the validation data. For the evaluation, we calculated the normalized mean squared error (NMSE), root mean squared error (RMSE), true positive rate (TPR), and false positive rate (FPR):

$$\begin{aligned} \begin{aligned} \textrm{NMSE}&= \frac{1}{T} \sum _{m=1}^{T}\frac{\Vert {\varvec{y}}^{*}_{m}-X_{m}\widehat{{\varvec{w}}}_{m} \Vert _{2}^{2}}{n_{m}\textrm{Var}({\varvec{y_{m}^{*}}})},\\ \textrm{RMSE}&= \frac{1}{T}\sqrt{\sum _{m=1}^{T}\Vert {\varvec{w}}^{*}_{m}-\widehat{{\varvec{w}}}_{m} \Vert _{2}^{2}},\\ \textrm{TPR}&= \frac{\#\left\{ m;{\varvec{o}}_{m}^{*} \ne {\varvec{0}} \;\wedge \;\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\right\} }{\#\left\{ m;{\varvec{o}}_{m}^{*}\ne {\varvec{0}}\right\} }, \\ \textrm{FPR}&= \frac{\#\left\{ m;{\varvec{o}}_{m}^{*} = {\varvec{0}} \;\wedge \;\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\right\} }{\#\left\{ m;{\varvec{o}}_{m}^{*}= {\varvec{0}}\right\} }. \end{aligned} \end{aligned}$$

NMSE and RMSE evaluate the accuracy of the prediction and of the estimated regression coefficients, respectively. TPR and FPR evaluate the accuracy of outlier detection. These values were computed over 40 repetitions, and the mean and standard deviation were obtained for each setting.
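A sketch of the four evaluation metrics defined above, assuming lists of per-task arrays (`y_true[m]`, `X[m]`, `W_hat[m]`) and that the \(\widehat{{\varvec{o}}}_{m}\) vectors are stacked row-wise in `O_hat`; names are illustrative, and the population variance is used for \(\textrm{Var}({\varvec{y}}_{m}^{*})\).

```python
import numpy as np

def nmse(y_true, X, W_hat):
    T = len(y_true)
    return np.mean([np.sum((y_true[m] - X[m] @ W_hat[m]) ** 2)
                    / (len(y_true[m]) * np.var(y_true[m])) for m in range(T)])

def rmse(W_true, W_hat):
    T = len(W_true)
    return np.sqrt(np.sum((np.asarray(W_true) - np.asarray(W_hat)) ** 2)) / T

def tpr_fpr(O_true, O_hat, tol=1e-10):
    true_out = np.linalg.norm(O_true, axis=1) > tol   # tasks with o_m^* != 0
    est_out = np.linalg.norm(O_hat, axis=1) > tol     # tasks with estimated o_m != 0
    tpr = np.sum(true_out & est_out) / max(np.sum(true_out), 1)
    fpr = np.sum(~true_out & est_out) / max(np.sum(~true_out), 1)
    return tpr, fpr
```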

Tables 1 and 2 show the results of the simulation studies for Cases 1 and 2, respectively. For Case 1, MTLRRC and MTLCVX show almost the same accuracy in NMSE and RMSE for all values of \(\kappa \). MTLRRC, MTLCVX, and MTLK, which do not remove the outlier tasks a priori, outperform HMTLK in terms of NMSE and RMSE. This may suggest that multi-task learning improves the estimation accuracy even for outlier tasks that are located away from other tasks. As for TPR, HMTLK shows better results than MTLRRC. If the purpose is only to detect and eliminate outlier tasks, HMTLK is probably a better choice than robust MTL methods. For FPR, MTLRRC with non-convex penalties achieves almost the best performance for any \(\kappa \). For Case 2, MTLRRC with non-convex penalties shows better performance than MTLCVX in terms of NMSE and RMSE. Furthermore, MTLRRC with non-convex penalties is superior to HMTLK in terms of both TPR and FPR. As for RCMTL, the results do not show any superior performance in any setting. This is possibly because the initial values obtained from “glmnet” were incompatible with their parameter initialization method based on singular value decomposition, which causes the optimization algorithm to fall into local optima with poor accuracy. Note that in Tables 1 and 2, we do not highlight any TPR of RCMTL regardless of its large value, because RCMTL selects almost all tasks, including non-outlier tasks, and the value is therefore not meaningful.

On the whole, MTLRRC with non-convex penalties detected true outlier tasks while keeping the detection of false outlier tasks low for both Case 1 and Case 2. However, the differences in estimation accuracy between MTLCVX and MTLRRC are small, particularly for small \(\kappa \). On the other hand, MTLRRC with the group lasso regularization shows poor performance even in terms of TPR and FPR. For outlier task detection, the group lasso regularization would not be recommended.

Table 1 Simulation result of Case 1
Table 2 Simulation result of Case 2

We conducted a sensitivity analysis for the selection of the regularization parameters and \(\gamma \). The analysis used a fixed dataset generated from Case 1 and Case 2 with \(\kappa =0.2\), maintaining the other settings consistent with our previous simulation studies. For the regularization term, we employed the group SCAD penalty. First, we selected the optimal regularization parameters by grid search with \(\gamma \) fixed at 3.7. Then, we estimated MTLRRC and calculated the evaluation values by varying only one parameter while fixing the other parameters at their optimal values. Consequently, we obtain curves representing the empirical relationship between each tuning parameter and the evaluation values, as shown in Figs. 2 and 3 for Case 1 and Case 2, respectively.

For \(\lambda _{1}\), the selected value is the same for both Case 1 and Case 2. Only in Case 2 are all evaluation values sensitive to the selection of \(\lambda _{1}\). Since the trends of all evaluation values are synchronized, selecting the optimal value that minimizes NMSE yields favorable TPR and FPR values. Therefore, it would be essential to search for \(\lambda _{1}\) over a fine grid. For \(\lambda _{2}\), in Case 1, while the curves of NMSE, RMSE, and FPR are synchronized, that of TPR is not. This would explain the small TPR in the Case 1 results, as seen in Table 1. In Case 2, the evaluation values are consistent and less sensitive to the value of \(\lambda _{2}\). For \(\lambda _{3}\), in Case 1, there is no value of \(\lambda _{3}\) that improves both FPR and TPR. The reason for this is that an appropriate \(\lambda _{2}\) is not selected, and the search for the true structure regarding outlier tasks and the cluster structure has failed. On the other hand, in Case 2, when \(\lambda _{3}\) is larger than a certain value, TPR and NMSE worsen simultaneously. This result is consistent with our motivation that detecting outliers improves estimation accuracy. For \(\gamma \), its value has a limited impact when the other tuning parameters are appropriately selected. On the whole, if the outlier tasks have large task-specific characteristics, as in Case 2, then the proposed method can appropriately detect the outlier tasks by choosing the parameters that minimize NMSE. However, if the latent outlier components are subtle, as in Case 1, then selecting the parameters based on NMSE may fail to identify the outlier tasks. This limitation requires further investigation in future work.

Fig. 2
Evaluation values against regularization parameters under Case 1. Each figure shows the relationship between one tuning parameter and the evaluation values. The horizontal axis represents the parameter value, and the vertical axis represents the evaluation values. The blue line represents NMSE, the green line represents RMSE, the red line represents TPR, and the purple line represents FPR. Note that the RMSEs are normalized by their maximum values. The dashed line corresponds to the optimal value of the parameter that minimizes NMSE

Fig. 3
Evaluation values against regularization parameters under Case 2

We also check the convergence of our modified ADMM algorithm with the group SCAD penalty. Figure 4 shows the convergence results under Case 2. Because \(C_{p}\) of the group SCAD is \(\frac{1}{\gamma -1}\), \(\lambda _{1}>0.371\) satisfies \(\lambda _{1}>C_{p}\) for \(\gamma =3.7\). The figure shows that all parameters converge within forty iterations for both \(\lambda _{1}>C_{p}\) and \(\lambda _{1}<C_{p}\).

Fig. 4
Convergence of modified ADMM algorithm under Case 2. The horizontal axis represents the number of iterations (t). The vertical axis shows the difference of parameter values from the previous iteration: the blue line is \(\Vert W^{(t)}-W^{(t+1)}\Vert _{F}\), the green line is \(\Vert U^{(t)}-U^{(t+1)}\Vert _{F}\), the red line is \(\Vert O^{(t)}-O^{(t+1)}\Vert _{F}\), and the purple line is \(\Vert S^{(t)}-S^{(t+1)}\Vert _{F}\)

6 Application to real datasets

In this section, we apply MTLRRC to three real datasets. The first is the landmine data (Xue et al. 2007), which consists of nine-dimensional features and the corresponding binary labels for 29 tasks. Each task corresponds to a landmine field where data were collected: tasks 1–15 correspond to regions that are relatively highly foliated, and tasks 16–29 correspond to regions that are bare earth or desert. Thus, there may be two clusters depending on the ground surface conditions. The responses represent landmines or clutter. The features are four moment-based features, three correlation-based features, one energy ratio feature, and one spatial variance feature, all extracted from radar images. This dataset contains 14,820 samples in total, but the numbers of positive and negative samples are quite unbalanced. To perform the analysis stably, down-sampling was done by reducing the negative samples to equal the number of positive samples. As a result, we used 1,808 samples in total. The second dataset is the school data (Bakker and Heskes 2003), which consists of examination scores of 15,362 students, four school-specific attributes, and three student-specific attributes from 139 secondary schools in London from 1985 to 1987. The dataset was obtained from the “MALSAR” package in MATLAB, in which categorical attributes were replaced with binary attributes. Then, we used 28-dimensional explanatory variables and the examination scores as the response. Each school is considered a task. The third dataset is the microarray data (Wille et al. 2004), which consists of microarray gene expression data focusing on isoprenoid biosynthesis in plants. The dataset contains expression levels of 21 genes in the mevalonate pathway and expression levels of 18 genes in the plastidial pathway. We used those 21 genes as feature variables and each of the 18 genes as a task. Since the dataset consists of only one design matrix, the setting is rather that of multivariate regression.

Table 3 AUC and NMSE for real data in 100 repetitions
Fig. 5

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the landmine data

Fig. 6

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the school data

Fig. 7

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the microarray data

Fig. 8

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the landmine data

Fig. 9

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the school data

Fig. 10

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the microarray data

We compared MTLRRC with MTLCVX, RCMTL, DTFLR (distributed spanning-tree-based fused-lasso regression; Zhang et al. 2024), and MTLK. First, we split the samples in each task into \(60\%\) for training, \(20\%\) for validation, and \(20\%\) for testing. The regularization parameters are determined by the validation data. For MTLRRC, we use the non-convex penalties and consider four cases: in the first case, \(\gamma \) is fixed at 3 for the group MCP (GM); in the second case, \(\gamma \) is fixed at 3.7 for the group SCAD (GS); in the third and fourth cases, \(\gamma \) is chosen by the validation data for the group MCP and the group SCAD, respectively (GM\(\gamma \) and GS\(\gamma \)). For evaluation, we calculated NMSE for the school data and the microarray data and AUC for the landmine data. These values were computed 100 times with random splits of each dataset.
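The evaluation protocol amounts to repeated random splitting within each task followed by AUC or NMSE computation on the test portion. The sketch below illustrates one repetition, using scikit-learn's `roc_auc_score` for the landmine data; `fit_and_select`, which performs the fit and validation-based parameter selection, and the per-task `predict` interface are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def split_task(X, y, rng):
    # 60% train / 20% validation / 20% test split within one task.
    n = len(y)
    perm = rng.permutation(n)
    n_tr, n_val = int(0.6 * n), int(0.2 * n)
    tr, val, te = perm[:n_tr], perm[n_tr:n_tr + n_val], perm[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])

def evaluate_once(X_tasks, y_tasks, fit_and_select, metric, rng):
    splits = [split_task(X, y, rng) for X, y in zip(X_tasks, y_tasks)]
    train, val, test = zip(*splits)
    model = fit_and_select(train, val)          # hypothetical fit + parameter selection
    scores = [metric(y_te, model.predict(m, X_te))
              for m, (X_te, y_te) in enumerate(test)]
    return np.mean(scores)

# For the landmine data one would pass metric=roc_auc_score; for the school and
# microarray data a normalized MSE would be used instead.
```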

Table 3 shows the mean and standard deviation over 100 repetitions. Due to memory restrictions, the implementation code published by Zhang et al. (2024) was infeasible for the school data. From this table, we observe that MTLRRC (GS\(\gamma \)) gives a smaller NMSE than the other methods for the school data and the microarray data and a larger AUC for the landmine data.

Next, we calculated the mean of the estimated parameters \({\widehat{W}},{\widehat{U}}\), and \({\widehat{O}}\) in MTLRRC (GS\(\gamma \)). Figures 5, 6, and 7 show the mean of the estimated parameter values for the landmine data, the school data, and the microarray data, respectively. The vertical axis is the index of tasks, the horizontal axis is the index of features, and each color shows the mean of the estimated parameters. Figures 8, 9, and 10 are bar plots showing the ratio of repetitions in which each task was detected as an outlier task; tasks that were never detected are omitted from the bar plots.
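The detection ratios in Figures 8, 9, and 10 are the fractions of repetitions in which \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\). A minimal sketch, assuming the estimated outlier parameters from all repetitions are stacked into a three-dimensional array, is as follows.

```python
import numpy as np

def outlier_detection_ratio(O_hats, tol=1e-8):
    """Fraction of repetitions in which each task's outlier parameter is nonzero.

    O_hats: array of shape (n_repetitions, n_tasks, n_features) holding the
    estimated outlier parameters from each repetition (hypothetical layout).
    """
    detected = np.linalg.norm(O_hats, axis=2) > tol   # nonzero row per repetition and task
    return detected.mean(axis=0)                      # detection ratio per task

# Example: ratios = outlier_detection_ratio(np.stack(estimated_O_list));
# tasks with a ratio above 0.3 correspond to those highlighted in the text.
```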

For the landmine data, Fig. 5b suggests the presence of two clusters, which is consistent with the fact that tasks 1–15 and 16–29 are obtained from regions with different surface conditions. As for outlier detection, the 10th task takes relatively larger or smaller values than the other tasks in Fig. 5c and is detected as an outlier with a ratio greater than 0.3 in Fig. 8, which indicates that it is a potential outlier task. The 9th task also has a detection ratio greater than 0.3, although the mean of \(\widehat{{\varvec{o}}}_{m}\) in Fig. 5c is not as clear as that of the 10th task. The estimated regression coefficients for the 9th and 10th tasks in Fig. 5a are similar to those of the other tasks within the same estimated cluster. These results suggest that the underlying structure among tasks in the landmine data resembles Case 1 rather than Case 2 in Sect. 5. Furthermore, from Fig. 8, only two of tasks 16–29 have an outlier detection ratio of more than 0.1, whereas 13 of tasks 1–15 do. This suggests that the cluster composed of tasks 1–15 has relatively large variability, which may provide some insight into sub-group structure within the cluster.

For the school data, we obtained a homogeneous pattern in Fig. 6b. The school data have been considered rather homogeneous in several studies (Bakker and Heskes 2003; Evgeniou et al. 2005). Thus, this result is reasonable and shows that MTLRRC can consistently estimate the cluster structure even when the true number of underlying clusters is one. From Fig. 9, the 2nd, 28th, 66th, 76th, and 83rd tasks were detected with a ratio greater than 0.3. In addition, these tasks show larger values of \(\widehat{{\varvec{o}}}_{m}\) than the other tasks in Fig. 6c. As for the regression coefficients in Fig. 6a, although these tasks share many characteristics with the other tasks, some coefficients behave differently. For example, the coefficients of the 16th feature for the 76th and 83rd tasks take relatively large and small values, respectively, which differ substantially from those of the other tasks.

For the microarray data, the analysis revealed heterogeneous patterns as shown in Figs. 7a, b, without exhibiting distinctive features. Figure 10 indicates that none of the tasks were clearly identified as outliers, suggesting the absence of outlier tasks in the microarray dataset. This finding presents an interesting contrast to the school data and landmine data, where the existence of outlier tasks was indicated.

7 Conclusion

In this paper, we proposed a robust multi-task learning method called Multi-Task Learning with Robust Regularized Clustering (MTLRRC). To perform the clustering of tasks and the detection of outlier tasks simultaneously, we incorporated regularization terms based on robust regularized clustering (RRC), which can detect outlier samples by selecting outlier parameters through group-sparse penalties. We showed that the solution of the RRC obtained by the BCD algorithm shares the first-order condition with convex clustering whose loss function is replaced by a multivariate robust loss function. Thus, MTLRRC is expected to perform robust clustering of tasks. Furthermore, the solution of MTLRRC obtained by the BCD algorithm can also be viewed as a convergence point of an alternating optimization that involves the RRC for tasks and regression problems in which the shrinkage of outlier tasks toward the estimated centroids is reduced. To mitigate the computational cost, we developed an estimation algorithm based on the modified ADMM.

We conducted simulation studies to evaluate the performance of MTLRRC under two scenarios of outlier task structures. In Case 1, outlier tasks share the same characteristics as the centroid but have additional outlier parameters. In Case 2, outlier tasks do not share any common structure and are independent of the other tasks. In both cases, the proposed method with non-convex group penalties exhibited a near-zero FPR in outlier detection and a much larger TPR in Case 2. However, the improvement in estimation and prediction accuracy was slight when the proportion of outlier tasks was small.

In the application to real data, we observed that the proposed method effectively estimates multiple cluster structures and identifies potential outlier tasks resembling Case 1. These findings suggest that MTLRRC not only estimates clusters but also provides insights into the heterogeneity of outlier tasks within clusters. In usual convex clustering for observed, fixed data, one can examine the behavior of the clusters through the solution path. In contrast, our clustering targets are regression coefficients, whose values also depend on the other regularization parameters such as \(\lambda _{1}\), so it is difficult to examine the solution path. However, outlier task detection allows us to obtain information about the behavior of the clusters at the selected regularization parameters. Furthermore, the circumstances corresponding to the detected tasks can be re-examined, because their datasets are likely to have unique characteristics compared with the other datasets.

One limitation of our study is that MTLRRC includes three or four regularization parameters to be determined, and the computational cost of searching for their optimal values can be demanding. In addition, the definition of the multivariate M-estimator connected to the group penalties differs from the traditional one (Maronna 1976): the former is defined as a straightforward extension of the univariate M-estimator with a robust multivariate loss function, whereas the traditional one is defined as the solution of weighted log-likelihood equations. Although there may be some relationship between these two definitions, they are probably not equivalent. Moreover, outlier tasks resembling Case 2 were not detected in the analyzed real data. We leave these topics as future work.