1 Introduction

Multi-task learning (MTL) (Caruana 1997) is a methodology in which multiple related tasks are learned simultaneously so that each task can use information from the other tasks. In the field of statistics, MTL is formulated as an estimation problem for multiple statistical models corresponding to multiple datasets with prior information about relationships among the models. Because related tasks in real problems tend to share some common information, MTL is expected to achieve better estimation and prediction accuracy than independently estimated models (Zhang and Yang 2021). The key part of MTL is the assumption about how tasks are related to each other and how the common information is transferred across models. When tasks are sufficiently related as assumed, MTL can leverage the shared information to improve the efficiency of estimation and the generalization performance. However, when tasks are unrelated or only weakly related, forcing information sharing among them can be detrimental.

A reasonable assumption to address this issue is that tasks are classified into some groups sharing common characteristics. MTL methods based on this assumption cluster the parameters of the models. For instance, Argyriou et al. (2007) introduced the k-means algorithm for task clustering, while several other studies (Zhong and Kwok 2012; Yamada et al. 2017; He et al. 2019; Dondelinger et al. 2020; Okazaki and Kawano 2024) utilized fused regularization techniques such as the fused lasso (Tibshirani et al. 2005) and the network lasso (Hallac et al. 2015). However, these methods have not considered the existence of outlier tasks that have large task-specific uniqueness or do not share any common characteristics with other tasks. The clustering techniques employed in MTL methods intend to classify all tasks into some clusters and estimate each model to be close to the center of its cluster. Thus, the existence of outlier tasks would deteriorate the performance of clustering and lead to a false interpretation of the estimated clusters. On the other hand, some robust MTL methods (Chen et al. 2011; Gong et al. 2012) have been proposed to detect outlier tasks and reduce their influence on the common structure. However, these methods also have certain limitations. First, they typically classify tasks into only one shared structure and other outlier tasks. If multiple common characteristics exist among tasks, it would be difficult to extract the true task structures. Second, these methods often rely on the group lasso regularization (Yuan and Lin 2006) for the outlier parameters to identify outlier tasks, which restricts the magnitude of the outlier parameters. Limiting the magnitude of the outlier parameters may ignore the nature of outlier tasks.

To address these limitations of existing MTL methods, we propose a novel robust MTL method called Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC aims to simultaneously cluster tasks and detect outlier tasks by integrating loss functions for the tasks with robust regularization terms. The regularization term is inspired by robust convex clustering (Quan and Chen 2020), which we further extend to incorporate non-convex and group-sparse penalties. This extension allows MTLRRC to effectively identify outlier tasks and achieve robust task clustering. We further establish a connection between the extended robust clustering and the multivariate M-estimator. This connection provides an intuitive interpretation of the robustness of MTLRRC against outlier tasks. The parameters included in MTLRRC are estimated by a modified version of the alternating direction method of multipliers (ADMM; Boyd et al. 2011). Theoretical and empirical analyses of the convergence are provided.

The remainder of this paper is structured as follows. In Sect. 2, we first illustrate the problem set-up of the MTL method. After that, we review MTL methods based on convex clustering and introduce robust convex clustering along with its non-convex and group-sparse extensions. We also explore the connection between the extended robust clustering problem and the multivariate M-estimator. In Sect. 3, we present the MTLRRC and discuss its interpretation concerning the multivariate M-estimator. In Sect. 4, the modified ADMM algorithm for estimating the parameters of MTLRRC is established. Simulation studies are presented in Sect. 5 to demonstrate the effectiveness of MTLRRC and the application to real-world datasets is illustrated in Sect. 6. The conclusions are given in Sect. 7.

2 Motivation and methodology

2.1 Problem set-up

Suppose that we have T datasets. For each dataset \(m \ (m=1,\ldots ,T)\), we observed \(n_{m}\) pairs of data points \(\{({\varvec{x}}_{mi},y_{mi}); i=1,\ldots ,n_m\}\), where \({\varvec{x}}_{mi}\) is a p-dimensional vector of explanatory variables and \(y_{mi}\) is the corresponding response variable following a distribution in the exponential family with mean \(\mu _{mi}={\mathbb {E}}[y_{mi}|{\varvec{x}}_{mi}]\). Let \(X_{m}=({\varvec{x}}_{m1},\ldots , {\varvec{x}}_{mn_{m}})^{\top }\in {\mathbb {R}}^{n_{m}\times p}\) be the design matrix, \({\varvec{y}}_{m}=(y_{m1},\ldots ,y_{mn_{m}})^{\top }\in {\mathbb {R}}^{n_{m}}\) be the response vector for dataset m, and \(n=\sum _{m=1}^{T}n_{m}\) be the total number of samples. We assume that each feature vector \({\varvec{x}}_{mi}\) is standardized to have zero mean and unit variance and that the response vector \({\varvec{y}}_{m}\) is centered to have zero mean when it is continuous.

Our goal is to estimate T generalized linear models (GLMs) simultaneously, which take the form:

$$\begin{aligned} & \eta _{mi}=g(\mu _{mi}) = w_{m0} + {\varvec{x}}_{mi}^{\top }{\varvec{w}}_{m}, \nonumber \\ & i=1,\ldots ,n_{m},\;m=1,\ldots ,T, \end{aligned}$$
(1)

where \(w_{m0}\) is an intercept for m-th task, \({\varvec{w}}_{m}=(w_{m1}, \ldots ,w_{mp})^\top \) is a p-dimensional regression coefficient vector for m-th task, \(\eta _{mi}\) is a linear predictor, and \(g(\cdot )\) is a canonical link function. The density function of the response given the explanatory variables is expressed as

$$\begin{aligned} & f(y_{mi} |{\varvec{x}}_{mi};\theta ({\varvec{x}}_{mi}),\phi ) \\ & \quad = \exp \left\{ \frac{y_{mi}\theta ({\varvec{x}}_{mi})-b(\theta ({\varvec{x}}_{mi}))}{a(\phi )}+c(y_{mi},\phi )\right\} , \end{aligned}$$

where \(a(\cdot ),b(\cdot )\), and \(c(\cdot )\) are known functions that vary according to the distributions, \(\phi \) is a known dispersion parameter, and \(\theta (\cdot )\) is the natural parameter, which is expressed as \(g(\mu _{mi})=\theta ({\varvec{x}}_{mi})\). In this paper, we assume that all \({\varvec{y}}_m\) follow only the same type of distribution, regardless of the tasks.

Let \(W=({\varvec{w}}_{1},\ldots ,{\varvec{w}}_{T})^{\top }\in {\mathbb {R}}^{T\times p}\) be the regression coefficient matrix, and \({\varvec{w}}_{0}=(w_{10},\ldots ,w_{T0})^{\top }\in {\mathbb {R}}^{T}\) be the intercept vector. To estimate T GLMs in (1) simultaneously, we formulate the regularization-based multi-task learning (MTL) method as

$$\begin{aligned} \min _{\begin{array}{c} {\varvec{w}}_{0}, W \end{array}} \left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})+ \lambda \Omega (W) \right\} , \end{aligned}$$
(2)

where \(L(w_{m0},{\varvec{w}}_{m})\) is a loss function derived from the negative log-likelihood of the m-th GLM task, \(\lambda \) is a regularization parameter with a non-negative value, and \(\Omega (\cdot )\) is a regularization term that encourages sharing the information among tasks. When continuous response vectors \({\varvec{y}}_{m}\in {\mathbb {R}}^{n_{m}}\) are considered, the linear regression is given by \(a(\phi )=\phi , \phi =1\), and \(b(\theta ) =\frac{\theta ^{2}}{2}\). The regression loss function is

$$\begin{aligned} L(w_{m0},{\varvec{w}}_{m}) = \frac{1}{2}\Vert {\varvec{y}}_{m} - {\varvec{X}}_{m}{\varvec{w}}_{m} \Vert _{2}^{2}. \end{aligned}$$

Note that the intercepts are excluded from the model without loss of generality. When binary response vectors \({\varvec{y}}_{m}\in \{0,1\}^{n_{m}}\) are considered, the logistic regression is given by \(a(\phi )=\phi , \phi =1\), and \(b(\theta ) =\log (1+e^{\theta })\). The loss function of logistic regression is

$$\begin{aligned} \begin{aligned} L(w_{m0},{\varvec{w}}_{m})=&- \sum _{i=1}^{n_{m}}\Big \{y_{mi}(w_{m0}+{\varvec{w}}_{m}^{\top }{\varvec{x}}_{mi})\\ &-\log \{1+{\exp {(w_{m0}+{\varvec{w}}_{m}^{\top }{\varvec{x}}_{mi})}}\} \Big \}. \end{aligned} \end{aligned}$$
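For concreteness, the two loss functions above can be written compactly in code. The following is a minimal sketch in Python/NumPy; the function and array names are illustrative, not from the paper.

```python
import numpy as np

def squared_error_loss(w, X, y):
    """Linear-regression loss: 0.5 * ||y - X w||_2^2 (intercept omitted)."""
    r = y - X @ w
    return 0.5 * np.dot(r, r)

def logistic_loss(w0, w, X, y):
    """Negative log-likelihood of logistic regression with intercept w0."""
    eta = w0 + X @ w                        # linear predictor
    # -sum_i { y_i * eta_i - log(1 + exp(eta_i)) }, computed stably
    return -np.sum(y * eta - np.logaddexp(0.0, eta))
```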

2.2 Multi-task learning via convex clustering

To classify tasks into some clusters that share common characteristics, Okazaki and Kawano (2024) proposed an MTL method called multi-task learning via convex clustering (MTLCVX) as follows:

$$\begin{aligned} & \min _{{\varvec{w}}_{0}, W,U} \left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})+\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}\Vert _{2}^{2} \right. \\ & \qquad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} , \\ \end{aligned}$$

where \({\varvec{u}}_{m}=(u_{m1},\ldots ,u_{mp})^{\top }\in {\mathbb {R}}^{p}\) is a centroid parameter for the m-th task, \(U=({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{T})^{\top }\) is a \(T\times p\) matrix, \(r_{m_{1},m_{2}}\) is a weight between the \(m_{1}\)-th and \(m_{2}\)-th tasks, \({\mathcal {E}}\) is a set of task pairs \((m_{1},m_{2})\), and \(\lambda _{1}\) and \(\lambda _{2}\) are non-negative regularization parameters. In this problem, the regularization terms are based on convex clustering (Hocking et al. 2011), which performs the clustering of tasks. The second term encourages the centroids \({\varvec{u}}_m\) to be close to their corresponding regression coefficients \({\varvec{w}}_m\). The third term induces \({\varvec{u}}_{m_{1}}\simeq {\varvec{u}}_{m_{2}}\). If \({\varvec{u}}_{m_{1}}\) and \({\varvec{u}}_{m_{2}}\) are estimated to take the same value, the \(m_{1}\)-th and \(m_{2}\)-th tasks are interpreted as belonging to the same cluster and sharing the same characteristics. Therefore, \({\varvec{u}}_{m}\) can be viewed as a cluster center.

This minimization problem is solved by the block coordinate descent (BCD) algorithm. The algorithm alternately minimizes the objective with respect to \({\varvec{w}}_{m}\) and \({\varvec{u}}_{m}\): we alternately solve a regression model with a ridge-type penalty centered at \({\varvec{u}}_{m}\) and a convex clustering problem in which each \({\varvec{w}}_{m} \;(m=1,\ldots , T)\) is regarded as a sample.

The weights \(r_{m_{1},m_{2}}\) are set by the k-nearest neighbor as follows:

$$\begin{aligned} R = \frac{S^{\top }+S}{2}, \quad (S)_{m_{1}m_{2}}= {\left\{ \begin{array}{ll} 1 & \widehat{{\varvec{w}}}^{\textrm{STL}}_{m_{1}}\;\text {is a }k\text {-nearest neighbor of}\;\widehat{{\varvec{w}}}_{m_{2}}^{\textrm{STL}}, \\ 0 & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(3)

where R is a \(T \times T\) matrix whose \((m_{1},m_{2})\) component is \((R)_{m_{1}m_{2}}= r_{m_{1},m_{2}}\) and \(\widehat{{\varvec{w}}}^{\textrm{STL}}_{m}\) is a regression coefficient vector for the m-th task estimated by single-task learning such as OLS, ridge, or lasso. This setting is also used in Yamada et al. (2017). This construction of the weights follows that of convex clustering (Hocking et al. 2011), where setting some \(r_{m_{1},m_{2}}\) to zero reduces the computational cost and improves the estimation results of clustering.
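A minimal sketch of the weight construction in Eq. (3), assuming the single-task estimates are stacked row-wise in a \(T\times p\) array `W_stl` and that nearest neighbors are measured by Euclidean distance (both assumptions for illustration):

```python
import numpy as np

def knn_weights(W_stl, k=5):
    """Build R = (S^T + S)/2, where S[m1, m2] = 1 if task m1's STL estimate
    is one of the k nearest neighbors of task m2's STL estimate."""
    T = W_stl.shape[0]
    D = np.linalg.norm(W_stl[:, None, :] - W_stl[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a task is not its own neighbor
    S = np.zeros((T, T))
    for m2 in range(T):
        nn = np.argsort(D[:, m2])[:k]  # indices of the k nearest tasks to m2
        S[nn, m2] = 1.0
    return (S.T + S) / 2.0
```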

2.3 Robust convex clustering

Suppose that we have n observed p-dimensional data \(\left\{ {\varvec{x}}_{i};i=1,\ldots ,n\right\} \). To classify these data into distinct clusters, convex clustering (Pelckmans et al. 2005; Hocking et al. 2011; Lindsten et al. 2011) has been proposed.

Quan and Chen (2020) pointed out that convex clustering is sensitive to just a few outliers, and they proposed robust convex clustering (RCC) as follows:

$$\begin{aligned} \begin{aligned}&\min _{U,O} \left\{ \sum _{i=1}^{n} \frac{1}{2}\Vert {\varvec{x}}_{i} - {\varvec{u}}_{i} -{\varvec{o}}_{i}\Vert _{2}^{2}\right. \\ &\quad \left. +\lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}} -{\varvec{u}}_{i_{2}} \Vert _{2} +\lambda _{2}\sum _{i=1}^{n}\Vert {\varvec{o}}_{i}\Vert _{1} \right\} , \end{aligned} \end{aligned}$$
(4)

where \({\varvec{u}}_{i}=(u_{i1},\ldots ,u_{ip})^{\top }\in {\mathbb {R}}^{p}\) is a vector of centroid parameters for the i-th sample, \({\varvec{o}}_{i}=(o_{i1},\ldots ,o_{ip})^{\top }\in {\mathbb {R}}^{p}\) is a vector of outlier parameters for the i-th sample, and \(U=({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{n})^{\top }\) and \(O=({\varvec{o}}_{1},\ldots ,{\varvec{o}}_{n})^{\top }\) are \(n\times p\) matrices. The third term selects the outlier parameters by shrinking each element of \({\varvec{o}}_{i}\) toward exactly zero. If \(o_{ij}\) is estimated to be a non-zero value, the j-th feature of the i-th sample is considered an outlier.

From the relationship between the least squares loss function and the \(L_{1}\) penalty for the outlier parameters (Antoniadis 2007; Gannaz 2007), the minimization problem (4) is equivalent to the following minimization problem:

$$\begin{aligned} & \min _{U} \left\{ \sum _{i=1}^{n}\sum _{j=1}^{p} h_{\lambda _{2}}(x_{ij} - u_{ij})\right. \nonumber \\ & +\left. \lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} \right\} , \end{aligned}$$
(5)

where \(h_{\lambda }(\cdot )\) is Huber's loss function defined as

$$\begin{aligned} h_{\lambda }(z)={\left\{ \begin{array}{ll} \frac{1}{2}z^{2} & |z| \le \lambda , \\ \lambda |z|-\frac{\lambda ^{2}}{2} & |z|> \lambda . \end{array}\right. } \end{aligned}$$

This would indicate that RCC is a robust version of convex clustering derived from replacing the loss function with the component-wise robust loss function.
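The equivalence between (4) and (5) can be checked numerically in one dimension: minimizing \(\frac{1}{2}(z-o)^{2}+\lambda |o|\) over \(o\) yields Huber's loss evaluated at \(z\). A small sketch (values are illustrative):

```python
import numpy as np

def huber(z, lam):
    """Huber's loss h_lambda(z)."""
    return np.where(np.abs(z) <= lam, 0.5 * z**2, lam * np.abs(z) - 0.5 * lam**2)

def soft(z, lam):
    """Soft-thresholding: argmin_o 0.5*(z - o)^2 + lam*|o|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam = 1.5
z = np.linspace(-5, 5, 101)
o = soft(z, lam)                                    # optimal outlier parameter
profiled = 0.5 * (z - o) ** 2 + lam * np.abs(o)     # objective minimized over o
assert np.allclose(profiled, huber(z, lam))         # matches h_lambda(z)
```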

2.4 Non-convex extensions of robust convex clustering

Although Quan and Chen (2020) only considered the \(\ell _{1}\) penalty for selecting outlier parameters, other types of shrinkage penalties, such as non-convex penalties and group penalties, can also be considered. For example, if the group lasso (Yuan and Lin 2006) is employed for the third term in (4), we can detect sample-wise outliers. Since our purpose in this paper is to detect task-wise outliers, we first consider the generalized robust clustering problem for detecting sample-wise outliers as follows:

$$\begin{aligned} & \min _{U,O} \left\{ \sum _{i=1}^{n} \frac{1}{2}\Vert {\varvec{x}}_{i} - {\varvec{u}}_{i} -{\varvec{o}}_{i}\Vert _{2}^{2}\right. \nonumber \\ & \left. +\lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}}r_{i_{1},i_{2}}\Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} + \sum _{i=1}^{n}P({\varvec{o}}_{i};\lambda _{2},\gamma ) \right\} , \end{aligned}$$
(6)

where \(P(\cdot ;\lambda ,\gamma )\) is a penalty function that induces group sparsity and \(\gamma \) is a tuning parameter that adapts the shape of the penalty function. By estimating \({\varvec{o}}_{i}\) as a zero vector through group penalties, this problem aims to detect sample-wise outliers. For the i-th sample \({\varvec{x}}_{i}\), even if one feature \(x_{ij}\) takes an extreme value compared with its cluster center \({\widehat{u}}_{ij}\), \(\widehat{{\varvec{o}}}_{i}\) would not necessarily be a non-zero vector. Only when the \(L_{2}\)-distance \(\Vert {\varvec{x}}_{i}-\widehat{{\varvec{u}}}_{i}\Vert _{2}\) takes an extreme value is \(\widehat{{\varvec{o}}}_{i}\) estimated to be a non-zero vector and \({\varvec{x}}_{i}\) interpreted as an outlier sample.

If a non-convex penalty such as the group SCAD or group MCP (Huang et al. 2012) is employed for the third term, the minimization problem is no longer a convex optimization problem. Therefore, we refer to the minimization problem as Robust Regularized Clustering (RRC).

Algorithm 1
Block coordinate descent algorithm for Problem (6)

The minimization problem (6) is solved by the BCD algorithm shown in Algorithm 1. Here, \(X= ({\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n})^{\top }\in {\mathbb {R}}^{n \times p}\), \(L(U, O)\) is the objective function in (6), and the superscripts in parentheses indicate the number of updates. The problem (7) can be solved by algorithms for convex clustering. Since the problem (8) is separable in terms of \({\varvec{o}}_{i}\), it is expressed as

$$\begin{aligned} {\varvec{o}}_{i}^{(t+1)}= & \mathop {\mathrm{arg~min}}\limits _{{\varvec{o}}_{i}} \left\{ \frac{1}{2}\Vert {\varvec{x}}_{i}-{\varvec{u}}^{(t+1)}_{i}-{\varvec{o}}_{i} \Vert _{2}^{2} + P({\varvec{o}}_{i};\lambda _{2},\gamma )\right\} ,\nonumber \\ & i=1,\ldots ,n. \end{aligned}$$
(9)

Therefore, the update can be obtained by

$$\begin{aligned} {\varvec{o}}_{i}^{(t+1)} = {\varvec{\Theta }}({\varvec{x}}_{i}-{\varvec{u}}^{(t+1)}_{i};\lambda _{2},\gamma ), \quad i=1,\ldots ,n, \end{aligned}$$

where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is a group-thresholding function defined for the corresponding penalty function \(P(\cdot ;\lambda ,\gamma )\). For example, \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) for the group lasso is given by

$$\begin{aligned} {\varvec{\Theta }}^{\textrm{glasso}}({\varvec{z}};\lambda ,\gamma )= S({\varvec{z}};\lambda ), \end{aligned}$$

where \(S(\cdot ;\lambda )\) is a group soft-thresholding function defined as

$$\begin{aligned} S({\varvec{z}};\lambda ) = \max \left( 0,1-\frac{\lambda }{\Vert {\varvec{z}}\Vert _{2}}\right) {\varvec{z}}. \end{aligned}$$
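A sketch of the group soft-thresholding operator \(S({\varvec{z}};\lambda )\), with the zero-vector case handled explicitly; the final comment illustrates how it would be used in the update (9) for the group lasso penalty.

```python
import numpy as np

def group_soft_threshold(z, lam):
    """S(z; lam) = max(0, 1 - lam/||z||_2) * z, with S(0; lam) = 0."""
    norm = np.linalg.norm(z)
    if norm <= lam:                  # includes z = 0
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

# Update (9) with the group lasso penalty (schematic):
# o_i_new = group_soft_threshold(x_i - u_i_new, lam2)
```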

The solution of (6) obtained by Algorithm 1 is related to M-estimators, in a manner similar to the connection between the minimization problems (4) and (5). Let \(A_{r}\) be a \(|{\mathcal {E}}|\times n\) matrix whose rows are \(r_{i_{1},i_{2}}{\varvec{a}}_{(i_{1},i_{2})}^{\top }\) for \((i_{1},i_{2})\in {\mathcal {E}}\), and we set

$$\begin{aligned} D = A_{r} \otimes I_{p}, \end{aligned}$$

where \(\otimes \) denotes the Kronecker product, \(I_{p}\) is a \(p\times p\) identity matrix, and \({\varvec{a}}_{(i_{1},i_{2})}\) is defined as

$$\begin{aligned} ({\varvec{a}}_{(i_{1},i_{2})})_{i}= {\left\{ \begin{array}{ll} 1 \quad & i=i_{1},\\ -1 \quad & i=i_{2},\\ 0 \quad & \textrm{otherwise}, \end{array}\right. } i=1,\ldots ,n. \end{aligned}$$
(10)

We also define the mixed (2, 1)-norm (Lounici et al. 2011) for a \(|{\mathcal {E}}|p\)-dimensional vector \({\varvec{z}}\) as

$$\begin{aligned} \Vert {\varvec{z}} \Vert _{2,1} = \sum _{k=1}^{|{\mathcal {E}}|}\left( \sum _{j= (k-1)p + 1}^{kp}z_{j}^{2}\right) ^{1/2}. \end{aligned}$$

Using these definitions, the second term in (6) can be written as

$$\begin{aligned} \sum _{(i_{1},i_{2})\in {\mathcal {E}}} r_{i_{1},i_{2}} \Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2} = \Vert D\textrm{vec}(U) \Vert _{2,1}. \end{aligned}$$
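This identity can be verified numerically on a small example, assuming that \(\textrm{vec}(U)\) stacks the rows \({\varvec{u}}_{1},\ldots ,{\varvec{u}}_{n}\) (row-major flattening); the edge set, weights, and array names below are illustrative.

```python
import numpy as np

n, p = 4, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n, p))
edges = [(0, 1), (1, 2), (2, 3)]                 # the set E (0-indexed)
r = {e: 1.0 for e in edges}                      # weights r_{i1,i2}

# A_r: |E| x n matrix; row for (i1, i2) is r_{i1,i2} * a_{(i1,i2)}^T
A_r = np.zeros((len(edges), n))
for k, (i1, i2) in enumerate(edges):
    A_r[k, i1], A_r[k, i2] = r[(i1, i2)], -r[(i1, i2)]

D = np.kron(A_r, np.eye(p))                      # D = A_r (Kronecker) I_p
vecU = U.reshape(-1)                             # row-major vec(U)

lhs = sum(r[e] * np.linalg.norm(U[e[0]] - U[e[1]]) for e in edges)
rhs = sum(np.linalg.norm((D @ vecU)[k * p:(k + 1) * p]) for k in range(len(edges)))
assert np.isclose(lhs, rhs)                      # mixed (2,1)-norm identity holds
```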

Based on the above definitions, we summarize the relationship between the solution of the RRC and M-estimator in the following proposition.

Proposition 1

Suppose that \({\widehat{U}}\) is a convergence point in Algorithm 1 and \({\varvec{\psi }}({\varvec{o}};\lambda ,\gamma )={\varvec{o}} - {\varvec{\Theta }}({\varvec{o}};\lambda ,\gamma )\). Then, the \({\widehat{U}}\) satisfies

$$\begin{aligned} -\Psi (X-{\widehat{U}};\lambda _{2},\gamma )+\lambda _{1} \partial _{\textrm{vec}(U)}(\Vert D\textrm{vec}(U) \Vert _{2,1})|_{U={\widehat{U}}} \ni {\varvec{0}},\nonumber \!\!\!\!\!\\ \end{aligned}$$
(11)

where \(\partial _{\textrm{vec}(U)}\) is the subdifferential with respect to \(\textrm{vec}(U)\), and \(\Psi (X-{\widehat{U}};\lambda _{2},\gamma )\) is an np-dimensional vector defined as

$$\begin{aligned} \Psi (X-{\widehat{U}};\lambda _{2},\gamma ) = \begin{pmatrix} {\varvec{\psi }} ({\varvec{x}}_{1}-{\varvec{{\widehat{u}}}}_{1};\lambda _{2},\gamma ) \\ {\varvec{\psi }} ({\varvec{x}}_{2}-{\varvec{{\widehat{u}}}}_{2};\lambda _{2},\gamma ) \\ \vdots \\ {\varvec{\psi }} ({\varvec{x}}_{n}-{\varvec{{\widehat{u}}}}_{n};\lambda _{2},\gamma ) \end{pmatrix}. \end{aligned}$$

The proof is given in Appendix B.1. This proposition gives the relationship between the \({\widehat{U}}\) calculated by Algorithm 1 and the following minimization problem:

$$\begin{aligned} \min _{U} \left\{ \sum _{i=1}^{n} \rho _{\lambda _{2},\gamma }({\varvec{x}}_{i}-{\varvec{u}}_{i})+ \lambda _{1}\sum _{(i_{1},i_{2})\in {\mathcal {E}}} r_{i_{1},i_{2}} \Vert {\varvec{u}}_{i_{1}}-{\varvec{u}}_{i_{2}} \Vert _{2}\right\} ,\nonumber \!\!\!\!\\ \end{aligned}$$
(12)

where \(\rho _{\lambda ,\gamma }(\cdot )\) is a multivariate loss function that satisfies

$$\begin{aligned} \frac{\partial }{\partial {\varvec{z}}} \rho _{\lambda ,\gamma }({\varvec{z}}) = {\varvec{\psi }}({\varvec{z}};\lambda ,\gamma ). \end{aligned}$$

If we choose a regularization term \(P({\varvec{z}};\lambda ,\gamma )\) so that \(\rho _{\lambda ,\gamma }({\varvec{z}})\) is a weakly convex function (see e.g. Davis et al. 2018), the sum rule of the subdifferential (Ngai et al. 2000; Corollary 3.9) can be applied to the objective function of Problem (12). Then, the optimality condition regarding the stationary points coincides with the inclusion relationship (11), which means that any output \({\widehat{U}}\) is one of the stationary points of Problem (12). The loss functions induced from the group SCAD and group MCP are weakly convex, while the skipped mean loss induced from group hard thresholding is not weakly convex. The proofs of the weak convexity of these functions are given in Appendix B.4.

The sketch of some multivariate loss functions and corresponding group-thresholding functions is shown in Fig. 1. Those formulations are summarized in Appendix A.

Proposition 1 is inspired by similar propositions in She and Owen (2011) and Katayama and Fujisawa (2017). While they considered the linear regression case and only the situation where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is defined as a component-wise thresholding function, we consider the clustering problem and the situation where \({\varvec{\Theta }}(\cdot ;\lambda ,\gamma )\) is a group-thresholding function. This enables us to solve clustering problems with multivariate robust loss functions by instead optimizing problems with group penalties on the outlier parameters.
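As an illustration, the following is a sketch of one commonly used form of the group MCP thresholding operator and the induced \({\varvec{\psi }}\) function; the exact operators adopted in the paper are summarized in Appendix A, so this particular firm-thresholding form is an assumption here.

```python
import numpy as np

def group_soft(z, lam):
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

def group_mcp_threshold(z, lam, gamma):
    """Firm (group MCP) thresholding, assuming gamma > 1:
    (gamma/(gamma-1)) * group_soft(z, lam) if ||z||_2 <= gamma*lam, else z."""
    if np.linalg.norm(z) <= gamma * lam:
        return gamma / (gamma - 1.0) * group_soft(z, lam)
    return z

def psi(z, lam, gamma):
    """psi(z) = z - Theta(z): the 'score' of the induced multivariate loss."""
    return z - group_mcp_threshold(z, lam, gamma)
```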

Fig. 1
The multivariate loss functions (top row) and the group-thresholding functions (bottom row). The x-axis and y-axis represent the values of the input \({\varvec{z}}\in {\mathbb {R}}^{2}\). The z-axis shows the output \(\rho _{\lambda ,\gamma }({\varvec{z}})\) in the top row, and the first component of \({\varvec{\Theta }}({\varvec{z}};\lambda ,\gamma )\) in the bottom row. The values of \(\lambda \) and \(\gamma \) are fixed at three

3 Proposed method

3.1 Multi-task learning via robust regularized clustering

Almost all existing MTL methods based on clustering have not considered the presence of outlier tasks. For instance, MTLCVX shrinks the differences of centroids including those of outlier tasks, which contaminates the estimation of the centroids. As a result, the estimation of the regression coefficients for tasks corresponding to contaminated centroids also worsens. This motivates us to separate the estimation of the parameters concerning task clusters from the outlier tasks. To the best of our knowledge, only Yao et al. (2019) consider the presence of outlier tasks and the clustering of tasks simultaneously. However, how their method achieves robustness is not clearly described. On the other hand, some robust MTL methods (Chen et al. 2011; Gong et al. 2012) have attempted to address the issue of outlier tasks by introducing outlier parameters and selecting them using the group lasso regularization. However, the group lasso (Yuan and Lin 2006) limits the magnitude of the outlier parameters, which may not adequately represent their nature. To overcome these problems, we propose Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC is formulated as follows:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{w}_{0},W,U,O} \left\{ \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) +\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}-{\varvec{o}}_{m}\Vert _{2}^{2} \right. \\ &\qquad \quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2}\right. \\ &\qquad \quad \left. + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \right\} , \end{aligned} \end{aligned}$$
(13)

where \({\varvec{o}}_{m}=(o_{m1},\ldots ,o_{mp})^{\top }\in {\mathbb {R}}^{p}\) is a vector of outlier parameters for the m-th task and \(\lambda _{3}\) is a non-negative regularization parameter. The second through fourth terms are based on the minimization problem (6). Then, if \({\varvec{o}}_{m}\) is estimated to be a non-zero vector, the m-th task is considered to be an outlier task that does not share a common structure with the other tasks. We set the weights \(r_{m_{1},m_{2}}\) based on Eq. (3).

In this problem, the centroid parameters \({\varvec{u}}_{m}\), whose differences are shrunk, and the outlier parameters \({\varvec{o}}_{m}\), which are estimated to be zero vectors or not, are penalized separately. Thus, the cluster center component and the potential outlier component included in a task are estimated separately. Furthermore, if we employ non-convex regularization terms such as the group SCAD, we can allow \({\varvec{o}}_{m}\) to take a large value, which also leads to a large difference between \({\varvec{w}}_{m}\) and \({\varvec{u}}_{m}\). Then, the contamination from outlier tasks with significant unique characteristics is expected to be reduced. Next, we introduce another interpretation of the proposed method.

3.2 Interpretation through the BCD algorithm

3.2.1 Convex case

First, we consider the case where the group lasso is employed for \(P(\cdot ;\lambda ,\gamma )\) in (13). Then, since MTLRRC is a convex optimization problem, we can obtain another representation of (13) by minimizing with respect to O as follows:

$$\begin{aligned} & \min _{{\varvec{w}}_{0},W,U }\left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. +\lambda _{1}\sum _{m=1}^{T}h_{\lambda _{3}/\lambda _{1}}^{\textrm{M}}({\varvec{w}}_{m}-{\varvec{u}}_{m}) \nonumber \\ & \qquad \quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} , \end{aligned}$$
(14)

where \(h_{\lambda }^{\textrm{M}}(\cdot )\) is a multivariate Huber’s loss function (Hampel et al. 1986) defined as

$$\begin{aligned} h_{\lambda }^{\textrm{M}}({\varvec{z}})= {\left\{ \begin{array}{ll} \frac{1}{2}\Vert {\varvec{z}}\Vert _{2}^{2} & \Vert {\varvec{z}}\Vert _{2}\le \lambda ,\\ \lambda \Vert {\varvec{z}}\Vert _{2}-\frac{\lambda ^{2}}{2} & \Vert {\varvec{z}}\Vert _{2}> \lambda . \end{array}\right. } \end{aligned}$$

This representation helps us interpret the proposed method (13).
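The profiling out of \({\varvec{o}}_{m}\) that leads from (13) to (14) can be checked numerically: for a fixed \({\varvec{z}}={\varvec{w}}_{m}-{\varvec{u}}_{m}\), minimizing \(\frac{\lambda _{1}}{2}\Vert {\varvec{z}}-{\varvec{o}}\Vert _{2}^{2}+\lambda _{3}\Vert {\varvec{o}}\Vert _{2}\) over \({\varvec{o}}\) gives \(\lambda _{1}h^{\textrm{M}}_{\lambda _{3}/\lambda _{1}}({\varvec{z}})\). A small sketch with illustrative values:

```python
import numpy as np

def group_soft(z, lam):
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

def multivariate_huber(z, lam):
    norm = np.linalg.norm(z)
    return 0.5 * norm**2 if norm <= lam else lam * norm - 0.5 * lam**2

lam1, lam3 = 2.0, 1.0
rng = np.random.default_rng(1)
for _ in range(5):
    z = rng.normal(size=4)
    o = group_soft(z, lam3 / lam1)               # minimizer over o
    profiled = 0.5 * lam1 * np.linalg.norm(z - o) ** 2 + lam3 * np.linalg.norm(o)
    assert np.isclose(profiled, lam1 * multivariate_huber(z, lam3 / lam1))
```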

Algorithm 2
Block coordinate descent algorithm for Problem (14)

Let the objective function of the minimization problem (14) be \(L^{\textrm{MH}}({\varvec{w}}_{0},W,U)\). We consider solving the minimization problem (14) by Algorithm 2 based on the BCD algorithm. Since the minimization in terms of \({\varvec{w}}_{m}\) is separable, the update (15) is expressed as

$$\begin{aligned} \begin{aligned}&(w_{m0}^{(t+1)},{\varvec{w}}_{m}^{(t+1)})\\ &\quad = \mathop {\mathrm {arg~min}}\limits _{w_{m0},{\varvec{w}}_{m}}\left\{ \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) + \lambda _{1} h_{\lambda _{3}/\lambda _{1}}^{\text {M}}({\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}) \right\} ,\\ &\qquad m=1,\ldots ,T. \end{aligned} \end{aligned}$$

These updates estimate the regression coefficients for the m-th task to be close to the corresponding centroid. However, when \(\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}\Vert _{2}\) tends to take a value larger than \(\lambda _{3}/\lambda _{1}\), the shrinkage toward the centroid is reduced by the \(\ell _{2}\)-norm part of the multivariate Huber's function. Thus, if the m-th task is an outlier task, the estimated \(\widehat{{\varvec{w}}}_{m}\) is expected to be less affected by the common structure \(\widehat{{\varvec{u}}}_{m}\).

The minimization problem in terms of the update (16) falls within the framework of the minimization problem (12). This can be seen by replacing \(({\varvec{x}}_{i};i=1,\ldots ,n)\) in (12) with \(({\varvec{w}}_{m}^{(t+1)};m=1,\ldots ,T)\) and \(\rho _{\lambda _{2},\gamma }(\cdot )\) with \(h_{\lambda _{3}/\lambda _{1}}^{\text {M}}(\cdot )\). Consequently, the update of \(U^{(t+1)}\) is performed under the robust clustering of tasks. Based on the discussion of Algorithm 2, the estimated values \({\widehat{W}}\) and \({\widehat{U}}\) in MTLRRC can be regarded as a convergence point of an alternating estimation, which consists of a regression step that reduces the shrinkage of outlier tasks toward their cluster centers and a robust clustering step for the tasks. Therefore, MTLRRC is expected to be robust against outlier tasks.

3.2.2 Non-convex case

Next, we consider the case where non-convex group penalties are employed for \(P(\cdot ;\lambda ,\gamma )\) in (13). The estimates of the parameters can be calculated by Algorithm 3. Here, \(L^{\textrm{MR}}({\varvec{w}}_{0},W,U, O)\) is the objective function of the minimization problem (13). Then, the following proposition, similar to Proposition 1, holds.

Proposition 2

Let \({\varvec{w}}_{m}^{\prime }=(w_{m0},{\varvec{w}}_{m}^{\top })^{\top }\). Suppose that \(\left( \widehat{{\varvec{w}}}_{0},{\widehat{W}},\right. \) \(\left. {\widehat{U}}\right) \) is a pair of convergence point in Algorithm 3 and \({\varvec{\psi }}({\varvec{o}};\lambda ,\gamma )={\varvec{o}} - {\varvec{\Theta }}({\varvec{o}};\lambda ,\gamma )\). Then, \((\widehat{{\varvec{w}}}_{0},{\widehat{W}},{\widehat{U}})\) satisfies

$$\begin{aligned}&\frac{\partial }{\partial {\varvec{w}}_{m}^{\prime }}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})|_{{\varvec{w}}_{m}^{\prime }=\widehat{{\varvec{w}}}_{m}^{\prime }}\nonumber \\&\quad + \lambda _{1} \begin{pmatrix} 0 \\ {\varvec{\psi }}(\widehat{{\varvec{w}}}_{m}-\widehat{{\varvec{u}}}_{m};\lambda _{3}/\lambda _{1},\gamma ) \end{pmatrix} = {\varvec{0}},\quad m=1,\ldots ,T, \end{aligned}$$
(17)
$$\begin{aligned}&-\lambda _{1}\Psi ({\widehat{W}}-{\widehat{U}};\lambda _{3}/\lambda _{1},\gamma )\nonumber \\&\quad +\lambda _{2}\partial _{\textrm{vec}(U)}(\Vert D\textrm{vec}(U) \Vert _{2,1})|_{U={\widehat{U}}} \ni {\varvec{0}}. \end{aligned}$$
(18)

The proof is given in Appendix B.2. When \(P({\varvec{z}};\lambda ,\gamma )\) is chosen such that the corresponding \(\rho _{\lambda ,\gamma }({\varvec{z}})\) is a weakly convex function, Eqs. (17) and (18) are the first-order conditions for the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}_{0},W,U}&\left\{ \sum _{m=1}^{T} \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. \\ &\quad +\lambda _{1}\sum _{m=1}^{T}\rho _{\lambda _{3}/\lambda _{1},\gamma }({\varvec{w}}_{m}-{\varvec{u}}_{m}) \\ &\quad \left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}} \Vert _{2} \right\} . \end{aligned}\nonumber \\ \end{aligned}$$
(19)

Therefore, we may expect that the solution in the case of non-convex penalties has an interpretation similar to that of the minimization problem (14).

Algorithm 3
Block coordinate descent algorithm for MTLRRC

4 Estimation algorithm via modified ADMM

MTLRRC can be estimated by Algorithm 3. However, this estimation procedure is computationally expensive, because the update of \(U^{(t+1)}\) involves solving a convex clustering problem, which is itself computationally demanding. To avoid this computation, we consider estimating the parameters included in MTLRRC by the alternating direction method of multipliers (ADMM; Boyd et al. 2011).

We consider the following minimization problem equivalent to Problem (13):

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}_{0},W,U,O}&\left\{ \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) +\frac{\lambda _{1}}{2}\Vert W-U-O\Vert _{F}^{2} \right. \\ &\left. +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2}\right. \\ &\left. + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \right\} , \\ &\mathrm {s.t.} \quad {\varvec{u}}_{m_{1}}-{\varvec{u}}_{m_{2}}= {\varvec{b}}_{(m_{1},m_{2})},\quad (m_{1},m_{2})\in {\mathcal {E}}. \end{aligned} \end{aligned}$$

For this minimization problem, we consider the following augmented Lagrangian:

$$\begin{aligned} L_{\nu } ({\varvec{w}}_{0},W,U,O,B,S)= & \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m})\nonumber \\ & \quad +\frac{\lambda _{1}}{2}\Vert W-U-O\Vert _{F}^{2} \nonumber \\ & \quad +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2} \!\!\!\!\nonumber \\ & \quad + \sum _{m=1}^{T} P( {\varvec{o}}_{m};\lambda _{3},\gamma ) \nonumber \\ & \quad + \textrm{tr}(S^{\top }(B-A_{{\mathcal {E}}}U)) \nonumber \\ & \quad + \frac{\nu }{2} \Vert B-A_{{\mathcal {E}}}U\Vert _{F}^{2}, \end{aligned}$$
(20)

where \(\textrm{tr}(\cdot )\) denotes the trace operator, \(A_{{\mathcal {E}}}\) is a \(|{\mathcal {E}}|\times T\) matrix whose each row is \({\varvec{a}}_{m_{1},m_{2}}^{\top }\) defined with the same manner of Eq. (10), S is a \(|{\mathcal {E}}|\times p\) Lagrangian multipliers matrix, and \(\nu \) is a tuning parameter with non-negative value.

For this augmented Lagrangian, we consider the following updates of the modified ADMM:

$$\begin{aligned} \begin{aligned}&({\varvec{w}}_{0}^{(t+1)},W^{(t+1)}) = \mathop {\mathrm{arg~min}}\limits _{{\varvec{w}}_{0},W} L_{\nu } ({\varvec{w}}_{0},W,U^{(t)},O^{(t)},B,S^{(t)}), \\&U^{(t+1)} = \mathop {\mathrm{arg~min}}\limits _{U} \left( \min _{B}L_{\nu }({\varvec{w}}_{0}^{(t+1)},W^{(t+1)},U,O^{(t)},B,S^{(t)})\right) , \\&O^{(t+1)} = \mathop {\mathrm{arg~min}}\limits _{O} L_{\nu } ({\varvec{w}}_{0}^{(t+1)},W^{(t+1)},U^{(t+1)},O,B,S^{(t)}), \\&S^{(t+1)}_{(m_{1},m_{2})} = \textrm{prox}((S^{(t)}\\&\qquad \qquad \quad +\nu A_{{\mathcal {E}}} U^{(t+1)})_{(m_{1},m_{2})}, \lambda _{2}r_{m_{1},m_{2}}),\quad (m_{1},m_{2})\in {\mathcal {E}}, \\ \end{aligned} \end{aligned}$$

where \(\textrm{prox}(\cdot ,\lambda )\) is defined as

$$\begin{aligned} \textrm{prox}({\varvec{z}},\lambda )=\textrm{min}(\Vert {\varvec{z}}\Vert _{2},\lambda )\frac{{\varvec{z}}}{\Vert {\varvec{z}}\Vert _{2}}. \end{aligned}$$
(21)

The minimization with respect to U and B is performed jointly. The minimization with respect to B can be written explicitly, and we only need to solve the minimization with respect to U using the gradient method. These modifications of the updates of U, B, and S are based on the idea of Shimmura and Suzuki (2022). Their details and derivations are provided in Appendix C.
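A sketch of the operator in (21), which projects a vector onto the \(\ell _{2}\)-ball of radius \(\lambda \) (the zero vector is handled explicitly); the schematic comment indicates how it enters the S update, while the full derivation of the U, B, and S updates is in Appendix C.

```python
import numpy as np

def prox_ball(z, lam):
    """prox(z, lam) = min(||z||_2, lam) * z / ||z||_2 (projection onto the lam-ball)."""
    norm = np.linalg.norm(z)
    if norm <= lam:              # includes z = 0
        return z.copy()
    return lam * z / norm

# Dual update for each edge (m1, m2) in E (schematic):
# row of S_new for (m1, m2) = prox_ball(row of (S_old + nu * A_E @ U_new), lam2 * r[m1, m2])
```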

The update for \({\varvec{w}}_{0}\) and W is given by solving the independent regularized GLMs for each task, which are expressed as

$$\begin{aligned} \begin{aligned}&(w_{m0}^{(t+1)},{\varvec{w}}_{m}^{(t+1)}) = \mathop {\mathrm {arg~min}}\limits _{w_{m0},{\varvec{w}}_{m}}\left\{ \frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \right. \\ &\quad \left. + \frac{\lambda _{1}}{2}\Vert {\varvec{w}}_{m}-{\varvec{u}}_{m}^{(t)}-{\varvec{o}}_{m}^{(t)} \Vert _{2}^{2} \right\} ,\;m=1,\ldots ,T. \end{aligned} \end{aligned}$$

These minimization problems are solved by the Newton–Raphson method provided in Algorithm 4. The update of O is given in the same manner as (9). As a result, we obtain the estimation algorithm for MTLRRC as Algorithm 5. Here, \(\textrm{STL}(\cdot ,\cdot )\) is a function returning a regression coefficient vector estimated by an arbitrary single-task learning method.
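For the linear regression loss (with the intercept omitted, as noted in Sect. 2.1), this update has a closed form, which can serve as a reference for the special case; the general GLM case requires the Newton–Raphson routine of Algorithm 4. The following is a minimal sketch under this assumption, with an illustrative function name.

```python
import numpy as np

def update_w_linear(X, y, u, o, lam1):
    """argmin_w (1/n_m) * 0.5 * ||y - X w||^2 + (lam1/2) * ||w - u - o||^2."""
    n, p = X.shape
    A = X.T @ X / n + lam1 * np.eye(p)
    b = X.T @ y / n + lam1 * (u + o)
    return np.linalg.solve(A, b)
```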

Although the convergence of ADMM for non-convex functions has been shown in some studies (e.g. Wang et al. 2019 and Fan and Yin 2024), the convergence of the proposed method is non-trivial because of the modification of the ADMM algorithm based on Shimmura and Suzuki (2022). We provide a theoretical guarantee regarding convergence to a limit point, which is summarized as the following theorem.

Theorem 1

Let \(\lambda _{++}(\cdot )\) be the strictly positive minimum eigenvalue of a matrix and assume that \(L(\cdot ,\cdot )\) is a convex function and \(P(\cdot ;\lambda ,\gamma )\) is a weakly convex function with modulus \(C_{p}\). If \(\lambda _{1}>C_{p}\) and \(\nu >\frac{2\lambda _{1}^{2}}{\lambda _{++}(A_{{\mathcal {E}}}^{\top }A_{{\mathcal {E}}})(\lambda _{1}-C_{p})}\) are satisfied, the sequence \(\{{\varvec{w}}_{0}^{(t)},W^{(t)}, U^{(t)},O^{(t)},B^{(t)},S^{(t)}\}\) generated by Algorithm 5 converges to a limit point \(\{\widehat{{\varvec{w}}}_{0}^{(*)},{\widehat{W}}^{(*)},{\widehat{U}}^{(*)},{\widehat{O}}^{(*)},{\widehat{B}}^{(*)},{\widehat{S}}^{(*)}\}\). In addition, any limit point is a stationary point of the augmented Lagrangian (20).

This theorem ensures that our modified ADMM algorithm converges to a stationary point under mild conditions. The conditions provide practical guidelines for parameter selection in the algorithm. Specifically, we can ensure convergence by setting \(\lambda _1\) larger than the weak convexity modulus \(C_p\) and choosing an appropriate step size \(\nu \). However, as we will see in later simulation studies, even if these conditions are not satisfied, the algorithm will converge empirically in many situations. The proof of Theorem 1 is given in the Supplementary Material.

Furthermore, we consider the following augmented Lagrangian derived from Problem (19):

$$\begin{aligned} \begin{aligned} L_{\nu } ({\varvec{w}}_{0},W,U,B,S)&= \sum _{m=1}^{T}\frac{1}{n_{m}}L(w_{m0},{\varvec{w}}_{m}) \\ &\quad +\frac{\lambda _{1}}{2}\sum _{m=1}^{T}\rho _{\lambda _{3}/\lambda _{1},\gamma }({\varvec{w}}_{m}-{\varvec{u}}_{m}) \\ &\quad +\lambda _{2}\sum _{(m_{1},m_{2})\in {\mathcal {E}}}r_{m_{1},m_{2}}\Vert {\varvec{b}}_{(m_{1},m_{2})} \Vert _{2}\!\!\!\!\! \\ &\quad + \text {tr}(S^{\top }(B-A_{{\mathcal {E}}}U)) \\ &\quad + \frac{\nu }{2} \Vert B-A_{{\mathcal {E}}}U\Vert _{F}^{2}. \end{aligned} \end{aligned}$$
(22)

Then, the following proposition holds.

Proposition 3

Suppose that the assumptions of Theorem 1 hold with \(\lambda _{1}>C_{p}\) and \(\nu >\frac{2\lambda _{1}^{2}}{\lambda _{++}(A_{{\mathcal {E}}}^{\top }A_{{\mathcal {E}}})(\lambda _{1}-C_{p})}\). Then, any limit point \(\{\widehat{{\varvec{w}}}_{0}^{(*)},{\widehat{W}}^{(*)},{\widehat{U}}^{(*)},{\widehat{B}}^{(*)},{\widehat{S}}^{(*)}\}\) is a stationary point of the minimization problem concerning (22).

This proposition is the ADMM version of Proposition 2, which would justify minimizing MTLRRC by using the modified ADMM algorithm instead of the BCD algorithm. Note that due to the non-convexity of the objective function except when group lasso is used, there may be a duality gap between the original MTLRRC and the minimization problem concerning its augmented Lagrangian.

We analyze the computational complexity of our proposed method. The initialization of \(\iota \) requires \({\mathcal {O}}(|{\mathcal {E}}|T^{2})\) operations, which is computed only once before the ADMM iterations. Within each ADMM iteration, updating \(({\varvec{w}}_{0}^{(t+1)},W^{(t+1)})\) via the Newton–Raphson method requires \({\mathcal {O}}(N_{\textrm{NR}} p^{2}(n+Tp))\) operations, where \(N_{\textrm{NR}}\) denotes the number of Newton-Raphson iterations. The update of \(U^{(t+1)}\) via accelerated gradient method requires \({\mathcal {O}}(N_{\text {AG}}|{\mathcal {E}}|Tp)\) operations, where \(N_{\text {AG}}\) represents the number of accelerated gradient method iterations. The update of \(O^{(t+1)}\) requires \({\mathcal {O}}(Tp)\) operations. The update of \(S^{(t+1)}\) requires \({\mathcal {O}}(|{\mathcal {E}}|Tp)\) operations. Consequently, each iteration of Algorithm 5 has an overall computational complexity of \({\mathcal {O}}(N_{\text {NR}}p^{2}(n+Tp)+N_{\text {AG}}|{\mathcal {E}}|Tp)\). In contrast, Algorithm 3 requires \({\mathcal {O}}(N_{\text {NR}}p^{2}(n+Tp)+N_{\text {A}}N_{\text {AG}}|{\mathcal {E}}|Tp)\) operations, where \(N_{\textrm{A}}\) denotes the number of ADMM iterations required for solving the convex clustering problem. Thus, the computational complexity of the BCD algorithm regarding the second term is \(N_{\textrm{A}}\) times larger than that of the modified ADMM algorithm, which can be substantial when \(N_{\textrm{A}}\) is large.

We further compare our method with RCMTL (Yao et al. 2019), which is the only existing method that addresses both robustness against outlier tasks and task clustering. RCMTL employs a BCD algorithm incorporating two inner ADMM algorithms. Their algorithm requires \({\mathcal {O}}(N_{\mathrm {A_{1}}}N_{\textrm{G}}Tp(n+T)+N_{\textrm{A}_{2}}T^{2}p)\) operations in each iteration, where \(N_{\mathrm {A_{1}}}\), \(N_{\textrm{A}_{2}}\), and \(N_{\textrm{G}}\) denote the iteration counts for the first inner ADMM, second inner ADMM, and gradient method within the first inner ADMM, respectively. In comparing MTLRRC and RCMTL, the first term in both complexity expressions primarily reflects the computational cost of updating regression coefficients, while the second term corresponds to updating the task relationship components. For the first term, direct comparison is difficult as RCMTL employs backtracking to avoid matrix inversion computations. Regarding the second term in our modified ADMM, when the weights \(r_{m_{1},m_{2}}\) are constructed according to Eq. (3), the computational complexity becomes \({\mathcal {O}}(N_{\text {AG}}kT^{2}p)\) since \(|{\mathcal {E}}|\le kT\). Given that k is typically small (e.g., \(k = 5\)) in the context of convex clustering and considering accelerated gradient method’s quadratic convergence rate, our modified ADMM algorithm achieves computational efficiency comparable to or better than RCMTL in the task relationship part.

Algorithm 4
Newton–Raphson method for updating \(w_{m0}\) and \({\varvec{w}}_{m}\)

Algorithm 5
Estimation algorithm of MTLRRC via modified ADMM

5 Simulation studies

In this section, we report simulation studies in the linear regression setting. We generated data by the true model:

$$\begin{aligned} \begin{aligned} {\varvec{y}}_{m} = X_{m}{\varvec{w}}_{m}^{*}+{\varvec{\epsilon }}_{m}, \quad m=1,\ldots ,T, \end{aligned} \end{aligned}$$

where \({\varvec{\epsilon }}_{m}\) is an error term whose components are independently distributed as \(N(0,\sigma ^{2})\), \(X_{m}\) is a design matrix generated independently from \(N_{p}({\varvec{0}},I_{p})\), and \({\varvec{w}}^{*}_{m}\) is a true regression coefficient vector for the m-th task. For this true model, the T tasks consist of C true clusters and some outlier tasks. First, all tasks were assigned to the C clusters with the same number of tasks, T/C, in each cluster. Then, some of them were randomly reassigned as outlier tasks.

For the true structure of regression coefficient vectors, we considered the following two cases:

$$\begin{aligned} \begin{aligned} \text{ Case } \text{1: }&\quad {\varvec{w}}_{m}^{*}={\varvec{u}}_{c}^{*} + {\varvec{v}}_{m}^{c*}+I(\tau _{m}=1){\varvec{o}}_{m}^{c*}, \\ \text{ Case } \text{2: }&\quad {\varvec{w}}_{m}^{*} = {\left\{ \begin{array}{ll} {\varvec{u}}_{c}^{*} + {\varvec{v}}_{m}^{c*} & \textrm{if}\;\tau _{m} =0, \\ {\varvec{o}}_{m}^{*} & \textrm{if}\; \tau _{m} = 1, \end{array}\right. } \end{aligned} \end{aligned}$$

where \({\varvec{u}}_{c}^{*}\) is a true cluster center for the c-th cluster, \({\varvec{v}}_{m}^{c*}\) is a true task-specific parameter for the m-th task belonging to the c-th cluster, \({\varvec{o}}_{m}^{*}\) is a true outlier parameter for the m-th task, and \(\tau _{m}\) is a random variable distributed as \( P(\tau _{m}=1)=\kappa \) and \(P(\tau _{m}=0)=1-\kappa \). \(\tau _{m}=1\) means that the m-th task is assigned as an outlier task. Case 1 considers a situation where outlier tasks share the same cluster center with other tasks, but an outlier parameter is added. On the other hand, Case 2 considers a situation where outlier tasks do not have any common structure with other tasks.

The parameters \({\varvec{u}}_{c}^{*},{\varvec{v}}_{m}^{c*}\), and \({\varvec{o}}_{m}^{*}\) were generated as follows. First, each explanatory variable \(\{j=1,\ldots ,p\}\) was randomly assigned to the c-th clusters \(\{c=1,\ldots ,C \}\) with the same probability. Then, we generated a true centroid parameter for c-th cluster \({\varvec{u}}^{*}_{c} = (u^{*}_{c1},\ldots ,u^{*}_{cp})^{\top }\) by

$$\begin{aligned} u_{cj}^{*} {\left\{ \begin{array}{ll} \sim N(0,10) & \text {if }j\text {-th variable is assigned to } c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p. \end{aligned}$$

Next, we generated a true task-specific parameter for m-th task that belongs to c-th cluster \({\varvec{v}}^{c*}_{m} = (v^{c*}_{m1},\ldots ,v^{c*}_{mp})^{\top }\) by

$$\begin{aligned} v_{mj}^{c*} {\left\{ \begin{array}{ll} \sim N(0,1) & \text {if }j\text {-th variable is assigned to }c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p. \end{aligned}$$

For Case 1, we generated a true outlier parameter for m-th task belonging to c-th cluster but assigned to an outlier task \({\varvec{o}}_{m}^{c*}=(o_{m1}^{c*},\ldots ,o_{mp}^{c*})^{\top }\) by

$$\begin{aligned} o_{mj}^{c*} {\left\{ \begin{array}{ll} \sim f^{\textrm{MTN}}(o) & \text {if }j\text {-th variable is assigned to }c\text {-th cluster},\\ =0 & \text {otherwise}, \end{array}\right. } \quad j=1,\ldots ,p, \end{aligned}$$

where \(f^{\textrm{MTN}}(o)\) is a mixture of truncated normal distribution given by

$$\begin{aligned} f^{\textrm{MTN}}(o)= & 0.5f(o,-\infty ,-3,-3,\sigma _{o}^{2})\\ & + 0.5f(o,3,\infty ,3,\sigma _{o}^{2}), \end{aligned}$$

where \(f(o,a,b,\mu ,\sigma ^{2})\) is a truncated normal distribution on \(a\le o \le b\) whose original normal distribution has mean \(\mu \) and variance \(\sigma ^{2}\). The reason for generating \(o_{mj}^{c*}\) from \(f^{\textrm{MTN}}(o)\) is to keep the absolute value of \(o_{mj}^{c*}\) away from zero so that the outlier task is located away from the cluster. For Case 2, we generated a true outlier parameter for the m-th task \({\varvec{o}}_{m}^{*}=(o_{m1}^{*},\ldots ,o_{mp}^{*})^{\top }\) by

$$\begin{aligned} o_{mj}^{*} \sim U(-10,10), \quad j=1,\ldots ,p, \end{aligned}$$

where U(ab) is the continuous uniform distribution.

From these generating procedures for \({\varvec{u}}_{c}^{*},{\varvec{v}}_{m}^{c*}\), and \({\varvec{o}}_{m}^{*}\), true regression coefficient vectors for non-outlier tasks belonging to different clusters have different non-zero variables. In other words, tasks belonging to different clusters are orthogonal to each other. This orthogonal setting has been used in some studies of MTL (Jacob et al. 2008; Zhou and Zhao 2016).

For our true model, we set \(n_{m}=200\), \(p=100\), \(T=150\), and \(\sigma ^{2}=5\). The 200 samples in each task were split into 50 samples for training, 100 samples for validation, and the remaining 50 samples for testing. We considered the settings \(\kappa \in \{0,0.1,0.2,0.3,0.4\}\).
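To make the data-generating process concrete, the following is a compact sketch of how the true coefficients for Case 1 might be generated. The number of clusters C, the standard deviation `sigma_o`, the interpretation of \(N(0,10)\) as having variance 10, the seed, and the helper names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
T, p, C, kappa, sigma_o = 150, 100, 3, 0.2, 1.0   # C and sigma_o assumed for illustration

cluster_of_task = np.repeat(np.arange(C), T // C)  # tasks split evenly into C clusters
cluster_of_var = rng.integers(0, C, size=p)        # each variable assigned to one cluster
tau = rng.binomial(1, kappa, size=T)               # tau_m = 1 marks an outlier task

def mixture_truncnorm(size):
    """0.5 * TN(-inf, -3; mean -3) + 0.5 * TN(3, inf; mean 3), sd sigma_o."""
    left = truncnorm.rvs(-np.inf, 0.0, loc=-3.0, scale=sigma_o, size=size, random_state=rng)
    right = truncnorm.rvs(0.0, np.inf, loc=3.0, scale=sigma_o, size=size, random_state=rng)
    return np.where(rng.random(size) < 0.5, left, right)

U_true = np.zeros((C, p))
for c in range(C):
    mask = cluster_of_var == c
    U_true[c, mask] = rng.normal(0.0, np.sqrt(10.0), size=mask.sum())  # N(0, 10): variance 10 assumed

W_true = np.zeros((T, p))
for m in range(T):
    c = cluster_of_task[m]
    mask = cluster_of_var == c
    W_true[m, mask] = U_true[c, mask] + rng.normal(0.0, 1.0, size=mask.sum())
    if tau[m] == 1:                                # Case 1: add an outlier component
        W_true[m, mask] += mixture_truncnorm(mask.sum())
```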

We compared MTLRRC with several methods for the evaluation. For MTLRRC, we consider the three cases where the group lasso (GL), group SCAD (GS), and group MCP (GM) are used for the fourth term in (13). Here, we set \(\gamma =3.7\) for the group SCAD and \(\gamma =3\) for the group MCP. As other competing methods, we employed MTLCVX, MTLK (multi-task learning via k-means; Argyriou et al. 2007), RCMTL (Yao et al. 2019), and Hotelling-like outlier task detection with MTLK (HMTLK). Here, HMTLK was performed as follows. First, \(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}\; (m=1,\ldots ,T)\) were estimated, and their sample mean \(\bar{{\varvec{w}}}^{\textrm{STL}}\) and covariance matrix \(\bar{\Sigma }^{\textrm{STL}}\) were calculated. From these values, we calculated the statistic \(h_{m}=(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}-\bar{{\varvec{w}}}^{\textrm{STL}})^{\top }(\bar{\Sigma }^{\textrm{STL}})^{-1}(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}-\bar{{\varvec{w}}}^{\textrm{STL}})\). Then, we detected the tasks that satisfied \(h_{m}\ge \chi ^{(95)}_{p}\), where \(\chi ^{(95)}_{p}\) is the 95th percentile point of the \(\chi ^{2}\) distribution with p degrees of freedom. Finally, we estimated the regression coefficients by MTLK excluding the detected outlier tasks. Note that the detection based on Hotelling's \(T^{2}\) is not theoretically justified for this simulation setting. However, as we will see later, it is able to detect some outlier tasks.
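A sketch of the HMTLK screening step described above, assuming the single-task estimates are stacked in a \(T\times p\) array `W_stl`; the use of a pseudo-inverse here is an implementation choice for illustration rather than part of the original procedure.

```python
import numpy as np
from scipy.stats import chi2

def hotelling_outlier_tasks(W_stl, level=0.95):
    """Flag task m if h_m = (w_m - w_bar)^T Sigma^{-1} (w_m - w_bar) exceeds
    the `level` quantile of the chi-square distribution with p degrees of freedom."""
    T, p = W_stl.shape
    w_bar = W_stl.mean(axis=0)
    Sigma = np.cov(W_stl, rowvar=False)
    Sigma_inv = np.linalg.pinv(Sigma)        # pseudo-inverse for numerical stability
    diffs = W_stl - w_bar
    h = np.einsum("mj,jk,mk->m", diffs, Sigma_inv, diffs)
    return np.where(h >= chi2.ppf(level, df=p))[0]
```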

The weights \(r_{m_{1},m_{2}}\) for both MTLCVX and MTLRRC were calculated by Eq. (3), with k set to five. The estimation of \(\widehat{{\varvec{w}}}_{m}^{\textrm{STL}}\) in Eq. (3) and in HMTLK was performed by the lasso in the R package “glmnet”. In addition, the initialization of RCMTL was done by singular value decomposition of the initial regression coefficient matrix, which was also calculated by “glmnet”. The tuning parameter \(\nu \) included in Algorithm 5 was set to one. The regularization parameters were determined by the validation data. For the evaluation, we calculated the normalized mean squared error (NMSE), root mean squared error (RMSE), true positive rate (TPR), and false positive rate (FPR):

$$\begin{aligned} \begin{aligned} \textrm{NMSE}&= \frac{1}{T} \sum _{m=1}^{T}\frac{\Vert {\varvec{y}}^{*}_{m}-X_{m}\widehat{{\varvec{w}}}_{m} \Vert _{2}^{2}}{n_{m}\textrm{Var}({\varvec{y_{m}^{*}}})},\\ \textrm{RMSE}&= \frac{1}{T}\sqrt{\sum _{m=1}^{T}\Vert {\varvec{w}}^{*}_{m}-\widehat{{\varvec{w}}}_{m} \Vert _{2}^{2}},\\ \textrm{TPR}&= \frac{\#\left\{ m;{\varvec{o}}_{m}^{*} \ne {\varvec{0}} \;\wedge \;\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\right\} }{\#\left\{ m;{\varvec{o}}_{m}^{*}\ne {\varvec{0}}\right\} }, \\ \textrm{FPR}&= \frac{\#\left\{ m;{\varvec{o}}_{m}^{*} = {\varvec{0}} \;\wedge \;\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\right\} }{\#\left\{ m;{\varvec{o}}_{m}^{*}= {\varvec{0}}\right\} }. \end{aligned} \end{aligned}$$

NMSE and RMSE evaluate the accuracy of the prediction and of the estimated regression coefficients, respectively. TPR and FPR evaluate the accuracy of outlier detection. These values were computed over 40 repetitions, and the mean and standard deviation were obtained for each setting.
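A sketch of the four evaluation metrics defined above, assuming lists of per-task arrays (`y_true[m]`, `X[m]`, `W_hat[m]`) and that the \(\widehat{{\varvec{o}}}_{m}\) vectors are stacked row-wise in `O_hat`; names are illustrative, and the population variance is used for \(\textrm{Var}({\varvec{y}}_{m}^{*})\).

```python
import numpy as np

def nmse(y_true, X, W_hat):
    T = len(y_true)
    return np.mean([np.sum((y_true[m] - X[m] @ W_hat[m]) ** 2)
                    / (len(y_true[m]) * np.var(y_true[m])) for m in range(T)])

def rmse(W_true, W_hat):
    T = len(W_true)
    return np.sqrt(np.sum((np.asarray(W_true) - np.asarray(W_hat)) ** 2)) / T

def tpr_fpr(O_true, O_hat, tol=1e-10):
    true_out = np.linalg.norm(O_true, axis=1) > tol   # tasks with o_m^* != 0
    est_out = np.linalg.norm(O_hat, axis=1) > tol     # tasks with estimated o_m != 0
    tpr = np.sum(true_out & est_out) / max(np.sum(true_out), 1)
    fpr = np.sum(~true_out & est_out) / max(np.sum(~true_out), 1)
    return tpr, fpr
```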

Tables 1 and 2 show the results of the simulation studies for Cases 1 and 2, respectively. For Case 1, MTLRRC and MTLCVX show almost the same accuracy in NMSE and RMSE for all values of \(\kappa \). MTLRRC, MTLCVX, and MTLK, which do not remove the outlier tasks a priori, outperform HMTLK in terms of NMSE and RMSE. This may suggest that multi-task learning improves the estimation accuracy even for outlier tasks that are located away from other tasks. As for TPR, HMTLK shows better results than MTLRRC. If the purpose is only to detect and eliminate outlier tasks, HMTLK is probably a better choice than robust MTL methods. For FPR, MTLRRC with non-convex penalties achieves almost the best performance for any \(\kappa \). For Case 2, MTLRRC with non-convex penalties shows better performance than MTLCVX in terms of NMSE and RMSE. Furthermore, MTLRRC with non-convex penalties is superior to HMTLK in terms of both TPR and FPR. As for RCMTL, the results do not show any superior performance in any setting. This is possibly because the initial values obtained from “glmnet” were incompatible with their parameter initialization method based on singular value decomposition, which causes the optimization algorithm to fall into local optima with poor accuracy. Note that in Tables 1 and 2, we do not highlight any TPR of RCMTL regardless of its large value, because RCMTL selects almost all tasks, including non-outlier tasks, and the value is therefore not meaningful.

On the whole, MTLRRC with non-convex penalties detected true outlier tasks while keeping the detection of false outlier tasks low for both Case 1 and Case 2. However, the differences in estimation accuracy between MTLCVX and MTLRRC are small, particularly for small \(\kappa \). On the other hand, MTLRRC with the group lasso regularization shows poor performance even in terms of TPR and FPR. For outlier task detection, the group lasso regularization would not be recommended.

Table 1 Simulation result of Case 1
Table 2 Simulation result of Case 2

We conducted a sensitivity analysis for the selection of the regularization parameters and \(\gamma \). The analysis used a fixed dataset generated from Case 1 and Case 2 with \(\kappa =0.2\), maintaining the other settings consistent with our previous simulation studies. For the regularization term, we employed the group SCAD penalty. First, we selected the optimal regularization parameters by grid search with \(\gamma \) fixed at 3.7. Then, we estimated MTLRRC and calculated the evaluation values by varying only one parameter while fixing the other parameters at their optimal values. Consequently, we obtain curves representing the empirical relationship between each tuning parameter and the evaluation values, as shown in Figs. 2 and 3 for Case 1 and Case 2, respectively.

For \(\lambda _{1}\), the selected value is the same for both Case 1 and Case 2. Only in Case 2 are all evaluation values sensitive to the selection of \(\lambda _{1}\). Since the trends of all evaluation values are synchronized, selecting the optimal value that minimizes NMSE yields favorable TPR and FPR values. Therefore, it would be essential to search for \(\lambda _{1}\) over a fine grid. For \(\lambda _{2}\), in Case 1, while the curves of NMSE, RMSE, and FPR are synchronized, that of TPR is not. This would explain the small TPR in the Case 1 results, as seen in Table 1. In Case 2, the evaluation values are consistent and less sensitive to the value of \(\lambda _{2}\). For \(\lambda _{3}\), in Case 1, there is no value of \(\lambda _{3}\) that improves both FPR and TPR. The reason for this is that an appropriate \(\lambda _{2}\) is not selected, and the search for the true structure regarding outlier tasks and the cluster structure has failed. On the other hand, in Case 2, when \(\lambda _{3}\) is larger than a certain value, TPR and NMSE worsen simultaneously. This result is consistent with our motivation that detecting outliers improves estimation accuracy. For \(\gamma \), its value has a limited impact when the other tuning parameters are appropriately selected. On the whole, if the outlier tasks have large task-specific characteristics, as in Case 2, then the proposed method can appropriately detect the outlier tasks by choosing the parameters that minimize NMSE. However, if the latent outlier components are subtle, as in Case 1, then selecting the parameters based on NMSE may fail to identify the outlier tasks. This limitation requires further investigation in future work.

Fig. 2
Evaluation values against regularization parameters under Case 1. Each figure shows the relationship between one tuning parameter and the evaluation values. The horizontal axis represents the parameter value, and the vertical axis represents the evaluation values. The blue line represents NMSE, the green line represents RMSE, the red line represents TPR, and the purple line represents FPR. Note that the RMSEs are normalized by their maximum values. The dashed line corresponds to the optimal value of the parameter that minimizes NMSE

Fig. 3
Evaluation values against regularization parameters under Case 2

We also check the convergence of our modified ADMM algorithm with the group SCAD penalty. Figure 4 shows the convergence results under Case 2. Because \(C_{p}\) of the group SCAD is \(\frac{1}{\gamma -1}\), \(\lambda _{1}>0.371\) satisfies \(\lambda _{1}>C_{p}\) for \(\gamma =3.7\). The figure shows that all parameters converge within forty iterations for both \(\lambda _{1}>C_{p}\) and \(\lambda _{1}<C_{p}\).

Fig. 4
Convergence of modified ADMM algorithm under Case 2. The horizontal axis represents the number of iterations (t). The vertical axis shows the difference of parameter values from the previous iteration: the blue line is \(\Vert W^{(t)}-W^{(t+1)}\Vert _{F}\), the green line is \(\Vert U^{(t)}-U^{(t+1)}\Vert _{F}\), the red line is \(\Vert O^{(t)}-O^{(t+1)}\Vert _{F}\), and the purple line is \(\Vert S^{(t)}-S^{(t+1)}\Vert _{F}\)

6 Application to real datasets

In this section, we apply MTLRRC to three real datasets. The first is the landmine data (Xue et al. 2007), which consists of nine-dimensional features and the corresponding binary labels for 29 tasks. Each task corresponds to a landmine field where data were collected: tasks 1–15 correspond to regions that are relatively highly foliated, and tasks 16–29 correspond to regions that are bare earth or desert. Thus, there may be two clusters depending on the ground surface conditions. The responses represent landmines or clutter. The features are four moment-based features, three correlation-based features, one energy ratio feature, and one spatial variance feature, all extracted from radar images. This dataset contains 14,820 samples in total, but the numbers of positive and negative samples are quite unbalanced. To perform the analysis stably, down-sampling was done by reducing the negative samples to equal the number of positive samples. As a result, we used 1,808 samples in total. The second dataset is the school data (Bakker and Heskes 2003), which consists of examination scores of 15,362 students, four school-specific attributes, and three student-specific attributes from 139 secondary schools in London from 1985 to 1987. The dataset was obtained from the “MALSAR” package in MATLAB, in which categorical attributes were replaced with binary attributes. Then, we used 28-dimensional explanatory variables and the examination scores as the response. Each school is considered a task. The third dataset is the microarray data (Wille et al. 2004), which consists of microarray gene expression data focusing on isoprenoid biosynthesis in plants. The dataset contains expression levels of 21 genes in the mevalonate pathway and expression levels of 18 genes in the plastidial pathway. We used those 21 genes as feature variables and each of the 18 genes as a task. Since the dataset consists of only one design matrix, the setting is rather that of multivariate regression.

Table 3 AUC and NMSE for real data in 100 repetitions
Fig. 5

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the landmine data

Fig. 6

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the school data

Fig. 7

The mean of the estimated parameter values in MTLRRC (GS\(\gamma \)) over 100 repetitions for the microarray data

Fig. 8

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the landmine data

Fig. 9

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the school data

Fig. 10

Ratio of \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\) over 100 repetitions for the microarray data

We compared MTLRRC with MTLCVX, RCMTL, DTFLR (distributed spanning-tree-based fused-lasso regression; Zhang et al. 2024), and MTLK. First, we split the samples in each task into \(60\%\) for training, \(20\%\) for validation, and \(20\%\) for testing. The regularization parameters are determined by the validation data. For MTLRRC, we use the non-convex penalties and consider four cases: in the first case, \(\gamma \) is fixed at 3 for the group MCP (GM); in the second case, \(\gamma \) is fixed at 3.7 for the group SCAD (GS); in the third and fourth cases, \(\gamma \) is chosen by the validation data for the group MCP and the group SCAD, respectively (GM\(\gamma \) and GS\(\gamma \)). For evaluation, we calculated NMSE for the school data and the microarray data and AUC for the landmine data. These values were computed 100 times with random splits of each dataset.
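The evaluation protocol amounts to repeated random splitting within each task followed by AUC or NMSE computation on the test portion. The sketch below illustrates one repetition, using scikit-learn's `roc_auc_score` for the landmine data; `fit_and_select`, which performs the fit and validation-based parameter selection, and the per-task `predict` interface are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def split_task(X, y, rng):
    # 60% train / 20% validation / 20% test split within one task.
    n = len(y)
    perm = rng.permutation(n)
    n_tr, n_val = int(0.6 * n), int(0.2 * n)
    tr, val, te = perm[:n_tr], perm[n_tr:n_tr + n_val], perm[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])

def evaluate_once(X_tasks, y_tasks, fit_and_select, metric, rng):
    splits = [split_task(X, y, rng) for X, y in zip(X_tasks, y_tasks)]
    train, val, test = zip(*splits)
    model = fit_and_select(train, val)          # hypothetical fit + parameter selection
    scores = [metric(y_te, model.predict(m, X_te))
              for m, (X_te, y_te) in enumerate(test)]
    return np.mean(scores)

# For the landmine data one would pass metric=roc_auc_score; for the school and
# microarray data a normalized MSE would be used instead.
```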

Table 3 shows the mean and standard deviation over 100 repetitions. Due to memory restrictions, the implementation code published by Zhang et al. (2024) was infeasible for the school data. From this table, we observe that MTLRRC (GS\(\gamma \)) gives a smaller NMSE than the other methods for the school data and the microarray data and a larger AUC for the landmine data.

Next, we calculated the mean of the estimated parameters \({\widehat{W}},{\widehat{U}}\), and \({\widehat{O}}\) in MTLRRC (GS\(\gamma \)). Figures 5, 6, and 7 show the mean of the estimated parameter values for the landmine data, the school data, and the microarray data, respectively. The vertical axis is the index of tasks, the horizontal axis is the index of features, and each color shows the mean of the estimated parameters. Figures 8, 9, and 10 are bar plots showing the ratio of repetitions in which each task was detected as an outlier task; tasks that were never detected are omitted from the bar plots.
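The detection ratios in Figures 8, 9, and 10 are the fractions of repetitions in which \(\widehat{{\varvec{o}}}_{m}\ne {\varvec{0}}\). A minimal sketch, assuming the estimated outlier parameters from all repetitions are stacked into a three-dimensional array, is as follows.

```python
import numpy as np

def outlier_detection_ratio(O_hats, tol=1e-8):
    """Fraction of repetitions in which each task's outlier parameter is nonzero.

    O_hats: array of shape (n_repetitions, n_tasks, n_features) holding the
    estimated outlier parameters from each repetition (hypothetical layout).
    """
    detected = np.linalg.norm(O_hats, axis=2) > tol   # nonzero row per repetition and task
    return detected.mean(axis=0)                      # detection ratio per task

# Example: ratios = outlier_detection_ratio(np.stack(estimated_O_list));
# tasks with a ratio above 0.3 correspond to those highlighted in the text.
```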

For the landmine data, Fig. 5b suggests the presence of two clusters, which is consistent with the fact that tasks 1–15 and 16–29 are obtained from regions with different surface conditions. As for outlier detection, the 10th task takes relatively larger or smaller values than the other tasks in Fig. 5c and is detected as an outlier with a ratio greater than 0.3 in Fig. 8, which indicates that it is a potential outlier task. The 9th task also has a detection ratio greater than 0.3, although the mean of \(\widehat{{\varvec{o}}}_{m}\) in Fig. 5c is not as clear as that of the 10th task. The estimated regression coefficients for the 9th and 10th tasks in Fig. 5a are similar to those of the other tasks within the same estimated cluster. These results suggest that the underlying structure among tasks in the landmine data resembles Case 1 rather than Case 2 in Sect. 5. Furthermore, from Fig. 8, only two of tasks 16–29 have an outlier detection ratio of more than 0.1, whereas 13 of tasks 1–15 do. This suggests that the cluster composed of tasks 1–15 has relatively large variability, which may provide some insight into sub-group structure within the cluster.

For the school data, we obtained a homogeneous pattern in Fig. 6b. The school data have been considered rather homogeneous in several studies (Bakker and Heskes 2003; Evgeniou et al. 2005). Thus, this result is reasonable and shows that MTLRRC can consistently estimate the cluster structure even when the true number of underlying clusters is one. From Fig. 9, the 2nd, 28th, 66th, 76th, and 83rd tasks were detected with a ratio greater than 0.3. In addition, these tasks show larger values of \(\widehat{{\varvec{o}}}_{m}\) than the other tasks in Fig. 6c. As for the regression coefficients in Fig. 6a, although these tasks share many characteristics with the other tasks, some coefficients behave differently. For example, the coefficients of the 16th feature for the 76th and 83rd tasks take relatively large and small values, respectively, which differ substantially from those of the other tasks.

For the microarray data, the analysis revealed heterogeneous patterns as shown in Figs. 7a, b, without exhibiting distinctive features. Figure 10 indicates that none of the tasks were clearly identified as outliers, suggesting the absence of outlier tasks in the microarray dataset. This finding presents an interesting contrast to the school data and landmine data, where the existence of outlier tasks was indicated.

7 Conclusion

In this paper, we proposed a robust multi-task learning method called Multi-Task Learning with Robust Regularized Clustering (MTLRRC). To perform the clustering of tasks and the detection of outlier tasks simultaneously, we incorporated regularization terms based on robust regularized clustering (RRC), which can detect outlier samples by selecting outlier parameters through group-sparse penalties. We showed that the solution of the RRC obtained by the BCD algorithm shares the first-order condition with convex clustering whose loss function is replaced by a multivariate robust loss function. Thus, MTLRRC is expected to perform robust clustering of tasks. Furthermore, the solution of MTLRRC obtained by the BCD algorithm can also be viewed as a convergence point of an alternating optimization that involves the RRC for tasks and regression problems in which the shrinkage of outlier tasks toward the estimated centroids is reduced. To mitigate the computational cost, we developed an estimation algorithm based on the modified ADMM.

We conducted simulation studies to evaluate the performance of MTLRRC under two scenarios of outlier task structures. In Case 1, outlier tasks share the same characteristics as the centroid but have additional outlier parameters. In Case 2, outlier tasks do not share any common structure and are independent of the other tasks. In both cases, the proposed method with non-convex group penalties exhibited a near-zero FPR in outlier detection and a much larger TPR in Case 2. However, the improvement in estimation and prediction accuracy was slight when the proportion of outlier tasks was small.

In the application to real data, we observed that the proposed method effectively estimates multiple cluster structures and identifies potential outlier tasks resembling Case 1. These findings suggest that MTLRRC not only estimates clusters but also provides insights into the heterogeneity of outlier tasks within clusters. In usual convex clustering for observed, fixed data, one can examine the behavior of the clusters through the solution path. In contrast, our clustering targets are regression coefficients, whose values also depend on the other regularization parameters such as \(\lambda _{1}\), so it is difficult to examine the solution path. However, outlier task detection allows us to obtain information about the behavior of the clusters at the selected regularization parameters. Furthermore, the circumstances corresponding to the detected tasks can be re-examined, because their datasets are likely to have unique characteristics compared with the other datasets.

One limitation of our study is that MTLRRC includes three or four regularization parameters to be determined, and the computational cost of searching for their optimal values can be demanding. In addition, the definition of the multivariate M-estimator connected to the group penalties differs from the traditional one (Maronna 1976): the former is defined as a straightforward extension of the univariate M-estimator with a robust multivariate loss function, whereas the traditional one is defined as the solution of weighted log-likelihood equations. Although there may be some relationship between these two definitions, they are probably not equivalent. Moreover, outlier tasks resembling Case 2 were not detected in the analyzed real data. We leave these topics as future work.