Abstract
Data integration is the process of extracting information from multiple sources and jointly analyzing different data sets. In this paper, we propose to use a mixed \(\ell _{2,1}\)-regularized composite quasi-likelihood function to perform multi-task feature learning with different types of responses, including continuous and discrete responses. In high-dimensional settings, we establish sign-recovery consistency and estimation error bounds for the penalized estimates under regularity conditions. Simulation studies and real data analysis examples illustrate the utility of the proposed method in combining correlated platforms with heterogeneous tasks and performing joint sparse estimation.
Data availability
All data sets are available online, as stated in Sect. 5.
Code availability
The code is provided in the R package “HMTL” at https://CRAN.R-project.org/package=HMTL.
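For reference, the package can be installed from CRAN and loaded in the usual way; the short R snippet below only installs and attaches the package (the exported fitting and tuning functions are documented in the package manual and are not reproduced here).

```r
# Install the released version of HMTL from CRAN (run once).
install.packages("HMTL")

# Attach the package; see help(package = "HMTL") for the list of
# exported functions and their arguments.
library(HMTL)
```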
References
Agarwal, A., Negahban, S., & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2), 1171–1197.
Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(61), 1817–1853.
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. In: Proceedings of the 19th international conference on neural information processing systems. MIT Press, Cambridge, MA, USA, NIPS’06, pp. 41–48.
Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(40), 1179–1225.
Bai, H., Zhong, Y., Gao, X., et al. (2020). Multivariate mixed response model with pairwise composite-likelihood method. Stats, 3(3), 203–220.
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
Cadenas, C., van de Sandt, L., Edlund, K., et al. (2014). Loss of circadian clock gene expression is associated with tumor progression in breast cancer. Cell Cycle, 13(20), 3282–3291. PMID: 25485508.
Cao, H., & Schwarz, E. (2022). RMTL: Regularized multi-task learning. https://CRAN.R-project.org/package=RMTL, R package version 0.9.9.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Cox, D. R., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3), 729–737.
U.S. Department of Health and Human Services. (2010). Community Health Status Indicators (CHSI) to Combat Obesity, Heart Disease and Cancer 1996–2003. Washington, D.C., USA: U.S. Department of Health and Human Services.
Ekvall, K. O., & Molstad, A. J. (2021). mmrr: Mixed-type multivariate response regression. R package version 0.1.
Ekvall, K. O., & Molstad, A. J. (2022). Mixed-type multivariate response regression with covariance estimation. Statistics in Medicine, 41(15), 2768–2785. https://doi.org/10.1002/sim.9383
Eldar, Y. C., Kuppinger, P., & Bolcskei, H. (2010). Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Transactions on Signal Processing, 58(6), 3042–3054.
Fang, E. X., Ning, Y., & Li, R. (2020). Test of significance for high-dimensional longitudinal data. The Annals of Statistics, 48(5), 2622–2645.
Fan, J., Liu, H., Sun, Q., et al. (2018). I-lamm for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46(2), 814–841.
Fan, J., Wang, W., & Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3), 1239–1266. https://doi.org/10.1214/20-aos1980
Gao, X., Zhong, Y., & Carroll, R. J. (2022). FusionLearn: Fusion Learning. https://CRAN.R-project.org/package=FusionLearn, R package version 0.2.1.
Gao, X., & Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251–272.
Gao, X., & Song, P. X. K. (2010). Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105(492), 1531–1540.
Gao, X., & Zhong, Y. (2019). Fusionlearn: a biomarker selection algorithm on cross-platform data. Bioinformatics, 35(21), 4465–4468.
Gaughan, L., Stockley, J., Coffey, K., et al. (2013). KDM4B is a master regulator of the estrogen receptor signalling cascade. Nucleic Acids Research, 41(14), 6892–6904. https://doi.org/10.1093/nar/gkt469
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4), 1208–1211.
Gomez-Cabrero, D., Abugessaisa, I., Maier, D., et al. (2014). Data integration in the era of omics: Current and future challenges. BMC Systems Biology, 8(2), I1.
Gong, P., Ye, J., & Zhang, C. (2013). Multi-stage multi-task feature learning. Journal of Machine Learning Research, 14(55), 2979–3010.
Hatzis, C., Pusztai, L., Valero, V., et al. (2011). A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA, 305(18), 1873–1881.
Hebiri, M., & van de Geer, S. (2011). The Smooth-Lasso and other \(\ell _1+\ell _2\)-penalized methods. Electronic Journal of Statistics, 5(none), 1184–1226.
Heimes, A. S., Härtner, F., Almstedt, K., et al. (2020). Prognostic significance of interferon-\(\gamma\) and its signaling pathway in early breast cancer depends on the molecular subtypes. International Journal of Molecular Sciences, 21(19).
Hellwig, B., Hengstler, J. G., Schmidt, M., et al. (2010). Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC Bioinformatics, 11(1), 276.
Itoh, M., Iwamoto, T., Matsuoka, J., et al. (2014). Estrogen receptor (er) mrna expression and molecular subtype distribution in er-negative/progesterone receptor-positive breast cancers. Breast Cancer Research and Treatment, 143(2), 403–409.
Ivshina, A. V., George, J., Senko, O., et al. (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Research, 66(21), 10292–10301.
Jalali, A., Sanghavi, S., Ruan, C., et al. (2010). A dirty model for multi-task learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, et al. (Eds.), Advances in neural information processing systems. (Vol. 23). Curran Associates Inc.
Kanomata, N., Kurebayashi, J., Koike, Y., et al. (2019). Cd1d-and pja2-related immune microenvironment differs between invasive breast carcinomas with and without a micropapillary feature. BMC Cancer, 19(1), 1–9.
Karn, T., Rody, A., Müller, V., et al. (2014). Control of dataset bias in combined affymetrix cohorts of triple negative breast cancer. Genomics Data, 2, 354–356.
Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80, 220–239.
Liu, J., Ji, S., & Ye, J. (2009). Multi-task feature learning via efficient \(l_{2,1}\)-norm minimization. In: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, Arlington, Virginia, USA, UAI ’09, pp. 339–348.
Liu, C. L., Cheng, S. P., Huang, W. C., et al. (2023). Aberrant expression of solute carrier family 35 member a2 correlates with tumor progression in breast cancer. In Vivo, 37(1), 262–269.
Liu, Q., Xu, Q., Zheng, V. W., et al. (2010). Multi-task learning for cross-platform sirna efficacy prediction: An in-silico study. BMC Bioinformatics, 11(1), 1–16.
Li, Y., Xu, W., & Gao, X. (2021). Graphical-model based high dimensional generalized linear models. Electronic Journal of Statistics, 15(1), 1993–2028.
Loh, P. L., & Wainwright, M. J. (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19), 559–616.
Loh, P. L., & Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6), 2455–2482.
Lounici, K., Pontil, M., van de Geer, S., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4), 2164–2204.
McCullagh, P., & Nelder, J. (1989). Generalized Linear Models, Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series (2nd ed.). London: Chapman & Hall.
Meinshausen, N., & Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., et al. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Negahban, S. N., & Wainwright, M. J. (2011). Simultaneous support recovery in high dimensions: Benefits and perils of block \(\ell _{1}/\ell _{\infty }\)-regularization. IEEE Transactions on Information Theory, 57(6), 3841–3863.
Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161.
Ning, Y., & Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195.
Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.
Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1), 1–47.
Ouyang, Y., Lu, W., Wang, Y., et al. (2023). Integrated analysis of mrna and extrachromosomal circular dna profiles to identify the potential mrna biomarkers in breast cancer. Gene, 857, 147174. https://doi.org/10.1016/j.gene.2023.147174
Poon, W. Y., & Lee, S. Y. (1987). Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients. Psychometrika, 52(3), 409–430.
Rakotomamonjy, A., Flamary, R., Gasso, G., et al. (2011). \(\ell _{p}-\ell _{q}\) penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Transactions on Neural Networks, 22(8), 1307–1320.
Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using \(\ell _1\)-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
Rody, A., Karn, T., Liedtke, C., et al. (2011). A clinically relevant gene signature in triple negative and basal-like breast cancer. Breast Cancer Research, 13(5), R97.
Schmidt, M., Böhm, D., von Törne, C., et al. (2008). The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research, 68(13), 5405–5413.
Sethuraman, A., Brown, M., Krutilina, R., et al. (2018). Bhlhe40 confers a pro-survival and pro-metastatic phenotype to breast cancer cells by modulating hbegf secretion. Breast Cancer Research, 20, 1–17.
Škalamera, D., Dahmer-Heath, M., Stevenson, A. J., et al. (2016). Genome-wide gain-of-function screen for genes that induce epithelial-to-mesenchymal transition in breast cancer. Oncotarget, 7(38), 61000–61020. https://doi.org/10.18632/oncotarget.11314
Sun, Q., Zhou, W. X., & Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115(529), 254–265.
Tang, H., Sebti, S., Titone, R., et al. (2015). Decreased becn1 mrna expression in human breast cancer is associated with estrogen receptor-negative subtypes and poor prognosis. EBioMedicine, 2(3), 255–263.
Thung, K. H., & Wee, C. Y. (2018). A brief review on multi-task learning. Multimedia Tools and Applications, 77(22), 29705–29725.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
van de Geer, S. A., & Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3, 1360–1392.
van de Geer, S., Bühlmann, P., Ritov, Y., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
van de Geer, S., & Müller, P. (2012). Quasi-likelihood and/or robust estimation in high dimensions. Statistical Sciences, 27(4), 469–480.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92(1), 1.
Wainwright, M.J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Wang, W., Liang, Y., & Xing, E. P. (2015). Collective support recovery for multi-design multi-response linear regression. IEEE Transactions on Information Theory, 61(1), 513–534.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the gauss-newton method. Biometrika, 61(3), 439–447.
Wigington, C. P., Morris, K. J., Newman, L. E., et al. (2016). The polyadenosine rna-binding protein, zinc finger cys3his protein 14 (zc3h14), regulates the pre-mrna processing of a key atp synthase subunit mrna*. Journal of Biological Chemistry, 291(43), 22442–22459. https://doi.org/10.1074/jbc.M116.754069
Wu, S., Gao, X., & Carroll, R.J. (2023). Model selection of generalized estimating equation with divergent model size. Statistica Sinica, pp. 1–22. https://doi.org/10.5705/ss.202020.0197
Yi, G. Y. (2014). Composite likelihood/pseudolikelihood (pp. 1–14). Wiley StatsRef: Statistics Reference Online.
Yi, G. Y. (2017). Statistical analysis with measurement error or misclassification: strategy, method and application. Berlin: Springer.
Yousefi, N., Lei, Y., Kloft, M., et al. (2018). Local rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38), 1–47.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1), 49–67.
Zhan, X. J., Wang, R., Kuang, X. R., et al. (2023). Elevated expression of myosin vi contributes to breast cancer progression via mapk/erk signaling pathway. Cellular Signalling, 110633.
Zhang, K., Gray, J. W., & Parvin, B. (2010). Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics, 26(12), i97–i105.
Zhang, H., Liu, D., Zhao, J., et al. (2018). Modeling hybrid traits for comorbidity and genetic studies of alcohol and nicotine co-dependence. The Annals of Applied Statistics, 12(4), 2359–2378. https://doi.org/10.1214/18-AOAS1156
Zhang, J. Z., Xu, W., & Hu, P. (2022). Tightly integrated multiomics-based deep tensor survival model for time-to-event prediction. Bioinformatics, 38(12), 3259–3266.
Zhang, Y., & Yang, Q. (2017). A survey on multi-task learning. CoRR, abs/1707.08114. arXiv:1707.08114
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhong, Y., Xu, W., & Gao, X. (2023). HMTL: Heterogeneous Multi-Task Feature Learning. R package version 0.1.0.
Zhou, J., Yuan, L., Liu, J., et al. (2011). A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, KDD ’11, pp. 814–822.
Acknowledgements
X.G.’s research was supported by the Natural Sciences and Engineering Research Council of Canada funding. W.X. was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672). The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.
Funding
X.G.’s research was supported by the Natural Sciences and Engineering Research Council of Canada funding. W.X. was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672).
Author information
Authors and Affiliations
Contributions
All authors contributed to the design of the research problems and wrote the manuscript. Material preparation and the original draft were completed by YZ. Theoretical analysis and methodology development were conducted by XG and YZ. Data collection and analysis were performed by WX, XG, and YZ.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
We declare that all the authors have agreed on the submission of this paper to the Machine Learning journal.
Additional information
Editor: Jean-Philippe Vert.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proofs of Theorems in Sect. 2
This section provides the proofs of Theorems 1 and 2. In the following derivations, we assume that all tasks have identical sample sizes n.
1.1 A.1: Proof of Theorem 1
Proof
Lemma 2.4 shows that, for the solution \({\hat{\theta }} \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\) of the estimating equation (2.4), we have \(({\hat{\theta }} - \theta ^*) \in {\mathcal {C}}(m, \gamma )\) with \(m = c_0Ks\) and \(\gamma = 2\sqrt{K}+1\). Using the results from Lemma 2.2, we obtain the following inequality with probability tending to 1,
We apply Hölder’s inequality in Lemma C.5 to two components in (A1):
Plugging back into (A1), we have
According to Lemma 2.1, \(\Vert n^{-1}\nabla {\mathcal {L}}(\theta ^*) \Vert _{2,\infty } \le \lambda _n/2\) with a probability tending to 1. Therefore, the component \((\Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}^c} \Vert _{2,\infty } - \lambda _n/2 )\Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}^c} \Vert _1 \le 0.\) We simplify the inequality above as follows,
According to the property of mixed \(\ell _{2,\infty }\) norm, \(\Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty } = \sup _p\Vert {\hat{z}}^{(p)}\Vert _{2} = 1\). In addition, \(\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le \sqrt{\vert {{\mathcal {E}}} \vert }\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2}\) with \(\sqrt{\vert {{\mathcal {E}}} \vert } = c_1 s\) for some positive constant \(c_1\). The following inequality can be obtained
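For reference, the two facts used in this step are standard properties of the mixed norms, written here in the paper's grouping (the block \(v^{(p)}\) collects the \(K\) task-specific coefficients of the \(p\)th feature):
\[
\Vert v\Vert _{2,1} = \sum _{p=1}^{p_n} \Vert v^{(p)}\Vert _2, \qquad \Vert v\Vert _{2,\infty } = \max _{1\le p\le p_n} \Vert v^{(p)}\Vert _2,
\]
and, by the Cauchy–Schwarz inequality over the \(\vert {\mathcal {E}}\vert\) active groups,
\[
\Vert v_{{\mathcal {E}}}\Vert _{2,1} = \sum _{p\in {\mathcal {E}}} \Vert v^{(p)}\Vert _2 \le \sqrt{\vert {\mathcal {E}}\vert }\Bigl (\sum _{p\in {\mathcal {E}}} \Vert v^{(p)}\Vert _2^2\Bigr )^{1/2} = \sqrt{\vert {\mathcal {E}}\vert }\,\Vert v_{{\mathcal {E}}}\Vert _2 .
\]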
Therefore, taking the constant \(c_1 = 1\), we have
In addition, we derive the following error bounds based on Lemma 2.4:
\(\square\)
1.2 A.2: Proof of Theorem 2
Proof
The derivative equation (2.4) can be partitioned into two sets of equations based on the two sub-spaces of parameters \({\mathcal {S}}\) and \({\mathcal {S}}^c\):
Based on the definition of sub-differential, the sub-differential \({\hat{z}}_{{\mathcal {S}}}\) contains grouped subsets \({\hat{z}}^{(p)} = {\hat{\theta }}^{(p)}/\Vert {\hat{\theta }}^{(p)}\Vert _2\) with \(p \in {\mathcal {S}}\), and \(\max _{p \in {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1\).
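For completeness, the grouped sub-differential invoked here is the standard one for the mixed \(\ell _{2,1}\) norm,
\[
\partial \Vert \theta \Vert _{2,1} = \Bigl \{ z \in {\mathbb {R}}^{Kp_n} : z^{(p)} = \theta ^{(p)}/\Vert \theta ^{(p)}\Vert _2 \text { if } \theta ^{(p)} \ne {{\textbf {0}}}, \ \Vert z^{(p)}\Vert _2 \le 1 \text { if } \theta ^{(p)} = {{\textbf {0}}} \Bigr \};
\]
the strict inequality \(\max _{p \in {\mathcal {S}}^c} \Vert {\hat{z}}^{(p)}\Vert _2 < 1\) required in this proof is the strict dual feasibility condition, which is stronger than mere membership in the sub-differential.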
According to Lemma 2.2, \({\hat{\theta }}\) is a local optimum of the objective function with high probability. Consider an estimator with \({\hat{\theta }}_{{\mathcal {S}},0} = ( {\hat{\theta }}_{{\mathcal {S}}}, {{\textbf {0}}}),\) where
If the estimator \({\hat{\theta }}_{{\mathcal {S}},0}\) satisfies the conditions (A2a) and (A2b), then with high probability, \({\hat{\theta }}_{{\mathcal {S}},0}\) is the local optimal solution \({\hat{\theta }}\) to Equation (2.4).
We expand the score function using the mean value theorem as follows
where \({\hat{\Delta }} = ({\hat{\theta }}_{{\mathcal {S}},0} - \theta ^*)\), \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}_{{\mathcal {S}}}\) for some \(\alpha \in [0,1]\).
Thus, we write the equations (A2a) and (A2b) in block format with solution \({\hat{\theta }}_{{\mathcal {S}},0}\)
According to Lemma C.4, the sub-matrix \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\) is invertible with high probability. Thus, we obtain the difference block \(\Delta _{{\mathcal {S}}}\) by solving
Next, we show that the elements of the remainder vector \({\mathcal {R}}\) can be expanded as follows
with \({\tilde{\Delta }} = ({\tilde{\theta }} - \theta ^*) = (1-\alpha ) {\hat{\Delta }}\). Let \(\nabla _{kp} {\mathcal {H}}^* = {\partial ^3 \ell _{k}({\theta }^*;Y_k)}/{\partial \theta \partial \theta ^T \partial \theta _{kp} }\), where \(\nabla _{kp} {\mathcal {H}}^*\) is a \(Kp_n \times Kp_n\) matrix. By a derivation similar to that in the proof of Lemma C.2, all elements of \(\nabla _{kp} {\mathcal {H}}^*\) follow sub-exponential distributions. Thus, we show that for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\),
Step (i) follows from the sub-exponential condition on the elements of \(\nabla _{kp} {\mathcal {H}}^*\). For some small \(\delta\) and a universal constant C,
According to Assumption 2.2, \({\mathcal {W}}^* \ge {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| E(\nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ) \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\). Thus, \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\le ({\mathcal {W}}^*+\delta )\) with high probability. According to Theorem 1, \(\Vert {\hat{\Delta }}\Vert _2^2 \le 9\lambda _n^2\,s/(2\kappa _{-})^{2}\). This leads to the result in step (ii).
Combining the results above, we show that with a probability larger than \(1 - 2 p_n^{-d} - 4K\exp \{ - C^\prime s^{-2}n + \log (p_n) \}\),
for some constant \(C^\prime > 0.\) This implies \(\text {sign}({\hat{\theta }}_{{\mathcal {S}}})=\text {sign}(\theta ^*_{{\mathcal {S}}}).\)
Next, we show that \(\max _{p \in {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1,\) which satisfies the KKT conditions. The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) can be calculated from the block equation above,
The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) from (A4) can be decomposed into three components
The sub-differential can be grouped as \({\hat{z}}^{{(p)}}\) with \(p \in {\mathcal {S}}^c\).
Based on Lemmas 2.1 and 2.3, the following upper bound can be obtained with a probability at least \(1 - 2\exp \{- d\log (p_n)\} -4 \exp \{- C_0 s^{-3} \xi ^2 n + 2 \log (K p_n) \}\) for some constants \(d > 1\) and \(C_0 >0\),
For the remainder component, we have
Similarly, we show that the mixed norm of \({\mathcal {I}}_3\) can be bounded,
By adding the three components, we show that
Combining the results above, we have sign(\({\hat{\theta }}\)) \(=\) sign(\(\theta ^*\)) with a probability tending to 1. \(\square\)
Appendix B: Proofs of Lemmas in Sect. 2
This section provides the proofs of Lemmas 2.1, 2.2, 2.3, and 2.4.
1.1 B.1: Proof of Lemma 2.1
Proof
First, we need to analyze the distributional property of the random variable \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\). Lemma C.1 shows that
with some constant \({\mathcal {M}}_*\), and
This result can be used to bound the sub-exponential norm of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) by applying Minkowski’s inequality,
Furthermore, we can show that
This implies that \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) satisfies the sub-exponential property, such that with small \(\delta\),
Since \(\Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*) \Vert _{2,\infty }\) is the supremum of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) over \(p = 1, 2, \cdots , p_n\), we have
By combining all the results above, with \(\delta = {\mathcal {M}}_*\sqrt{2K(1+d)\log (p _n)/(\alpha n)}\) for some constant \(d > 1\), we show that with a probability at least \(1 - 2 p_n^{-d}\),
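To see how the stated choice of \(\delta\) yields this probability, one can combine a union bound over the \(p_n\) groups with a sub-Gaussian-regime tail of the form \(2\exp \{-\alpha n \delta ^2/(2K{\mathcal {M}}_*^2)\}\) (an assumed form, written here only because it is consistent with the constants above):
\[
P\Bigl ( \Vert \tfrac{1}{n}\nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty } \ge \delta \Bigr ) \le \sum _{p=1}^{p_n} P\Bigl ( \Vert \tfrac{1}{n}\nabla {\mathcal {L}}(\theta ^*)^{(p)}\Vert _2 \ge \delta \Bigr ) \le 2 p_n \exp \Bigl \{ -\frac{\alpha n \delta ^2}{2K{\mathcal {M}}_*^2} \Bigr \} = 2 p_n^{-d},
\]
where the last equality follows by substituting \(\delta = {\mathcal {M}}_*\sqrt{2K(1+d)\log (p_n)/(\alpha n)}\).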
In addition, Lemma C.1 shows that the score function satisfies the sub-exponential condition. Therefore, we have
which implies
with a probability at least \(1 - 2\exp \{-d\log (p_n)+\log (K)\}\) as claimed in (2.5).
The second part of Lemma 2.1 shows that the difference between the random Hessian and its expectation is bounded. When the tasks are modeled with canonical links and the response variables are from the exponential family, the Hessian matrix is deterministic, and \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*) = H(\theta ^*)\). For general cases, the entries of the random Hessian of the composite quasi-likelihood are
The component \({\mathcal {I}}_1\) is equal to the corresponding element of the sensitivity matrix \(H(\theta ^*)\). For some special link functions, the component \({\mathcal {I}}_3\) is equal to zero. For models in the general quasi-likelihood setting, the component \({\mathcal {I}}_3\) can be bounded by a universal constant \({\mathcal {K}}>0\) across all tasks, by a derivation similar to that of Lemma C.2. Based on Assumption 2.5, the variables \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^* )_{[kp,kp^{'}]} - H(\theta ^*)_{[kp,kp^{'}]}\) satisfy the sub-exponential condition with mean zero and \(\psi _1\) norm bounded by \({\mathcal {K}}{\mathcal {M}} < {\mathcal {M}}_*\) for some universal constant \({\mathcal {M}}_*\). Therefore, we have the following concentration result for the random Hessian matrix
for any \(k = 1, 2, \cdots , K\) and \(p, p^{'} = 1, 2, \cdots , p_n\).
\(\square\)
Corollary 3
Under Assumptions 2.3–2.7, if the penalty parameter is chosen as
then
with a probability at least \(1 - 2 \exp \{- d\log (p_n) \}\) for some constant d. The mixed \(\ell _{2,\infty }\) norm is defined in (1.1).
1.2 B.2: Proof of Lemma 2.2
Proof
Based on Lemma C.2, the Hessian matrix of the composite quasi-likelihood is given by
and there exist some positive constants \(\alpha _0, \alpha _1,\) and \(\alpha _2\), such that \(\alpha _0< f _1(\eta _{ki} ) < \alpha _1\) and \(\vert f _2(\eta _{ki} )\vert < \alpha _2.\)
When the parameters are partitioned into subsets for different tasks, the Hessian matrix is block diagonal. We show that the minimum and maximum eigenvalues of the Hessian matrix are given by
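Because the Hessian is block diagonal across tasks, its extreme eigenvalues are attained blockwise; this is the elementary fact used here, stated for reference:
\[
\lambda _{\min }\bigl (\text {diag}(H_1, \ldots , H_K)\bigr ) = \min _{1\le k\le K} \lambda _{\min }(H_k), \qquad \lambda _{\max }\bigl (\text {diag}(H_1, \ldots , H_K)\bigr ) = \max _{1\le k\le K} \lambda _{\max }(H_k).
\]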
We have
We apply Hölder’s inequality and get
Based on Assumption 2.5, \(\Vert x_{ki}\Vert _\infty \le L\) across all tasks; since \(\Vert u_{{\mathcal {J}}^c}\Vert _1 \le \gamma \Vert u_{\mathcal {J}}\Vert _1\) with \(\vert {\mathcal {J}}\vert \le m = c_0Ks\), we have
In addition, the variables \(y_{ki} - g_k^{-1}(\eta _{ki}^*)\) follow sub-exponential distributions based on Assumption 2.5. We obtain that with a probability at least \(1 - 2\exp \{-c\log (p_n)\}\) for some constant \(c = (\alpha _2 {\mathcal {M}})^{-2} >0\),
Therefore, there exists some \(\kappa _{-} < \alpha _0\rho _{-}.\) If the sample size is sufficiently large
then we obtain the lower bound for the minimum eigenvalue of the Hessian matrix
The upper bound can be obtained using a similar approach
Combining the results above, the random Hessian matrix satisfies the restricted eigenvalue condition with high probability.
\(\square\)
1.3 B.3: Proof of Lemma 2.3
Proof
The proof of Lemma 2.3 is analogous to that in Ravikumar et al. (2010). For simplicity, let the sub-matrix of the random Hessian be \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*)_{\mathcal{S}\mathcal{S}}= {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\), and let the difference of the matrices be denoted by \(\Delta H_{\mathcal{S}\mathcal{S}}^* = {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - H(\theta ^*)_{\mathcal{S}\mathcal{S}}\). Because the sub-matrices of the random Hessian are block diagonal, we show that
where the sub-matrix \(_k H_{\mathcal{S}\mathcal{S}}^* \in {\mathbb {R}}^{s\times s}\) represents the kth block in \(H_{\mathcal{S}\mathcal{S}}^*\). The difference between sub-matrices is denoted as \(\Delta _k H_{\mathcal{S}\mathcal{S}}^* = [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - _kH(\theta ^*)_{\mathcal{S}\mathcal{S}}]\).
We need to obtain a concentration result for the inverse matrix difference \([{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1}\). Since both matrices are block diagonal, Lemma C.6 gives \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } = \sup _k {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [_k{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }\), so that
In step (i), we apply the inequality between matrix norms and the Cauchy–Schwarz inequality. We have
Step (i) follows from derivations (C1) and (C3). This probability is exponentially small when \(n > c s^3 \log (p_n)\) for some constant c.
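For intuition, the bound on the inverse difference in step (i) is of the type obtained from the standard perturbation identity for invertible matrices (a sketch, not the paper's exact display):
\[
[{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} = [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1}\bigl ({H}^*_{\mathcal{S}\mathcal{S}} - {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\bigr )[{H}^*_{\mathcal{S}\mathcal{S}}]^{-1},
\]
so that, for any sub-multiplicative matrix norm, the norm of the inverse difference is controlled by the norms of the two inverses multiplied by the norm of \({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - {H}^*_{\mathcal{S}\mathcal{S}}\).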
We combine all the concentration results and obtain
We have the component \(\sqrt{K}{\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_1 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \le (1 - \xi )\) based on Assumption 2.7. For the second component \({\mathcal {I}}_2\), we apply Lemma C.3 to obtain that with a probability at least \(1 - 2 K \exp \{-\frac{\alpha \kappa _{-}^2\varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \}\),
Based on Lemmas C.3 and C.4, the concentration result of the component \({\mathcal {I}}_3\) and \({\mathcal {I}}_4\) can be obtained with a probability at least \(1 - 2K \exp \big \{- \frac{\alpha \kappa _{-}^4{\varepsilon ^{'}}^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \big \} - 2K \exp \big \{- \frac{\alpha \kappa _{-}^2 \varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + \log (s) + \log (p_n -s) \big \}\),
Setting \(\varepsilon \le \xi /(4\sqrt{K})\) and \(\varepsilon ^{'} \le \xi\) leads to
with a probability \(1 - 4 K\exp \big \{- C_0 \xi ^2 n/s^3 + 2 \log (p_n) \big \}\) for a universal constant \(C_0 > 0.\) \(\square\)
1.4 B.4: Proof of Lemma 2.4
Proof
The first-order partial derivative of the objective function can be expanded by applying the mean value theorem,
where \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}\) for some \(\alpha \in (0,1).\) This entails
Based on Lemma C.2, we can show that with a probability tending to 1,
Thus, we can construct the inequality from (6.7) as follows,
For the exact solution \({\hat{\theta }}\), all elements of \(\nabla Q({\hat{\theta }})\) are zero, so that the component \({\mathcal {I}}_1 =0\). The elements of the vector \(({\hat{\theta }} - \theta ^*) \in {\mathbb {R}}^{Kp_n}\) can be decomposed into two subsets \({{\mathcal {E}}}\) and \({{\mathcal {E}}^c}\). By applying Hölder’s inequality in Lemma C.5, the component \({{\mathcal {I}}_2}\) from equation (6.8) can be bounded from above as follows
By definition, if \({\hat{\theta }}^{(p)} \ne {{\textbf {0}}}\), then \({\hat{z}}^{(p)}= {\hat{\theta }}^{(p)} / \Vert {\hat{\theta }}^{(p)}\Vert _2\) and \(\Vert {\hat{z}}^{(p)}\Vert _2 = 1\); if \({\hat{\theta }}^{(p)} = {{\textbf {0}}}\), then \(\Vert {\hat{z}}^{(p)} \Vert _2 < 1\). Since \({\mathcal {S}} \cap {\mathcal {E}}^c = \emptyset\), we have \(\theta _{ {\mathcal {E}}^c }^* = {{\textbf {0}}}\). First, we decompose the term \({\mathcal {I}}_3\) into two subsets. In the subset \({\mathcal {E}}\),
In the complement set \({\mathcal {E}}^c\),
In step (i), we divide the estimator \({\hat{\theta }}_{{\mathcal {E}}^c}\) into nonzero and zero subsets. Therefore, the formulation in (ii) is identical to the definition of the mixed \(\ell _{2,1}\) norm of \({\hat{\theta }}_{{\mathcal {E}}^c}\).
From the derivations above, inequality (6.8) can be expanded as
Because \(\Vert n^{-1} \nabla {\mathcal {L}}({\theta }^*)\Vert _{2,\infty } \le \lambda _n/2\) with high probability, we have
In addition, if we plug in the upper bound \(\lambda _n/2\) for \(\Vert n^{-1} \nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty }\), we obtain
Based on the relation between the \(\ell _1\) norm and \(\ell _{2,1}\) norm, we can show that
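The relation invoked here is the elementary comparison between the \(\ell _1\) norm and the mixed \(\ell _{2,1}\) norm for groups of size \(K\), stated for reference:
\[
\Vert v^{(p)}\Vert _2 \le \Vert v^{(p)}\Vert _1 \le \sqrt{K}\,\Vert v^{(p)}\Vert _2 \quad \Longrightarrow \quad \Vert v\Vert _{2,1} \le \Vert v\Vert _1 \le \sqrt{K}\,\Vert v\Vert _{2,1}.
\]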
\(\square\)
Appendix C: Technical lemmas
Lemma C.1
Based on Assumptions 2.3–2.5, the individual score function satisfies the sub-exponential condition such that for some universal constant \({\mathcal {M}}_*\),
for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).
Proof
For each task, the quasi log-likelihood score function is given by
for \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).
From Assumptions 2.3–2.5, as the linear predictor \(\eta _{ki} < K_0\), the variance functions \(V(\eta _{ki})\) and the link functions \(g_k(\eta _{ki} )\) are well-defined and bounded. Thus, the second component \({\mathcal {I}}_2\) is bounded by some constant. In addition, the derivatives of the linear predictor are \({\partial \eta _{ki}/\partial \theta _{kp} } = x_{kpi},\) and \(\sup _{k,p,i}\{x_{kpi}\} \le L < \infty\). Thus, the component \({\mathcal {I}}_3\) is bounded by L.
Based on Assumption 2.5, \({\mathcal {I}}_1 = y_{ki} - g_k^{-1}(\eta _{ki}^*)\) is from a sub-exponential distribution with zero mean and \(\psi _1\) norm bounded above by \({\mathcal {M}}\). Let \({\mathcal {K}}_{ki} = {\mathcal {I}}_2\times {\mathcal {I}}_3.\) The individual score function is given by
where we have \({\mathcal {K}}_{ki}< {\mathcal {K}} < \infty\) for some universal constant \({\mathcal {K}}\) across all tasks. We obtain that the \(\psi _1\) norm of the individual score function is as follows
Based on the property of sub-exponential distribution (Wainwright, 2019),
the \(\psi _1\) norm of \(n^{-1/2} {\partial \ell _{k}(\theta ^*_k; Y_k) }/{\partial \theta _{kp}}\) can be bounded by some constant \({\mathcal {M}}_*\ge \mathcal{K}\mathcal{M}\), such that
\(\square\)
Lemma C.2
Based on Assumptions 2.3 and 2.5, let \(w_k = 1\). Then there exists some \({\tilde{r}}\) such that, for any \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\), the observed Hessian can be formulated as follows,
with \(\eta _{ki} = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}\) and \(\eta _{ki}^* = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}^*\), and the functions of linear predictors \(f _1(\eta _{ki})\) and \(f _2(\eta _{ki})\) are both bounded. Furthermore, the function \(f _1(\eta _{ki}) > 0\) for \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\).
Proof
First, the observed Hessian can be constructed as follows
Therefore, we can set
and by applying the approximation,
We can further show that
Based on Assumption 2.3, there exist some positive constants \(K_1\), \(K_2\), \(K_3\), \(K_4\), \(K_5\), and \(K_6\) such that
Since the variance function has a polynomial form in the mean, we have
Therefore, the function \(f _1(\eta _{ki})\) is bounded by
Therefore, as \(\Vert \theta - \theta ^*\Vert _1 \le r\) and \(\Vert x_{ki}\Vert _\infty \le L\), we can set \({\tilde{r}} = \min \{r, K^\prime K_3^2 \}\) with constant \(K^\prime = 1/(L K_2( K_3 + K_2^2 K_6/K_4 ))\), and we can show that \(0< f _1(\eta _{ki}) < \infty\). In addition, we can also set
which is a bounded function based on Assumption 2.3. \(\square\)
Lemma C.3
Under Assumptions 2.3–2.5, for some positive constants \(\alpha\) and \(\varepsilon\),
Proof
With the same notation as in the proof of Lemma 2.3, we can show that \(\Delta H_{\mathcal{S}\mathcal{S}}^* = \text {diag}( \Delta _k H_{\mathcal{S}\mathcal{S}}^*)_{k=1}^K\). For any \(\varepsilon > 0\),
In step (i), we apply the result of Lemma C.6. In step (ii), we apply the concentration result for the Hessian matrix from Lemma 2.1. Using the same method, we derive that
\(\square\)
Lemma C.4
Under Assumptions 2.3–2.6, there exist some positive constants \(\alpha\) and \(\varepsilon\) with \(\varepsilon < \kappa _{-}\),
Proof
With the same notation as in the proofs of Lemmas 2.3 and C.3, we denote the sub-matrix of the Hessian by \(_kH^*_\mathcal{S}\mathcal{S}\). Lemma 2.2 shows that with high probability, the eigenvalues of \(H(\theta ^*)\) are bounded and positive. Therefore, for any sub-matrix of the Hessian, we have
Based on the Courant–Fischer variational representation (Ravikumar et al., 2010), we have
where y is the unit-norm eigenvector of \({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\). Using condition C1, we can show that
Next, we have
As a result, we can show that with a probability at least \(1 - 2 K \exp \left\{ - \frac{\alpha \varepsilon ^2}{({\mathcal {M}}_*s)^2}n + 2 \log (s) \right\}\),
Furthermore, for \(\varepsilon < \kappa _{-}\) in C2, we set the constant \(\delta = \kappa _{-} - \varepsilon > 0\), such that
\(\square\)
Lemma C.5
Consider vectors u and \(v \in {\mathbb {R}}^{Kp_n}\) double-indexed as \(u = (u_{11}, \cdots , u_{kp}, \cdots , u_{Kp_n})\) and \(v = (v_{11}, \cdots , v_{kp}, \cdots , v_{Kp_n})\) for \(k = 1,2, \cdots , K\) and \(p = 1,2, \cdots , p_n\). Then
Proof
We apply Hölder’s inequality to show that
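A standard chain of inequalities consistent with this statement, grouping the coordinates by feature index \(p\) and applying the Cauchy–Schwarz inequality within each group, is
\[
\vert \langle u, v \rangle \vert \le \sum _{p=1}^{p_n} \vert \langle u^{(p)}, v^{(p)} \rangle \vert \le \sum _{p=1}^{p_n} \Vert u^{(p)}\Vert _2 \Vert v^{(p)}\Vert _2 \le \Bigl (\max _{1\le p\le p_n} \Vert v^{(p)}\Vert _2\Bigr ) \sum _{p=1}^{p_n} \Vert u^{(p)}\Vert _2 = \Vert u\Vert _{2,1}\,\Vert v\Vert _{2,\infty }.
\]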
\(\square\)
Lemma C.6
Suppose a matrix \(A \in {\mathbb {R}}^{Kd\times Kd}\) consists of diagonal blocks such that \(A = \text {diag}(A_k)_{k=1}^K\), and each block matrix has the same dimension that \(A_k \in {\mathbb {R}}^{d\times d}\). Then,
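The way this lemma is used in the proofs of Lemmas 2.3 and C.3 amounts to the following standard facts about block diagonal matrices, stated here for reference: when each \(A_k\) is invertible,
\[
A^{-1} = \text {diag}\bigl (A_k^{-1}\bigr )_{k=1}^{K}, \qquad {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{q} = \max _{1\le k\le K} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A_k \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{q}
\]
for any induced matrix \(q\)-norm.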
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhong, Y., Xu, W. & Gao, X. Heterogeneous multi-task feature learning with mixed \(\ell _{2,1}\) regularization. Mach Learn 113, 891–932 (2024). https://doi.org/10.1007/s10994-023-06410-0