Abstract
Data integration is the process of extracting information from multiple sources and jointly analyzing different data sets. In this paper, we propose to use a mixed \(\ell _{2,1}\)-regularized composite quasi-likelihood function to perform multi-task feature learning with different types of responses, including continuous and discrete responses. In high-dimensional settings, we establish sign-recovery consistency and estimation error bounds for the penalized estimates under regularity conditions. Simulation studies and real data analysis examples illustrate the utility of the proposed method in combining correlated platforms with heterogeneous tasks and performing joint sparse estimation.
Data availability
All data sets are available online, as stated in Sect. 5.
Code availability
The code is provided in the R package “HMTL” at https://CRAN.R-project.org/package=HMTL.
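For reference, the package can be installed from CRAN and loaded in the usual way; the short R snippet below only installs and attaches the package (the exported fitting and tuning functions are documented in the package manual and are not reproduced here).

```r
# Install the released version of HMTL from CRAN (run once).
install.packages("HMTL")

# Attach the package; see help(package = "HMTL") for the list of
# exported functions and their arguments.
library(HMTL)
```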
References
Agarwal, A., Negahban, S., & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2), 1171–1197.
Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(61), 1817–1853.
Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. In: Proceedings of the 19th international conference on neural information processing systems. MIT Press, Cambridge, MA, USA, NIPS’06, pp. 41–48.
Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(40), 1179–1225.
Bai, H., Zhong, Y., Gao, X., et al. (2020). Multivariate mixed response model with pairwise composite-likelihood method. Stats, 3(3), 203–220.
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and dantzig selector. The Annals of Statistics, 37(4), 1705–1732.
Cadenas, C., van de Sandt, L., Edlund, K., et al. (2014). Loss of circadian clock gene expression is associated with tumor progression in breast cancer. Cell Cycle, 13(20), 3282–3291. PMID: 25485508.
Cao, H., & Schwarz, E. (2022). RMTL: Regularized multi-task learning. https://CRAN.R-project.org/package=RMTL, R package version 0.9.9.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Cox, D. R., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3), 729–737.
U.S. Department of Health and Human Services. (2010). Community Health Status Indicators (CHSI) to Combat Obesity, Heart Disease and Cancer 1996–2003. Washington, D.C., USA: U.S. Department of Health and Human Services.
Ekvall, K. O., & Molstad, A. J. (2021). mmrr: Mixed-type multivariate response regression. R package version 0.1.
Ekvall, K. O., & Molstad, A. J. (2022). Mixed-type multivariate response regression with covariance estimation. Statistics in Medicine, 41(15), 2768–2785. https://doi.org/10.1002/sim.9383
Eldar, Y. C., Kuppinger, P., & Bolcskei, H. (2010). Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Transactions on Signal Processing, 58(6), 3042–3054.
Fang, E. X., Ning, Y., & Li, R. (2020). Test of significance for high-dimensional longitudinal data. The Annals of Statistics, 48(5), 2622–2645.
Fan, J., Liu, H., Sun, Q., et al. (2018). I-lamm for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46(2), 814–841.
Fan, J., Wang, W., & Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3), 1239–1266. https://doi.org/10.1214/20-aos1980
Gao, X., Zhong, Y., & Carroll, R. J. (2022). FusionLearn: Fusion Learning. https://CRAN.R-project.org/package=FusionLearn, R package version 0.2.1.
Gao, X., & Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251–272.
Gao, X., & Song, P. X. K. (2010). Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105(492), 1531–1540.
Gao, X., & Zhong, Y. (2019). Fusionlearn: a biomarker selection algorithm on cross-platform data. Bioinformatics, 35(21), 4465–4468.
Gaughan, L., Stockley, J., Coffey, K., et al. (2013). KDM4B is a master regulator of the estrogen receptor signalling cascade. Nucleic Acids Research, 41(14), 6892–6904. https://doi.org/10.1093/nar/gkt469
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4), 1208–1211.
Gomez-Cabrero, D., Abugessaisa, I., Maier, D., et al. (2014). Data integration in the era of omics: Current and future challenges. BMC Systems Biology, 8(2), I1.
Gong, P., Ye, J., & Zhang, C. (2013). Multi-stage multi-task feature learning. Journal of Machine Learning Research, 14(55), 2979–3010.
Hatzis, C., Pusztai, L., Valero, V., et al. (2011). A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA, 305(18), 1873–1881.
Hebiri, M., & van de Geer, S. (2011). The Smooth-Lasso and other \(\ell _1+\ell _2\)-penalized methods. Electronic Journal of Statistics, 5(none), 1184–1226.
Heimes, A. S., Härtner, F., Almstedt, K., et al. (2020). Prognostic significance of interferon-\(\gamma\) and its signaling pathway in early breast cancer depends on the molecular subtypes. International Journal of Molecular Sciences, 21(19).
Hellwig, B., Hengstler, J. G., Schmidt, M., et al. (2010). Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC Bioinformatics, 11(1), 276.
Itoh, M., Iwamoto, T., Matsuoka, J., et al. (2014). Estrogen receptor (er) mrna expression and molecular subtype distribution in er-negative/progesterone receptor-positive breast cancers. Breast Cancer Research and Treatment, 143(2), 403–409.
Ivshina, A. V., George, J., Senko, O., et al. (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Research, 66(21), 10292–10301.
Jalali, A., Sanghavi, S., Ruan, C., et al. (2010). A dirty model for multi-task learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, et al. (Eds.), Advances in neural information processing systems. (Vol. 23). Curran Associates Inc.
Kanomata, N., Kurebayashi, J., Koike, Y., et al. (2019). Cd1d-and pja2-related immune microenvironment differs between invasive breast carcinomas with and without a micropapillary feature. BMC Cancer, 19(1), 1–9.
Karn, T., Rody, A., Müller, V., et al. (2014). Control of dataset bias in combined affymetrix cohorts of triple negative breast cancer. Genomics Data, 2, 354–356.
Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80, 220–239.
Liu, J., Ji, S., & Ye, J. (2009). Multi-task feature learning via efficient \(l_{2,1}\)-norm minimization. In: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, Arlington, Virginia, USA, UAI ’09, pp. 339–348.
Liu, C. L., Cheng, S. P., Huang, W. C., et al. (2023). Aberrant expression of solute carrier family 35 member a2 correlates with tumor progression in breast cancer. In Vivo, 37(1), 262–269.
Liu, Q., Xu, Q., Zheng, V. W., et al. (2010). Multi-task learning for cross-platform sirna efficacy prediction: An in-silico study. BMC Bioinformatics, 11(1), 1–16.
Li, Y., Xu, W., & Gao, X. (2021). Graphical-model based high dimensional generalized linear models. Electronic Journal of Statistics, 15(1), 1993–2028.
Loh, P. L., & Wainwright, M. J. (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19), 559–616.
Loh, P. L., & Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6), 2455–2482.
Lounici, K., Pontil, M., van de Geer, S., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4), 2164–2204.
McCullagh, P., & Nelder, J. (1989). Generalized Linear Models, Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series (2nd ed.). London: Chapman & Hall.
Meinshausen, N., & Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., et al. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Negahban, S. N., & Wainwright, M. J. (2011). Simultaneous support recovery in high dimensions: Benefits and perils of block \(\ell _{1}/\ell _{\infty }\)-regularization. IEEE Transactions on Information Theory, 57(6), 3841–3863.
Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161.
Ning, Y., & Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195.
Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.
Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1), 1–47.
Ouyang, Y., Lu, W., Wang, Y., et al. (2023). Integrated analysis of mrna and extrachromosomal circular dna profiles to identify the potential mrna biomarkers in breast cancer. Gene, 857, 147174. https://doi.org/10.1016/j.gene.2023.147174
Poon, W. Y., & Lee, S. Y. (1987). Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients. Psychometrika, 52(3), 409–430.
Rakotomamonjy, A., Flamary, R., Gasso, G., et al. (2011). \(\ell _{p}-\ell _{q}\) penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Transactions on Neural Networks, 22(8), 1307–1320.
Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using \(\ell _1\)-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691
Rody, A., Karn, T., Liedtke, C., et al. (2011). A clinically relevant gene signature in triple negative and basal-like breast cancer. Breast Cancer Research, 13(5), R97.
Schmidt, M., Böhm, D., von Törne, C., et al. (2008). The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research, 68(13), 5405–5413.
Sethuraman, A., Brown, M., Krutilina, R., et al. (2018). Bhlhe40 confers a pro-survival and pro-metastatic phenotype to breast cancer cells by modulating hbegf secretion. Breast Cancer Research, 20, 1–17.
Škalamera, D., Dahmer-Heath, M., Stevenson, A. J., et al. (2016). Genome-wide gain-of-function screen for genes that induce epithelial-to-mesenchymal transition in breast cancer. Oncotarget, 7(38), 61000–61020. https://doi.org/10.18632/oncotarget.11314
Sun, Q., Zhou, W. X., & Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115(529), 254–265.
Tang, H., Sebti, S., Titone, R., et al. (2015). Decreased becn1 mrna expression in human breast cancer is associated with estrogen receptor-negative subtypes and poor prognosis. EBioMedicine, 2(3), 255–263.
Thung, K. H., & Wee, C. Y. (2018). A brief review on multi-task learning. Multimedia Tools and Applications, 77(22), 29705–29725.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
van de Geer, S. A., & Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3, 1360–1392.
van de Geer, S., Bühlmann, P., Ritov, Y., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.
van de Geer, S., & Müller, P. (2012). Quasi-likelihood and/or robust estimation in high dimensions. Statistical Sciences, 27(4), 469–480.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92(1), 1.
Wainwright, M.J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Wang, W., Liang, Y., & Xing, E. P. (2015). Collective support recovery for multi-design multi-response linear regression. IEEE Transactions on Information Theory, 61(1), 513–534.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the gauss-newton method. Biometrika, 61(3), 439–447.
Wigington, C. P., Morris, K. J., Newman, L. E., et al. (2016). The polyadenosine rna-binding protein, zinc finger cys3his protein 14 (zc3h14), regulates the pre-mrna processing of a key atp synthase subunit mrna*. Journal of Biological Chemistry, 291(43), 22442–22459. https://doi.org/10.1074/jbc.M116.754069
Wu, S., Gao, X., & Carroll, R.J. (2023). Model selection of generalized estimating equation with divergent model size. Statistica Sinica, pp. 1–22. https://doi.org/10.5705/ss.202020.0197
Yi, G. Y. (2014). Composite likelihood/pseudolikelihood (pp. 1–14). Wiley StatsRef: Statistics Reference Online.
Yi, G. Y. (2017). Statistical analysis with measurement error or misclassification: strategy, method and application. Berlin: Springer.
Yousefi, N., Lei, Y., Kloft, M., et al. (2018). Local rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38), 1–47.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1), 49–67.
Zhan, X. J., Wang, R., Kuang, X. R., et al. (2023). Elevated expression of myosin vi contributes to breast cancer progression via mapk/erk signaling pathway. Cellular Signalling, 110633.
Zhang, K., Gray, J. W., & Parvin, B. (2010). Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics, 26(12), i97–i105.
Zhang, H., Liu, D., Zhao, J., et al. (2018). Modeling hybrid traits for comorbidity and genetic studies of alcohol and nicotine co-dependence. The Annals of Applied Statistics, 12(4), 2359–2378. https://doi.org/10.1214/18-AOAS1156
Zhang, J. Z., Xu, W., & Hu, P. (2022). Tightly integrated multiomics-based deep tensor survival model for time-to-event prediction. Bioinformatics, 38(12), 3259–3266.
Zhang, Y., & Yang, Q. (2017). A survey on multi-task learning. CoRR, abs/1707.08114. arXiv:1707.08114
Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhong, Y., Xu, W., & Gao, X. (2023). HMTL: Heterogeneous Multi-Task Feature Learning. R package version 0.1.0.
Zhou, J., Yuan, L., Liu, J., et al. (2011). A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, KDD ’11, pp. 814–822.
Acknowledgements
X.G.’s research was supported by the Natural Sciences and Engineering Research Council of Canada funding. W.X. was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672). The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.
Funding
X.G.’s research was supported by the Natural Sciences and Engineering Research Council of Canada funding. W.X. was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672).
Author information
Authors and Affiliations
Contributions
All authors contributed to the design of the research problems and wrote the manuscript. Material preparation and the original draft were completed by YZ. Theoretical analysis and methodology development were conducted by XG and YZ. Data collection and analysis were performed by WX, XG, and YZ.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
We declare that all the authors have agreed on the submission of this paper to the Machine Learning journal.
Additional information
Editor: Jean-Philippe Vert.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proofs of Theorems in Sect. 2
This section provides the proofs of Theorems 1 and 2. In the following derivations, we assume that all tasks have identical sample sizes n.
1.1 A.1: Proof of Theorem 1
Proof
Lemma 2.4 shows that, for the solution \({\hat{\theta }} \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\) of the estimating equation (2.4), we have \(({\hat{\theta }} - \theta ^*) \in {\mathcal {C}}(m, \gamma )\) with \(m = c_0Ks\) and \(\gamma = 2\sqrt{K}+1\). Using the results from Lemma 2.2, we obtain the following inequality with probability tending to 1,
We apply Hölder’s inequality in Lemma C.5 to two components in (A1):
Plugging back into (A1), we have
According to Lemma 2.1, \(\Vert n^{-1}\nabla {\mathcal {L}}(\theta ^*) \Vert _{2,\infty } \le \lambda _n/2\) with a probability tending to 1. Therefore, the component \((\Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}^c} \Vert _{2,\infty } - \lambda _n/2 )\Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}^c} \Vert _1 \le 0.\) We simplify the inequality above as follows,
According to the property of mixed \(\ell _{2,\infty }\) norm, \(\Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty } = \sup _p\Vert {\hat{z}}^{(p)}\Vert _{2} = 1\). In addition, \(\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le \sqrt{\vert {{\mathcal {E}}} \vert }\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2}\) with \(\sqrt{\vert {{\mathcal {E}}} \vert } = c_1 s\) for some positive constant \(c_1\). The following inequality can be obtained
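For reference, the two facts used in this step are standard properties of the mixed norms, written here in the paper's grouping (the block \(v^{(p)}\) collects the \(K\) task-specific coefficients of the \(p\)th feature):
\[
\Vert v\Vert _{2,1} = \sum _{p=1}^{p_n} \Vert v^{(p)}\Vert _2, \qquad \Vert v\Vert _{2,\infty } = \max _{1\le p\le p_n} \Vert v^{(p)}\Vert _2,
\]
and, by the Cauchy–Schwarz inequality over the \(\vert {\mathcal {E}}\vert\) active groups,
\[
\Vert v_{{\mathcal {E}}}\Vert _{2,1} = \sum _{p\in {\mathcal {E}}} \Vert v^{(p)}\Vert _2 \le \sqrt{\vert {\mathcal {E}}\vert }\Bigl (\sum _{p\in {\mathcal {E}}} \Vert v^{(p)}\Vert _2^2\Bigr )^{1/2} = \sqrt{\vert {\mathcal {E}}\vert }\,\Vert v_{{\mathcal {E}}}\Vert _2 .
\]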
Therefore, taking the constant \(c_1 = 1\), we have
In addition, we derive the following error bounds based on Lemma 2.4:
\(\square\)
1.2 A.2: Proof of Theorem 2
Proof
The derivative equation (2.4) can be partitioned into two sets of equations based on the two sub-spaces of parameters \({\mathcal {S}}\) and \({\mathcal {S}}^c\):
Based on the definition of sub-differential, the sub-differential \({\hat{z}}_{{\mathcal {S}}}\) contains grouped subsets \({\hat{z}}^{(p)} = {\hat{\theta }}^{(p)}/\Vert {\hat{\theta }}^{(p)}\Vert _2\) with \(p \in {\mathcal {S}}\), and \(\max _{p \in {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1\).
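For completeness, the grouped sub-differential invoked here is the standard one for the mixed \(\ell _{2,1}\) norm,
\[
\partial \Vert \theta \Vert _{2,1} = \Bigl \{ z \in {\mathbb {R}}^{Kp_n} : z^{(p)} = \theta ^{(p)}/\Vert \theta ^{(p)}\Vert _2 \text { if } \theta ^{(p)} \ne {{\textbf {0}}}, \ \Vert z^{(p)}\Vert _2 \le 1 \text { if } \theta ^{(p)} = {{\textbf {0}}} \Bigr \};
\]
the strict inequality \(\max _{p \in {\mathcal {S}}^c} \Vert {\hat{z}}^{(p)}\Vert _2 < 1\) required in this proof is the strict dual feasibility condition, which is stronger than mere membership in the sub-differential.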
According to Lemma 2.2, \({\hat{\theta }}\) is a local optimum of the objective function with high probability. Consider an estimator with \({\hat{\theta }}_{{\mathcal {S}},0} = ( {\hat{\theta }}_{{\mathcal {S}}}, {{\textbf {0}}}),\) where
If the estimator \({\hat{\theta }}_{{\mathcal {S}},0}\) satisfies the conditions (A2a) and (A2b), then with high probability, \({\hat{\theta }}_{{\mathcal {S}},0}\) is the local optimal solution \({\hat{\theta }}\) to Equation (2.4).
We expand the score function using the mean value theorem as follows
where \({\hat{\Delta }} = ({\hat{\theta }}_{{\mathcal {S}},0} - \theta ^*)\), \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}_{{\mathcal {S}}}\) for some \(\alpha \in [0,1]\).
Thus, we write the equations (A2a) and (A2b) in block format with solution \({\hat{\theta }}_{{\mathcal {S}},0}\)
According to Lemma C.4, the sub-matrix \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\) is invertible with high probability. Thus, we obtain the difference block \(\Delta _{{\mathcal {S}}}\) by solving
Next, we show that the elements of the remainder vector \({\mathcal {R}}\) can be expanded as follows
with \({\tilde{\Delta }} = ({\tilde{\theta }} - \theta ^*) = (1-\alpha ) {\hat{\Delta }}\). Let \(\nabla _{kp} {\mathcal {H}}^* = {\partial ^3 \ell _{k}({\theta }^*;Y_k)}/{\partial \theta \partial \theta ^T \partial \theta _{kp} }\), where \(\nabla _{kp} {\mathcal {H}}^*\) is a \(Kp_n \times Kp_n\) matrix. By a derivation similar to that in the proof of Lemma C.2, all elements of \(\nabla _{kp} {\mathcal {H}}^*\) follow sub-exponential distributions. Thus, we show that for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\),
Step (i) follows from the sub-exponential condition on the elements of \(\nabla _{kp} {\mathcal {H}}^*\). For some small \(\delta\) and a universal constant C,
According to Assumption 2.2, \({\mathcal {W}}^* \ge {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| E(\nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ) \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\). Thus, \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\le ({\mathcal {W}}^*+\delta )\) with high probability. According to Theorem 1, \(\Vert {\hat{\Delta }}\Vert _2^2 \le 9\lambda _n^2\,s/(2\kappa _{-})^{2}\). This leads to the result in step (ii).
Combining the results above, we show that with a probability larger than \(1 - 2 p_n^{-d} - 4K\exp \{ - C^\prime s^{-2}n + \log (p_n) \}\),
for some constant \(C^\prime > 0.\) This implies \(\text {sign}({\hat{\theta }}_{{\mathcal {S}}})=\text {sign}(\theta ^*_{{\mathcal {S}}}).\)
Next, we show that \(\max _{p \in {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1,\) which satisfies the KKT conditions. The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) can be calculated from the block equation above,
The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) from (A4) can be decomposed into three components
The sub-differential can be grouped as \({\hat{z}}^{{(p)}}\) with \(p \in {\mathcal {S}}^c\).
Based on Lemmas 2.1 and 2.3, the following upper bound can be obtained with a probability at least \(1 - 2\exp \{- d\log (p_n)\} -4 \exp \{- C_0 s^{-3} \xi ^2 n + 2 \log (K p_n) \}\) for some constants \(d > 1\) and \(C_0 >0\),
For the remainder component, we have
Similarly, we show that the mixed norm of \({\mathcal {I}}_3\) can be bounded,
By adding the three components, we show that
Combining the results above, we have sign(\({\hat{\theta }}\)) \(=\) sign(\(\theta ^*\)) with a probability tending to 1. \(\square\)
Appendix B: Proofs of Lemmas in Sect. 2
This section provides the proofs of Lemmas 2.1, 2.2, 2.3, and 2.4.
1.1 B.1: Proof of Lemma 2.1
Proof
First, we need to analyze the distributional property of the random variable \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\). Lemma C.1 shows that
with some constant \({\mathcal {M}}_*\), and
This result can be used to bound the sub-exponential norm of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) by applying Minkowski’s inequality,
Furthermore, we can show that
This implies that \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) satisfies the sub-exponential property, such that with small \(\delta\),
Since \(\Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*) \Vert _{2,\infty }\) is the supremum of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) over \(p = 1, 2, \cdots , p_n\), we have
By combining all the results above, with \(\delta = {\mathcal {M}}_*\sqrt{2K(1+d)\log (p _n)/(\alpha n)}\) for some constant \(d > 1\), we show that with a probability at least \(1 - 2 p_n^{-d}\),
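To see how the stated choice of \(\delta\) yields this probability, one can combine a union bound over the \(p_n\) groups with a sub-Gaussian-regime tail of the form \(2\exp \{-\alpha n \delta ^2/(2K{\mathcal {M}}_*^2)\}\) (an assumed form, written here only because it is consistent with the constants above):
\[
P\Bigl ( \Vert \tfrac{1}{n}\nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty } \ge \delta \Bigr ) \le \sum _{p=1}^{p_n} P\Bigl ( \Vert \tfrac{1}{n}\nabla {\mathcal {L}}(\theta ^*)^{(p)}\Vert _2 \ge \delta \Bigr ) \le 2 p_n \exp \Bigl \{ -\frac{\alpha n \delta ^2}{2K{\mathcal {M}}_*^2} \Bigr \} = 2 p_n^{-d},
\]
where the last equality follows by substituting \(\delta = {\mathcal {M}}_*\sqrt{2K(1+d)\log (p_n)/(\alpha n)}\).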
In addition, Lemma C.1 shows that the score function satisfies the sub-exponential condition. Therefore, we have
which implies
with a probability at least \(1 - 2\exp \{-d\log (p_n)+\log (K)\}\) as claimed in (2.5).
The second part of Lemma 2.1 shows that the difference between the random Hessian and its expectation is bounded. When the tasks are modeled with canonical links and the response variables are from the exponential family, the Hessian matrix is deterministic, and \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*) = H(\theta ^*)\). For general cases, the entries of the random Hessian of the composite quasi-likelihood are
The component \({\mathcal {I}}_1\) is equal to the corresponding element of the sensitivity matrix \(H(\theta ^*)\). For some special link functions, the component \({\mathcal {I}}_3\) is equal to zero. For models in the general quasi-likelihood setting, the component \({\mathcal {I}}_3\) can be bounded by a universal constant \({\mathcal {K}}>0\) across all tasks, by a derivation similar to that of Lemma C.2. Based on Assumption 2.5, the variables \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^* )_{[kp,kp^{'}]} - H(\theta ^*)_{[kp,kp^{'}]}\) satisfy the sub-exponential condition with mean zero and \(\psi _1\) norm bounded by \({\mathcal {K}}{\mathcal {M}} < {\mathcal {M}}_*\) for some universal constant \({\mathcal {M}}_*\). Therefore, we have the following concentration result for the random Hessian matrix
for any \(k = 1, 2, \cdots , K\) and \(p, p^{'} = 1, 2, \cdots , p_n\).
\(\square\)
Corollary 3
Under Assumptions 2.3–2.7, if the penalty parameter is chosen as
then
with a probability at least \(1 - 2 \exp \{- d\log (p_n) \}\) for some constant d. The mixed \(\ell _{2,\infty }\) norm is defined in (1.1).
1.2 B.2: Proof of Lemma 2.2
Proof
Based on Lemma C.2, the Hessian matrix of the composite quasi-likelihood is given by
and there exist some positive constants \(\alpha _0, \alpha _1,\) and \(\alpha _2\), such that \(\alpha _0< f _1(\eta _{ki} ) < \alpha _1\) and \(\vert f _2(\eta _{ki} )\vert < \alpha _2.\)
When the parameters are partitioned into subsets for different tasks, the Hessian matrix is block diagonal. We show that the minimum and maximum eigenvalues of the Hessian matrix are given by
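Because the Hessian is block diagonal across tasks, its extreme eigenvalues are attained blockwise; this is the elementary fact used here, stated for reference:
\[
\lambda _{\min }\bigl (\text {diag}(H_1, \ldots , H_K)\bigr ) = \min _{1\le k\le K} \lambda _{\min }(H_k), \qquad \lambda _{\max }\bigl (\text {diag}(H_1, \ldots , H_K)\bigr ) = \max _{1\le k\le K} \lambda _{\max }(H_k).
\]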
We have
We apply Hölder’s inequality and get
Based on Assumption 2.5, \(\Vert x_{ki}\Vert _\infty \le L\) across all tasks; since \(\Vert u_{{\mathcal {J}}^c}\Vert _1 \le \gamma \Vert u_{\mathcal {J}}\Vert _1\) with \(\vert {\mathcal {J}}\vert \le m = c_0Ks\), we have
In addition, the variables \(y_{ki} - g_k^{-1}(\eta _{ki}^*)\) follow sub-exponential distributions based on Assumption 2.5. We obtain that with a probability at least \(1 - 2\exp \{-c\log (p_n)\}\) for some constant \(c = (\alpha _2 {\mathcal {M}})^{-2} >0\),
Therefore, there exists some \(\kappa _{-} < \alpha _0\rho _{-}.\) If the sample size is sufficiently large
then we obtain the lower bound for the minimum eigenvalue of the Hessian matrix
The upper bound can be obtained using a similar approach
Combining the results above, the random Hessian matrix satisfies the restricted eigenvalue condition with high probability.
\(\square\)
1.3 B.3: Proof of Lemma 2.3
Proof
The proof of Lemma 2.3 is analogous to that in Ravikumar et al. (2010). For simplicity, let the sub-matrix of the random Hessian be \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*)_{\mathcal{S}\mathcal{S}}= {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\), and let the difference of the matrices be denoted by \(\Delta H_{\mathcal{S}\mathcal{S}}^* = {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - H(\theta ^*)_{\mathcal{S}\mathcal{S}}\). Because the sub-matrices of the random Hessian are block diagonal, we show that
where the sub-matrix \(_k H_{\mathcal{S}\mathcal{S}}^* \in {\mathbb {R}}^{s\times s}\) represents the kth block in \(H_{\mathcal{S}\mathcal{S}}^*\). The difference between sub-matrices is denoted as \(\Delta _k H_{\mathcal{S}\mathcal{S}}^* = [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - _kH(\theta ^*)_{\mathcal{S}\mathcal{S}}]\).
We need to obtain a concentration result for the inverse matrix difference \([{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1}\). Since both matrices are block diagonal, Lemma C.6 gives \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } = \sup _k {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [_k{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }\), so that
In step (i), we apply the inequality between matrix norms and the Cauchy–Schwarz inequality. We have
Step (i) follows from derivations (C1) and (C3). This probability is exponentially small when \(n > c s^3 \log (p_n)\) for some constant c.
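For intuition, the bound on the inverse difference in step (i) is of the type obtained from the standard perturbation identity for invertible matrices (a sketch, not the paper's exact display):
\[
[{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} = [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1}\bigl ({H}^*_{\mathcal{S}\mathcal{S}} - {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\bigr )[{H}^*_{\mathcal{S}\mathcal{S}}]^{-1},
\]
so that, for any sub-multiplicative matrix norm, the norm of the inverse difference is controlled by the norms of the two inverses multiplied by the norm of \({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - {H}^*_{\mathcal{S}\mathcal{S}}\).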
We combine all the concentration results and obtain
We have the component \(\sqrt{K}{\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_1 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \le (1 - \xi )\) based on Assumption 2.7. For the second component \({\mathcal {I}}_2\), we apply Lemma C.3 to obtain that with a probability at least \(1 - 2 K \exp \{-\frac{\alpha \kappa _{-}^2\varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \}\),
Based on Lemmas C.3 and C.4, the concentration result of the component \({\mathcal {I}}_3\) and \({\mathcal {I}}_4\) can be obtained with a probability at least \(1 - 2K \exp \big \{- \frac{\alpha \kappa _{-}^4{\varepsilon ^{'}}^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \big \} - 2K \exp \big \{- \frac{\alpha \kappa _{-}^2 \varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + \log (s) + \log (p_n -s) \big \}\),
Setting \(\varepsilon \le \xi /(4\sqrt{K})\) and \(\varepsilon ^{'} \le \xi\) leads to
with a probability \(1 - 4 K\exp \big \{- C_0 \xi ^2 n/s^3 + 2 \log (p_n) \big \}\) for a universal constant \(C_0 > 0.\) \(\square\)
1.4 B.4: Proof of Lemma 2.4
Proof
The first-order partial derivative of the objective function can be expanded by applying the mean value theorem,
where \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}\) for some \(\alpha \in (0,1).\) This entails
Based on Lemma C.2, we can show that with a probability tending to 1,
Thus, we can construct the inequality from (6.7) as follows,
For the exact solution \({\hat{\theta }}\), all elements of \(\nabla Q({\hat{\theta }})\) are zero, so that the component \({\mathcal {I}}_1 =0\). The elements of the vector \(({\hat{\theta }} - \theta ^*) \in {\mathbb {R}}^{Kp_n}\) can be decomposed into two subsets \({{\mathcal {E}}}\) and \({{\mathcal {E}}^c}\). By applying Hölder’s inequality in Lemma C.5, the component \({{\mathcal {I}}_2}\) from equation (6.8) can be bounded from above as follows
By definition, if \({\hat{\theta }}^{(p)} \ne {{\textbf {0}}}\), then \({\hat{z}}^{(p)}= {\hat{\theta }}^{(p)} / \Vert {\hat{\theta }}^{(p)}\Vert _2\) and \(\Vert {\hat{z}}^{(p)}\Vert _2 = 1\); if \({\hat{\theta }}^{(p)} = {{\textbf {0}}}\), then \(\Vert {\hat{z}}^{(p)} \Vert _2 < 1\). Since \({\mathcal {S}} \cap {\mathcal {E}}^c = \emptyset\), we have \(\theta _{ {\mathcal {E}}^c }^* = {{\textbf {0}}}\). First, we decompose the term \({\mathcal {I}}_3\) into two subsets. In the subset \({\mathcal {E}}\),
In the complement set \({\mathcal {E}}^c\),
In step (i), we divide the estimator \({\hat{\theta }}_{{\mathcal {E}}^c}\) into nonzero and zero subsets. Therefore, the formulation in (ii) is identical to the definition of the mixed \(\ell _{2,1}\) norm of \({\hat{\theta }}_{{\mathcal {E}}^c}\).
From the derivations above, inequality (6.8) can be expanded as
Because \(\Vert n^{-1} \nabla {\mathcal {L}}({\theta }^*)\Vert _{2,\infty } \le \lambda _n/2\) with high probability, we have
In addition, if we plug in the upper bound \(\lambda _n/2\) for \(\Vert n^{-1} \nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty }\), we obtain
Based on the relation between the \(\ell _1\) norm and \(\ell _{2,1}\) norm, we can show that
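The relation invoked here is the elementary comparison between the \(\ell _1\) norm and the mixed \(\ell _{2,1}\) norm for groups of size \(K\), stated for reference:
\[
\Vert v^{(p)}\Vert _2 \le \Vert v^{(p)}\Vert _1 \le \sqrt{K}\,\Vert v^{(p)}\Vert _2 \quad \Longrightarrow \quad \Vert v\Vert _{2,1} \le \Vert v\Vert _1 \le \sqrt{K}\,\Vert v\Vert _{2,1}.
\]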
\(\square\)
Appendix C: Technical lemmas
Lemma C.1
Based on Assumptions 2.3–2.5, the individual score function satisfies the sub-exponential condition such that for some universal constant \({\mathcal {M}}_*\),
for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).
Proof
For each task, the quasi log-likelihood score function is given by
for \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).
From Assumptions 2.3–2.5, as the linear predictor \(\eta _{ki} < K_0\), the variance functions \(V(\eta _{ki})\) and the link functions \(g_k(\eta _{ki} )\) are well-defined and bounded. Thus, the second component \({\mathcal {I}}_2\) is bounded by some constant. In addition, the derivatives of the linear predictor are \({\partial \eta _{ki}/\partial \theta _{kp} } = x_{kpi},\) and \(\sup _{k,p,i}\{x_{kpi}\} \le L < \infty\). Thus, the component \({\mathcal {I}}_3\) is bounded by L.
Based on Assumption 2.5, \({\mathcal {I}}_1 = y_{ki} - g_k^{-1}(\eta _{ki}^*)\) is from a sub-exponential distribution with zero mean and \(\psi _1\) norm bounded above by \({\mathcal {M}}\). Let \({\mathcal {K}}_{ki} = {\mathcal {I}}_2\times {\mathcal {I}}_3.\) The individual score function is given by
where we have \({\mathcal {K}}_{ki}< {\mathcal {K}} < \infty\) for some universal constant \({\mathcal {K}}\) across all tasks. We obtain that the \(\psi _1\) norm of the individual score function is as follows
Based on the property of sub-exponential distribution (Wainwright, 2019),
the \(\psi _1\) norm of \(n^{-1/2} {\partial \ell _{k}(\theta ^*_k; Y_k) }/{\partial \theta _{kp}}\) can be bounded by some constant \({\mathcal {M}}_*\ge \mathcal{K}\mathcal{M}\), such that
\(\square\)
Lemma C.2
Based on Assumptions 2.3 and 2.5, let \(w_k = 1\). Then there exists some \({\tilde{r}}\) such that, for any \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\), the observed Hessian can be formulated as follows,
with \(\eta _{ki} = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}\) and \(\eta _{ki}^* = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}^*\), and the functions of linear predictors \(f _1(\eta _{ki})\) and \(f _2(\eta _{ki})\) are both bounded. Furthermore, the function \(f _1(\eta _{ki}) > 0\) for \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\).
Proof
First, the observed Hessian can be constructed as follows
Therefore, we can set
and by applying the approximation,
We can further show that
Based on Assumption 2.3, there exist some positive constants \(K_1\), \(K_2\), \(K_3\), \(K_4\), \(K_5\), and \(K_6\) such that
Since the variance function has a polynomial form in the mean, we have
Therefore, the function \(f _1(\eta _{ki})\) is bounded by
Therefore, as \(\Vert \theta - \theta ^*\Vert _1 \le r\) and \(\Vert x_{ki}\Vert _\infty \le L\), we can set \({\tilde{r}} = \min \{r, K^\prime K_3^2 \}\) with constant \(K^\prime = 1/(L K_2( K_3 + K_2^2 K_6/K_4 ))\), and we can show that \(0< f _1(\eta _{ki}) < \infty\). In addition, we can also set
which is a bounded function based on Assumption 2.3. \(\square\)
Lemma C.3
Under Assumptions 2.3–2.5, for some positive constants \(\alpha\) and \(\varepsilon\),
Proof
With the same notation as in the proof of Lemma 2.3, we can show that \(\Delta H_{\mathcal{S}\mathcal{S}}^* = \text {diag}( \Delta _k H_{\mathcal{S}\mathcal{S}}^*)_{k=1}^K\). For any \(\varepsilon > 0\),
In step (i), we apply the result of Lemma C.6. In step (ii), we apply the concentration result for the Hessian matrix from Lemma 2.1. Using the same method, we derive that
\(\square\)
Lemma C.4
Under Assumptions 2.3–2.6, there exist some positive constants \(\alpha\) and \(\varepsilon\) with \(\varepsilon < \kappa _{-}\),
Proof
With the same notation as in the proofs of Lemmas 2.3 and C.3, we denote the sub-matrix of the Hessian by \(_kH^*_\mathcal{S}\mathcal{S}\). Lemma 2.2 shows that with high probability, the eigenvalues of \(H(\theta ^*)\) are bounded and positive. Therefore, for any sub-matrix of the Hessian, we have
Based on the Courant–Fischer variational representation (Ravikumar et al., 2010), we have
where y is the unit-norm eigenvector of \({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\). Using condition C1, we can show that
Next, we have
As a result, we can show that with a probability at least \(1 - 2 K \exp \left\{ - \frac{\alpha \varepsilon ^2}{({\mathcal {M}}_*s)^2}n + 2 \log (s) \right\}\),
Furthermore, for \(\varepsilon < \kappa _{-}\) in C2, we set the constant \(\delta = \kappa _{-} - \varepsilon > 0\), such that
\(\square\)
Lemma C.5
Consider vectors u and \(v \in {\mathbb {R}}^{Kp_n}\) double-indexed as \(u = (u_{11}, \cdots , u_{kp}, \cdots , u_{Kp_n})\) and \(v = (v_{11}, \cdots , v_{kp}, \cdots , v_{Kp_n})\) for \(k = 1,2, \cdots , K\) and \(p = 1,2, \cdots , p_n\). Then
Proof
We apply Hölder’s inequality to show that
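A standard chain of inequalities consistent with this statement, grouping the coordinates by feature index \(p\) and applying the Cauchy–Schwarz inequality within each group, is
\[
\vert \langle u, v \rangle \vert \le \sum _{p=1}^{p_n} \vert \langle u^{(p)}, v^{(p)} \rangle \vert \le \sum _{p=1}^{p_n} \Vert u^{(p)}\Vert _2 \Vert v^{(p)}\Vert _2 \le \Bigl (\max _{1\le p\le p_n} \Vert v^{(p)}\Vert _2\Bigr ) \sum _{p=1}^{p_n} \Vert u^{(p)}\Vert _2 = \Vert u\Vert _{2,1}\,\Vert v\Vert _{2,\infty }.
\]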
\(\square\)
Lemma C.6
Suppose a matrix \(A \in {\mathbb {R}}^{Kd\times Kd}\) consists of diagonal blocks such that \(A = \text {diag}(A_k)_{k=1}^K\), and each block matrix has the same dimension that \(A_k \in {\mathbb {R}}^{d\times d}\). Then,
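The way this lemma is used in the proofs of Lemmas 2.3 and C.3 amounts to the following standard facts about block diagonal matrices, stated here for reference: when each \(A_k\) is invertible,
\[
A^{-1} = \text {diag}\bigl (A_k^{-1}\bigr )_{k=1}^{K}, \qquad {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{q} = \max _{1\le k\le K} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A_k \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{q}
\]
for any induced matrix \(q\)-norm.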
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhong, Y., Xu, W. & Gao, X. Heterogeneous multi-task feature learning with mixed \(\ell _{2,1}\) regularization. Mach Learn 113, 891–932 (2024). https://doi.org/10.1007/s10994-023-06410-0