Heterogeneous multi-task feature learning with mixed \(\ell _{2,1}\) regularization

Published in: Machine Learning

Abstract

Data integration is the process of extracting information from multiple sources and jointly analyzing different data sets. In this paper, we propose to use the mixed \(\ell _{2,1}\) regularized composite quasi-likelihood function to perform multi-task feature learning with different types of responses, including continuous and discrete responses. For high dimensional settings, our result establishes the sign recovery consistency and estimation error bounds of the penalized estimates under regularity conditions. Simulation studies and real data analysis examples are provided to illustrate the utility of the proposed method to combine correlated platforms with heterogeneous tasks and perform joint sparse estimation.
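
As a concrete illustration of the penalty used throughout the paper, the short R sketch below computes the mixed \(\ell _{2,1}\) norm of a coefficient matrix whose columns group, for each feature, the coefficients across the K tasks. The matrix and its dimensions are illustrative assumptions only, not part of the proposed method or the HMTL package.

# Mixed l_{2,1} norm: sum over features of the l_2 norm of each feature's
# coefficient group across the K tasks (rows = tasks, columns = features).
mixed_l21_norm <- function(Theta) {
  sum(apply(Theta, 2, function(theta_p) sqrt(sum(theta_p^2))))
}

# Illustrative example with K = 3 tasks and p_n = 5 features.
set.seed(1)
Theta <- matrix(rnorm(3 * 5), nrow = 3, ncol = 5)
mixed_l21_norm(Theta)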

Data availability

All data sets are publicly available online, as stated in Sect. 5.

Code availability

The code is provided in the R package “HMTL” at https://CRAN.R-project.org/package=HMTL.
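
For readers who want to try the method, the package can be installed and loaded from CRAN with standard R commands. The snippet below is a minimal sketch that only installs the package, loads it, and opens its documentation index; the names of the fitting functions are documented there and are not assumed here.

# Install the HMTL package from CRAN, load it, and browse its documentation.
install.packages("HMTL")
library(HMTL)
help(package = "HMTL")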

References

  • Agarwal, A., Negahban, S., & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2), 1171–1197.

  • Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(61), 1817–1853.

  • Argyriou, A., Evgeniou, T., & Pontil, M. (2006). Multi-task feature learning. In: Proceedings of the 19th international conference on neural information processing systems. MIT Press, Cambridge, MA, USA, NIPS’06, pp. 41–48.

  • Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(40), 1179–1225.

  • Bai, H., Zhong, Y., Gao, X., et al. (2020). Multivariate mixed response model with pairwise composite-likelihood method. Stats, 3(3), 203–220.

  • Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.

  • Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and dantzig selector. The Annals of Statistics, 37(4), 1705–1732.

  • Cadenas, C., van de Sandt, L., Edlund, K., et al. (2014). Loss of circadian clock gene expression is associated with tumor progression in breast cancer. Cell Cycle, 13(20), 3282–3291. PMID: 25485508.

  • Cao, H., & Schwarz, E. (2022). RMTL: Regularized multi-task learning. https://CRAN.R-project.org/package=RMTL, R package version 0.9.9.

  • Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.

  • Cox, D. R., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3), 729–737.

  • U.S. Department of Health and Human Services. (2010). Community Health Status Indicators (CHSI) to Combat Obesity, Heart Disease and Cancer 1996–2003. Washington, D.C., USA: U.S. Department of Health and Human Services.

  • Ekvall, K.O., & Molstad, A.J. (2021). mmrr: Mixed-type multivariate response regression. R package version 0.1.

  • Ekvall, K. O., & Molstad, A. J. (2022). Mixed-type multivariate response regression with covariance estimation. Statistics in Medicine,41(15), 2768–2785. https://doi.org/10.1002/sim.9383, onlinelibrary.wiley.com/doi/abs/10.1002/sim.9383.

  • Eldar, Y. C., Kuppinger, P., & Bolcskei, H. (2010). Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Transactions on Signal Processing, 58(6), 3042–3054.

  • Fang, E. X., Ning, Y., & Li, R. (2020). Test of significance for high-dimensional longitudinal data. The Annals of Statistics, 48(5), 2622–2645.

  • Fan, J., Liu, H., Sun, Q., et al. (2018). I-lamm for sparse learning: Simultaneous control of algorithmic complexity and statistical error. The Annals of Statistics, 46(2), 814–841.

  • Fan, J., Wang, W., & Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of statistics, 49(3), 1239–1266. https://doi.org/10.1214/20-aos1980

  • Gao, X., Zhong, Y., & Carroll, R. J. (2022). FusionLearn: Fusion Learning. https://CRAN.R-project.org/package=FusionLearn, R package version 0.2.1.

  • Gao, X., & Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251–272.

  • Gao, X., & Song, P. X. K. (2010). Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105(492), 1531–1540.

  • Gao, X., & Zhong, Y. (2019). Fusionlearn: a biomarker selection algorithm on cross-platform data. Bioinformatics, 35(21), 4465–4468.

  • Gaughan, L., Stockley, J., Coffey, K., et al. (2013). KDM4B is a master regulator of the estrogen receptor signalling cascade. Nucleic Acids Research, 41(14), 6892–6904. https://doi.org/10.1093/nar/gkt469

  • Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. The Annals of Mathematical Statistics, 31(4), 1208–1211.

  • Gomez-Cabrero, D., Abugessaisa, I., Maier, D., et al. (2014). Data integration in the era of omics: Current and future challenges. BMC Systems Biology, 8(2), I1.

  • Gong, P., Ye, J., & Zhang, C. (2013). Multi-stage multi-task feature learning. Journal of Machine Learning Research, 14(55), 2979–3010.

  • Hatzis, C., Pusztai, L., Valero, V., et al. (2011). A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA, 305(18), 1873–1881.

  • Hebiri, M., & van de Geer, S. (2011). The Smooth-Lasso and other \(\ell _1+\ell _2\)-penalized methods. Electronic Journal of Statistics, 5(none), 1184–1226.

  • Heimes, A. S., Härtner, F., Almstedt, K., et al. (2020). Prognostic significance of interferon-\(\gamma\) and its signaling pathway in early breast cancer depends on the molecular subtypes. International Journal of Molecular Sciences,21(19).

  • Hellwig, B., Hengstler, J. G., Schmidt, M., et al. (2010). Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC Bioinformatics, 11(1), 276.

  • Itoh, M., Iwamoto, T., Matsuoka, J., et al. (2014). Estrogen receptor (er) mrna expression and molecular subtype distribution in er-negative/progesterone receptor-positive breast cancers. Breast Cancer Research and Treatment, 143(2), 403–409.

  • Ivshina, A. V., George, J., Senko, O., et al. (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Research, 66(21), 10292–10301.

  • Jalali, A., Sanghavi, S., Ruan, C., et al. (2010). A dirty model for multi-task learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, et al. (Eds.), Advances in neural information processing systems. (Vol. 23). Curran Associates Inc.

  • Kanomata, N., Kurebayashi, J., Koike, Y., et al. (2019). Cd1d-and pja2-related immune microenvironment differs between invasive breast carcinomas with and without a micropapillary feature. BMC Cancer, 19(1), 1–9.

  • Karn, T., Rody, A., Müller, V., et al. (2014). Control of dataset bias in combined affymetrix cohorts of triple negative breast cancer. Genomics Data, 2, 354–356.

  • Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80, 220–239.

  • Liu, J., Ji, S., & Ye, J. (2009). Multi-task feature learning via efficient \(l_{2,1}\)-norm minimization. In: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, Arlington, Virginia, USA, UAI ’09, pp. 339–348.

  • Liu, C. L., Cheng, S. P., Huang, W. C., et al. (2023). Aberrant expression of solute carrier family 35 member a2 correlates with tumor progression in breast cancer. In Vivo, 37(1), 262–269.

  • Liu, Q., Xu, Q., Zheng, V. W., et al. (2010). Multi-task learning for cross-platform sirna efficacy prediction: An in-silico study. BMC Bioinformatics, 11(1), 1–16.

  • Li, Y., Xu, W., & Gao, X. (2021). Graphical-model based high dimensional generalized linear models. Electronic Journal of Statistics, 15(1), 1993–2028.

  • Loh, P. L., & Wainwright, M. J. (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16(19), 559–616.

  • Loh, P. L., & Wainwright, M. J. (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6), 2455–2482.

  • Lounici, K., Pontil, M., van de Geer, S., et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4), 2164–2204.

  • McCullagh, P., & Nelder, J. (1989). Generalized Linear Models, Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series (2nd ed.). London: Chapman & Hall.

  • Meinshausen, N., & Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270.

  • Negahban, S. N., Ravikumar, P., Wainwright, M. J., et al. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.

  • Negahban, S. N., & Wainwright, M. J. (2011). Simultaneous support recovery in high dimensions: Benefits and perils of block \(\ell _{1}/\ell _{\infty }\)-regularization. IEEE Transactions on Information Theory, 57(6), 3841–3863.

  • Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161.

  • Ning, Y., & Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195.

  • Obozinski, G., Taskar, B., & Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 231–252.

  • Obozinski, G., Wainwright, M. J., & Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1), 1–47.

  • Ouyang, Y., Lu, W., Wang, Y., et al. (2023). Integrated analysis of mrna and extrachromosomal circular dna profiles to identify the potential mrna biomarkers in breast cancer. Gene, 857, 147174. https://doi.org/10.1016/j.gene.2023.147174

  • Poon, W. Y., & Lee, S. Y. (1987). Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients. Psychometrika, 52(3), 409–430.

  • Rakotomamonjy, A., Flamary, R., Gasso, G., et al. (2011). \(\ell _{p}-\ell _{q}\) penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Transactions on Neural Networks, 22(8), 1307–1320.

  • Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using \(\ell _1\)-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319. https://doi.org/10.1214/09-AOS691

  • Rody, A., Karn, T., Liedtke, C., et al. (2011). A clinically relevant gene signature in triple negative and basal-like breast cancer. Breast Cancer Research, 13(5), R97.

  • Schmidt, M., Böhm, D., von Törne, C., et al. (2008). The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research, 68(13), 5405–5413.

  • Sethuraman, A., Brown, M., Krutilina, R., et al. (2018). Bhlhe40 confers a pro-survival and pro-metastatic phenotype to breast cancer cells by modulating hbegf secretion. Breast Cancer Research, 20, 1–17.

  • Škalamera, D., Dahmer-Heath, M., Stevenson, A. J., et al. (2016). Genome-wide gain-of-function screen for genes that induce epithelial-to-mesenchymal transition in breast cancer. Oncotarget, 7(38), 61000–61020. https://doi.org/10.18632/oncotarget.11314

  • Sun, Q., Zhou, W. X., & Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115(529), 254–265.

  • Tang, H., Sebti, S., Titone, R., et al. (2015). Decreased becn1 mrna expression in human breast cancer is associated with estrogen receptor-negative subtypes and poor prognosis. EBioMedicine, 2(3), 255–263.

  • Thung, K. H., & Wee, C. Y. (2018). A brief review on multi-task learning. Multimedia Tools and Applications, 77(22), 29705–29725.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

  • van de Geer, S. A., & Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3, 1360–1392.

  • van de Geer, S., Bühlmann, P., Ritov, Y., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202.

  • van de Geer, S., & Müller, P. (2012). Quasi-likelihood and/or robust estimation in high dimensions. Statistical Sciences, 27(4), 469–480.

  • Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92(1), 1.

  • Wainwright, M.J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

  • Wang, W., Liang, Y., & Xing, E. P. (2015). Collective support recovery for multi-design multi-response linear regression. IEEE Transactions on Information Theory, 61(1), 513–534.

  • Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the gauss-newton method. Biometrika, 61(3), 439–447.

  • Wigington, C. P., Morris, K. J., Newman, L. E., et al. (2016). The polyadenosine rna-binding protein, zinc finger cys3his protein 14 (zc3h14), regulates the pre-mrna processing of a key atp synthase subunit mrna*. Journal of Biological Chemistry, 291(43), 22442–22459. https://doi.org/10.1074/jbc.M116.754069

  • Wu, S., Gao, X., & Carroll, R.J. (2023). Model selection of generalized estimating equation with divergent model size. Statistica Sinica, pp. 1–22. https://doi.org/10.5705/ss.202020.0197

  • Yi, G. Y. (2014). Composite likelihood/pseudolikelihood (pp. 1–14). Wiley StatsRef: Statistics Reference Online.

  • Yi, G. Y. (2017). Statistical analysis with measurement error or misclassification: strategy, method and application. Berlin: Springer.

  • Yousefi, N., Lei, Y., Kloft, M., et al. (2018). Local rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38), 1–47.

  • Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1), 49–67.

  • Zhan, X.J., Wang, R., & Kuang, X.R., et al. (2023). Elevated expression of myosin vi contributes to breast cancer progression via mapk/erk signaling pathway. Cellular Signalling, p. 110633.

  • Zhang, K., Gray, J. W., & Parvin, B. (2010). Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics, 26(12), i97–i105.

  • Zhang, H., Liu, D., Zhao, J., et al. (2018). Modeling hybrid traits for comorbidity and genetic studies of alcohol and nicotine co-dependence. The Annals of Applied Statistics, 12(4), 2359–2378. https://doi.org/10.1214/18-AOAS1156

  • Zhang, J. Z., Xu, W., & Hu, P. (2022). Tightly integrated multiomics-based deep tensor survival model for time-to-event prediction. Bioinformatics, 38(12), 3259–3266.

  • Zhang, Y., & Yang, Q. (2017). A survey on multi-task learning. CoRR, abs/1707.08114. arxiv:1707.08114

  • Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.

  • Zhong, Y., Xu, W., & Gao, X. (2023). HMTL: Heterogeneous Multi-Task Feature Learning. R package version 0.1.0.

  • Zhou, J., Yuan, L., & Liu, J., et al. (2011). A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, KDD ’11, pp. 814–822.

Acknowledgements

X.G.’s research was supported by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC). W.X. was funded by NSERC Grant RGPIN-2017-06672. The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.

Funding

X.G.’s research was supported by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC). W.X. was funded by NSERC Grant RGPIN-2017-06672.

Author information

Contributions

All authors contributed to the design of the research problems and wrote the manuscript. Material preparation and the original draft were completed by YZ. Theoretical analysis and methodology development were conducted by XG and YZ. Data collection and analysis were performed by WX, XG, and YZ.

Corresponding author

Correspondence to Xin Gao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

All authors have agreed to the submission of this paper to the Machine Learning journal.

Additional information

Editor: Jean-Philippe Vert.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs of Theorems in Sect. 2

This section provides the proofs of Theorems 1 and 2. In the following derivations, we assume that all tasks have identical sample sizes n.

1.1 A.1: Proof of Theorem 1

Proof

Lemma 2.4 shows that, for the solution \({\hat{\theta }} \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\) of the estimating equation (2.4), \(({\hat{\theta }} - \theta ^*) \in {\mathcal {C}}(m, \gamma )\) with \(m = c_0Ks\) and \(\gamma = 2\sqrt{K}+1\). Combining this with the results from Lemma 2.2, we obtain the following inequality with probability tending to 1,

$$\begin{aligned} \frac{1}{n} \nabla Q({\hat{\theta }})^T({\hat{\theta }} - \theta ^*) - \left( \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*) + \lambda _n {\hat{z}}\right)^T ({\hat{\theta }} - \theta ^*)&\ge {\kappa _{-}} \Vert {\hat{\theta }} - \theta ^* \Vert ^2_2. \end{aligned}$$
(A1)

We apply Hölder’s inequality in Lemma C.5 to two components in (A1):

$$\begin{aligned} - \nabla {\mathcal {L}}(\theta ^*)^T({\hat{\theta }} - \theta ^*)&\le \Vert \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}} \Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}} \Vert _{2,1} + \Vert \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}^c} \Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}^c} \Vert _{2,1}; \\ - {\hat{z}}^T ({\hat{\theta }} - \theta ^*)&\le \Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} - \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1}. \end{aligned}$$

Plugging back into (A1), we have

$$\begin{aligned} \kappa _{-} \Vert {\hat{\theta }} - \theta ^* \Vert ^2_2 & \le \Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}} \Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}} \Vert _{2,1} + \Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}^c} \Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}^c} \Vert _{2,1} \\ & \quad + \lambda _n\Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty }\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} - \lambda _n \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1}. \end{aligned}$$

According to Lemma 2.1, \(\Vert n^{-1}\nabla {\mathcal {L}}(\theta ^*) \Vert _{2,\infty } \le \lambda _n/2\) with a probability tending to 1. Therefore, the component \((\Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}^c} \Vert _{2,\infty } - \lambda _n )\Vert ({\hat{\theta }} - \theta ^*) _{{\mathcal {E}}^c} \Vert _{2,1} \le 0.\) We simplify the inequality above as follows,

$$\begin{aligned} \kappa _{-} \Vert {\hat{\theta }} - \theta ^* \Vert ^2_2&\le \left( \Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {E}}} \Vert _{2,\infty } + \lambda _n \Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty }\right) \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} . \end{aligned}$$

According to the property of the mixed \(\ell _{2,\infty }\) norm, \(\Vert {\hat{z}}_{{\mathcal {E}}}\Vert _{2,\infty } = \sup _p\Vert {\hat{z}}^{(p)}\Vert _{2} = 1\). In addition, \(\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le \sqrt{\vert {{\mathcal {E}}} \vert }\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2}\) with \(\vert {{\mathcal {E}}} \vert = c_1 s\) for some positive constant \(c_1\). The following inequality can then be obtained

$$\begin{aligned} \kappa _{-} \Vert {\hat{\theta }} - \theta ^* \Vert ^2_2&\le \left( \frac{\lambda _n}{2} + \lambda _n \right) \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} ,\\&\le \frac{3\lambda _n \sqrt{c_1s}}{2} \Vert ({\hat{\theta }} - \theta ^*) \Vert _2. \end{aligned}$$

Therefore, taking the constant \(c_1 = 1\), we have

$$\begin{aligned} \Vert {\hat{\theta }} - \theta ^* \Vert _2&\le \frac{3\lambda _n \sqrt{s}}{2 \kappa _{-}} \end{aligned}$$

In addition, we derive the following error bounds based on Lemma 2.4:

$$\begin{aligned} \Vert {\hat{\theta }} - \theta ^* \Vert _1 \le 4\sqrt{K} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le&4 \sqrt{sK} \Vert ({\hat{\theta }} - \theta ^*)\Vert _2 \le \frac{6\sqrt{K}}{\kappa _{-}} \lambda _n s; \\ \frac{1}{n}(\nabla {\mathcal {L}}({\hat{\theta }}) -\nabla {\mathcal {L}}({\theta }^*) )^T ({\hat{\theta }} - \theta ^* ) \le&\frac{1}{n}\Vert \nabla {\mathcal {L}}({\hat{\theta }}) - \nabla {\mathcal {L}}({\theta }^*) \Vert _{2,\infty } \Vert {\hat{\theta }} - \theta ^* \Vert _{2,1} \\ \le&\left( \Vert \frac{1}{n}\nabla {\mathcal {L}}({\hat{\theta }}) + \lambda _n {\hat{z}}\Vert _{2,\infty } + \Vert \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*) \Vert _{2,\infty } + \Vert \lambda _n {\hat{z}}\Vert _{2,\infty } \right) \Vert {\hat{\theta }} - \theta ^* \Vert _{2,1} \\ =&\left( \Vert \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*) \Vert _{2,\infty } + \lambda _n \right) \Vert {\hat{\theta }} - \theta ^* \Vert _{2,1} \le \frac{9\lambda _n^2 s}{4 \kappa _{-}}. \end{aligned}$$

\(\square\)

1.2 A.2: Proof of Theorem 2

Proof

The derivative equation (2.4) can be partitioned into two sets of equations based on the two sub-spaces of parameters \({\mathcal {S}}\) and \({\mathcal {S}}^c\):

$$\begin{aligned} -\frac{1}{n}\nabla {\mathcal {L}}({\hat{\theta }})_{{\mathcal {S}}}&= \lambda _n {\hat{z}}_{{\mathcal {S}}} , \end{aligned}$$
(A2a)
$$\begin{aligned} -\frac{1}{n}\nabla {\mathcal {L}}({\hat{\theta }})_{{\mathcal {S}}^c}&= \lambda _n {\hat{z}}_{{\mathcal {S}}^c} . \end{aligned}$$
(A2b)

Based on the definition of sub-differential, the sub-differential \({\hat{z}}_{{\mathcal {S}}}\) contains grouped subsets \({\hat{z}}^{(p)} = {\hat{\theta }}^{(p)}/\Vert {\hat{\theta }}^{(p)}\Vert _2\) with \(p \in {\mathcal {S}}\), and \(\max _{p \in {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1\).

According to Lemma 2.2, \({\hat{\theta }}\) is the optimum of the objective function with high probability. Consider an estimator of the form \({\hat{\theta }}_{{\mathcal {S}},0} = ( {\hat{\theta }}_{{\mathcal {S}}}, {{\textbf {0}}}),\) where

$$\begin{aligned} {\hat{\theta }}_{{\mathcal {S}},0} = \underset{\theta = ( {\theta }_{{\mathcal {S}}}, {{\textbf {0}}})}{\arg \min } \{ {\mathcal {L}}(\theta ) + n \lambda _n \Vert \theta \Vert _{2,1}\} . \end{aligned}$$
(A3)

If the estimator \({\hat{\theta }}_{{\mathcal {S}},0}\) satisfies the conditions (A2a) and (A2b), then with high probability, \({\hat{\theta }}_{{\mathcal {S}},0}\) is the local optimal solution \({\hat{\theta }}\) to Equation (2.4).

We expand the score function using the mean value theorem as follows

$$\begin{aligned} \frac{1}{n}\nabla {\mathcal {L}}({\hat{\theta }}_{{\mathcal {S}},0})&= \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*) +\frac{1}{n}\nabla {\mathcal {L}}({\hat{\theta }}_{{\mathcal {S}},0}) - \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*) = \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*) + \frac{1}{n}\nabla ^2 {\mathcal {L}}({\tilde{\theta }}) {\hat{\Delta }} \\&= \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*) + \frac{1}{n}\nabla ^2 {\mathcal {L}}({\theta }^*) {\hat{\Delta }} + \underbrace{\left( \frac{1}{n}\nabla ^2 {\mathcal {L}}({\tilde{\theta }}) - \frac{1}{n}\nabla ^2 {\mathcal {L}}({\theta }^*)\right) {\hat{\Delta }}}_{{\mathcal {R}}}, \end{aligned}$$

where \({\hat{\Delta }} = ({\hat{\theta }}_{{\mathcal {S}},0} - \theta ^*)\), \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}_{{\mathcal {S}}}\) for some \(\alpha \in [0,1]\).

Thus, we write the equations (A2a) and (A2b) in block format with solution \({\hat{\theta }}_{{\mathcal {S}},0}\)

$$\begin{aligned} \frac{1}{n} \begin{bmatrix} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}&{} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}^c} \\ \nabla ^2 {\mathcal {L}}(\theta ^{*})_{{\mathcal {S}}^c{\mathcal {S}}} &{} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{{\mathcal {S}}^c{\mathcal {S}}^c} \end{bmatrix} \begin{pmatrix} {\hat{\Delta }}_{{\mathcal {S}}} \\ {{\textbf {0}}} \end{pmatrix} + \frac{1}{n} \begin{pmatrix} \nabla {\mathcal {L}}(\theta ^{*})_{{\mathcal {S}}} \\ \nabla {\mathcal {L}}(\theta ^{*})_{{\mathcal {S}}^c} \end{pmatrix} + \begin{pmatrix} {\mathcal {R}}_{{\mathcal {S}}} +\lambda _n {\hat{z}}_{{\mathcal {S}}} \\ {\mathcal {R}}_{{\mathcal {S}}^c} +\lambda _n {\hat{z}}_{{\mathcal {S}}^c} \end{pmatrix} = {{\textbf {0}}}. \end{aligned}$$

According to Lemma C.4, the sub-matrix \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\) is invertible with high probability. Thus, we obtain the difference block \(\Delta _{{\mathcal {S}}}\) by solving

$$\begin{aligned} \Delta _{{\mathcal {S}}} = {\hat{\theta }}_{{\mathcal {S}}} - \theta ^*_{{\mathcal {S}}} = - \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \left( \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*)_{{\mathcal {S}}} + \lambda _n {\hat{z}}_{{\mathcal {S}}} + {\mathcal {R}}_{{\mathcal {S}}}\right) . \end{aligned}$$

Next, we show that the elements of the remainder vector \({\mathcal {R}}\) can be expanded as follows

$$\begin{aligned} {\mathcal {R}}_{kp}&= \big (\frac{\partial ^2 \ell _{k}({\tilde{\theta }};Y_k)}{ \partial \theta ^T \partial \theta _{kp} } - \frac{\partial ^2 \ell _{k}({\theta }^*;Y_k)}{\partial \theta ^T \partial \theta _{kp}} \big ) {\hat{\Delta }} = {\tilde{\Delta }}^T \big (\frac{\partial ^3 \ell _{k}( {\theta }^*;Y_k)}{\partial \theta \partial \theta ^T \partial \theta _{kp} }\big ) {\hat{\Delta }}, \end{aligned}$$

with \({\tilde{\Delta }} = ({\tilde{\theta }} - \theta ^*) = (1-\alpha ) {\hat{\Delta }}\). Let \(\nabla _{kp} {\mathcal {H}}^* = {\partial ^3 \ell _{k}({\theta }^*;Y_k)}/{\partial \theta \partial \theta ^T \partial \theta _{kp} }\), where \(\nabla _{kp} {\mathcal {H}}^*\) is a \(Kp_n \times Kp_n\) matrix. By a derivation similar to that in the proof of Lemma C.2, all elements of \(\nabla _{kp} {\mathcal {H}}^*\) follow sub-exponential distributions. Thus, we show that for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\),

$$\begin{aligned} {\mathcal {R}}_{kp}&= (1 - \alpha ) {\hat{\Delta }}^T_{{\mathcal {S}}} \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} {\hat{\Delta }}_{{\mathcal {S}}} \le (1 - \alpha ) \Vert {\hat{\Delta }}_{{\mathcal {S}}}\Vert _2^2 {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 \\&\overset{(i)}{\le }\ {\mathcal {W}}^* \Vert {\hat{\Delta }}_{{\mathcal {S}}}\Vert _2^2 \overset{(ii)}{\le }\ \frac{9 ({\mathcal {W}}^*+\delta )}{4\kappa _{-}^2} \lambda _n^2 s. \end{aligned}$$

Step (i) follows from the sub-exponential condition on the elements of \(\nabla _{kp} {\mathcal {H}}^*\). For some small \(\delta\) and a universal constant C,

$$\begin{aligned} P( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 \ge {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| E(\nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ) \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 + \delta ) \le 2K\exp \left\{ - C\frac{\delta ^2 n}{Ks^2} + 2\log (s)\right\} . \end{aligned}$$

According to Assumption 2.2, \({\mathcal {W}}^* \ge {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| E(\nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ) \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\). Thus, \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \nabla _{kp} {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2\le ({\mathcal {W}}^*+\delta )\) with high probability. According to Theorem 1, \(\Vert {\hat{\Delta }}\Vert _2^2 \le 9\lambda _n^2\,s/(2\kappa _{-})^{2}\). This leads to the result in step (ii).

Combining the results above, we show that with a probability larger than \(1 - 2 p_n^{-d} - 4K\exp \{ - C^\prime s^{-2}n + \log (p_n) \}\),

$$\begin{aligned} \Vert \Delta _{{\mathcal {S}}} \Vert _\infty&= \Vert \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \left( \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*)_{{\mathcal {S}}} + \lambda _n {\hat{z}}_{{\mathcal {S}}} + {\mathcal {R}}_{{\mathcal {S}}}\right) \Vert _\infty \\&\le \frac{ \sqrt{s}}{\kappa _{-}} \big (\Vert \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*)_{{\mathcal {S}}}\Vert _{2,\infty } + \lambda _n \Vert {\hat{z}}_{{\mathcal {S}}}\Vert _\infty + \Vert {\mathcal {R}}_{{\mathcal {S}}}\Vert _\infty \big ) \le \frac{3\sqrt{s}}{2\kappa _{-}} \lambda _n \le \min _{k;p\in {\mathcal {S}}} \vert \theta ^*_{kp}\vert , \end{aligned}$$

for some constant \(C^\prime > 0.\) This implies \(\text {sign}({\hat{\theta }}_S)=\text {sign}(\theta ^*_S).\)

Next, we show that \(\max _{p \subset {\mathcal {S}}^c } \Vert {\hat{z}}^{(p)}\Vert _2 < 1,\) so that the KKT conditions are satisfied. The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) can be calculated from the block equation above,

$$\begin{aligned} \begin{aligned} {\hat{z}}_{{\mathcal {S}}^c} = - \frac{1}{\lambda _n} ( \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c} +&{\mathcal {R}}_{{\mathcal {S}}^c} - {\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} } \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \\&( \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*)_{{\mathcal {S}}} + \lambda _n {\hat{z}}_{{\mathcal {S}}} + {\mathcal {R}}_{{\mathcal {S}}} ) ). \end{aligned} \end{aligned}$$
(A4)

The sub-differential \({\hat{z}}_{{\mathcal {S}}^c}\) in (A4) can be decomposed into three components

$$\begin{aligned} {\hat{z}}_{{\mathcal {S}}^c} = \frac{1}{\lambda _n}(&\underbrace{ {\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} } \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \frac{1}{n}\nabla {\mathcal {L}}({\theta }^*)_{{\mathcal {S}}} - \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c} }_{{\mathcal {I}}_1} \nonumber \\&+ \underbrace{ {\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} } \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} {\mathcal {R}}_{{\mathcal {S}}} - {\mathcal {R}}_{{\mathcal {S}}^c} }_{{\mathcal {I}}_2} \nonumber \\&+ \underbrace{ \lambda _n {\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} } \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} {\hat{z}}_{{\mathcal {S}}} }_{{\mathcal {I}}_3} ). \end{aligned}$$
(A5)

The sub-differential can be grouped as \({\hat{z}}^{{(p)}}\) with \(p \subset {\mathcal {S}}^c\).

Based on Lemmas 2.1 and 2.3, the following upper bound can be obtained with a probability at least \(1 - 2\exp \{- d\log (p_n)\} -4 \exp \{- C_0 s^{-3} \xi ^2 n + 2 \log (K p_n) \}\) for some constants \(d > 1\) and \(C_0 >0\),

$$\begin{aligned} \max _{p \subset {\mathcal {S}}^c} \big \Vert {\mathcal {I}}_1^{(p)} \big \Vert _2 \le&\max _{p \subset {\mathcal {S}}^c} \big \Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \big \Vert _2 \\&+ \max _{p \subset {\mathcal {S}}^c} \big \Vert \big \{\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*)_{{\mathcal {S}}}\big \}^{(p)} \big \Vert _2 \\ \le&\max _{p \subset {\mathcal {S}}^c} \big \Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \big \Vert _2 \\&+ \sqrt{K} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \big \Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*)_{{\mathcal {S}}} \big \Vert _\infty \\ \le&\frac{\xi }{4}\lambda _n + \frac{\xi }{4}\left(1-\frac{\xi }{2}\right)\lambda _n < \frac{\xi }{2}\lambda _n. \end{aligned}$$

For the remainder component, we have

$$\begin{aligned} \max _{p \subset {\mathcal {S}}^c} \Vert {\mathcal {I}}_2^{(p)} \Vert _2 \le&\sqrt{K} \Vert {\mathcal {I}}_2 \Vert _\infty \nonumber \\ \le&\sqrt{K} \left( \Vert {\mathcal {R}}_{{\mathcal {S}}^c}\Vert _\infty + \Vert {\mathcal {R}}_{{\mathcal {S}}}\Vert _\infty {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} } \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \right) \nonumber \\ \le&\dfrac{9{\mathcal {W}}^*}{4\kappa _{-}^2} \lambda _n^2 s \sqrt{K} = {\mathcal {O}}\left( \frac{s\log (p_n)}{n}\right) \rightarrow o(1). \end{aligned}$$
(A6)

Similarly, we show that the mixed norm of \({\mathcal {I}}_3\) can be bounded,

$$\begin{aligned} \max _{p \subset {\mathcal {S}}^c} \Vert {\mathcal {I}}_3^{(p)} \Vert _2 =&\max _{p \subset {\mathcal {S}}^c} \lambda _n \big \Vert \big \{ \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{{\mathcal {S}}^c{\mathcal {S}}} \left( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^{*})_{\mathcal{S}\mathcal{S}}\right) ^{-1} {\hat{z}}_{\mathcal {S}} \big \}^{(p)} \big \Vert _2 \le \lambda _n \left(1-\frac{1}{2}\xi \right). \end{aligned}$$

By adding the three components, we show that

$$\begin{aligned} \max _{p \subset {\mathcal {S}}^c} \Vert {\hat{z}}^{(p)} \Vert _2 \le&\max _{p \subset {\mathcal {S}}^c} \frac{1}{\lambda _n} (\Vert {\mathcal {I}}_1^{(p)} \Vert _2 + \Vert {\mathcal {I}}_2^{(p)} \Vert _2 + \Vert {\mathcal {I}}_3^{(p)} \Vert _2)< 1 -\frac{\xi }{2} + \frac{\xi }{2} < 1. \end{aligned}$$

Combining the results above, we have sign(\({\hat{\theta }}\)) \(=\) sign(\(\theta ^*\)) with a probability tending to 1. \(\square\)

Appendix B: Proofs of Lemmas in Sect. 2

This section provides the proofs of Lemmas 2.1, 2.2, 2.3, and 2.4.

1.1 B.1: Proof of Lemma 2.1

Proof

First, we need to analyze the distributional property of the random variable \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\). Lemma C.1 shows that

$$\begin{aligned} \Vert \frac{1}{\sqrt{n}} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \Vert _{\psi _1} \le {\mathcal {M}}_* , \end{aligned}$$

for some constant \({\mathcal {M}}_*\), and

$$\begin{aligned} \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2&= \left( \sum _{k=1}^K \left( \frac{1}{n} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \right) ^2 \right) ^{1/2} = \left( \frac{1}{n} \sum _{k=1}^K \left( \frac{1}{\sqrt{n}} \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \right) ^2 \right) ^{1/2} . \end{aligned}$$

This result can be used to bound the sub-exponential norm of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) by applying Minkowski’s inequality,

$$\begin{aligned} \Vert \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\Vert _{\psi _1}&= \Vert \left( \frac{1}{n} \sum _{k=1}^K \left( \frac{1}{\sqrt{n}} \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \right) ^2 \right) ^{1/2}\Vert _{\psi _1} \\&= \sup _{m\ge 1} \frac{1}{m}\left( E\left( \left| \left( \frac{1}{n} \sum _{k=1}^K \left( \frac{1}{\sqrt{n}} \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \right) ^2 \right) ^{1/2} \right| ^m \right) \right) ^{1/m} \\&\le \frac{K}{\sqrt{n}} \Vert \frac{1}{\sqrt{n}} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \Vert _{\psi _1} \le {\frac{K}{\sqrt{n}}} {\mathcal {M}}_* < \infty . \end{aligned}$$

Furthermore, we can show that

$$\begin{aligned} E( \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2 )&\le \left\{ \frac{1}{n} \sum _{k=1}^K E \left[ \left( \frac{1}{\sqrt{n}} \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \right) ^2 \right] \right\} ^{1/2} \le {\mathcal {M}}_* \sqrt{\frac{K}{n}},\\ var(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2 )&\le \frac{1}{n} \sum _{k=1}^K E \left[ \left( \frac{1}{\sqrt{n}} \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \right) ^2 \right] \le 2{\mathcal {M}}_*^2 {\frac{K}{n}}. \end{aligned}$$

This implies that \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) satisfies the sub-exponential property, such that with small \(\delta\),

$$\begin{aligned} P( \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2 \ge E( \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2 ) + \delta ) \le 2 \exp \left\{ - \alpha \frac{\delta ^2}{2 K{\mathcal {M}}^2_*} n \right\} . \end{aligned}$$

Since \(\Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*) \Vert _{2,\infty }\) is the supremum of \(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2\) over \(p = 1, 2, \cdots , p_n\), we have

$$\begin{aligned} P\left( \sup _p \Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2 \ge E(\Vert n^{-1} \nabla {\mathcal {L}}( {\theta }^*)^{(p)} \Vert _2) + \delta \right)&\le 2 p_n \exp \left\{ - \alpha \frac{\delta ^2 n}{2K {\mathcal {M}}^2_*} \right\} . \end{aligned}$$

By combining all the results above, with \(\delta = {\mathcal {M}}_*\sqrt{2K(1+d)\log (p _n)/(\alpha n)}\) for some constant \(d > 1\), we show that with a probability at least \(1 - 2 p_n^{-d}\),

$$\begin{aligned} \Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*) \Vert _{2,\infty } \le {\mathcal {M}}_* \left( \sqrt{\frac{K}{n}} + \sqrt{\frac{2K(1+d)\log (p_n)}{ \alpha n}} \right) . \end{aligned}$$
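
To make the stated probability explicit, this choice of \(\delta\) turns the exponent in the union bound above into \(-(1+d)\log (p_n)\):

$$\begin{aligned} 2 p_n \exp \left\{ - \alpha \frac{\delta ^2 n}{2K {\mathcal {M}}^2_*} \right\} = 2 p_n \exp \left\{ - (1+d)\log (p_n) \right\} = 2 p_n^{-d}. \end{aligned}$$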

In addition, Lemma C.1 shows that the score function satisfies the sub-exponential condition. Therefore, we have

$$\begin{aligned} P\left( \Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*) \Vert _\infty \ge \varepsilon \right) \le Kp_n \max _{p} P\left( \frac{1}{n} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \ge \varepsilon \right) \le 2Kp_n \exp \left\{ -\frac{\varepsilon ^2}{{\mathcal {M}}_*^2}n\right\} , \end{aligned}$$

which implies

$$\begin{aligned} \Vert \frac{1}{n} \nabla {\mathcal {L}}(\theta ^*) \Vert _\infty \le {\mathcal {M}}_* \left( \sqrt{\frac{1}{n}} + \sqrt{ \frac{2(d+1)\log (p_n)}{(\alpha n)}} \right) , \end{aligned}$$

with a probability at least \(1 - 2\exp \{-d\log (p_n)+\log (K)\}\) as claimed in (2.5).

The second part of Lemma 2.1 shows that the difference between the random Hessian and its expectation is bounded. When the tasks are modeled with canonical links and the response variables are from the exponential family, the Hessian matrix is deterministic, and \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*) = H(\theta ^*)\). In the general case, the entries of the random Hessian of the composite quasi-likelihood are

$$\begin{aligned} \frac{1}{n} \frac{\partial ^2}{\partial \theta _{kp}\partial \theta _{kp^{'}} } {\mathcal {L}}(\theta ^* ) =&\underbrace{\frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \frac{1}{\phi _k V(g_k^{-1} (\eta _{ki}^*))} \bigg \{\frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \frac{ \partial \eta _{ki} }{\partial \theta _{kp}} \bigg \} \bigg \{ \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \frac{ \partial \eta _{ki} }{\partial \theta _{kp^{'}}}\bigg \} }_{{\mathcal {I}}_1} \\&- \frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \underbrace{\bigg ( y_{ki} - g_k^{-1}(\eta _{ki}^*) \bigg )}_{{\mathcal {I}}_2} \underbrace{\frac{\partial }{\partial \theta _{kp^{'}}} \bigg \{ \frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}^*))} \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}\frac{ \partial \eta _{ki} }{\partial \theta _{kp}} \bigg \}}_{{\mathcal {I}}_3}. \end{aligned}$$

The component \({\mathcal {I}}_1\) is equal to the corresponding element in the sensitivity matrix \(H(\theta ^*)\). For some special link functions, the component \({\mathcal {I}}_3\) is exactly zero. For models under the general quasi-likelihood settings, we can show that the component \({\mathcal {I}}_3\) is bounded by a universal constant \({\mathcal {K}}>0\) across all tasks, by a derivation similar to that of Lemma C.2. Based on Assumption 2.5, the variables \(n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^* )_{[kp,kp^{'}]} - H(\theta ^*)_{[kp,kp^{'}]}\) satisfy the sub-exponential condition with mean zero and the \(\psi _1\) norm bounded by \({\mathcal {K}}{\mathcal {M}} < {\mathcal {M}}_*\) for some universal constant \({\mathcal {M}}_*\). Therefore, we have the following concentration result for the random Hessian matrix:

$$\begin{aligned} \sup _{k, p, p^{'}} \left\{ \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*) - H(\theta ^*) \right\} _{[kp,kp^{'}]}&= O_p\left( \sqrt{\frac{\log p_n}{n}}\right) , \end{aligned}$$

for any \(k = 1, 2, \cdots , K\) and \(p, p^{'} = 1, 2, \cdots , p_n\).

\(\square\)

Corollary 3

Under Assumptions 2.3–2.7, if the penalty parameter is chosen as

$$\begin{aligned} \lambda _n \ge \frac{4{\mathcal {M}}_*}{\xi } \left( \sqrt{\frac{K}{n}} + \sqrt{\frac{2(d+1)K\log (p_n)}{\alpha n}} \right) , \end{aligned}$$

then

$$\begin{aligned} \frac{1}{\lambda _n} \Vert \frac{1}{n} \nabla {\mathcal {L}}( {\theta }^*) \Vert _{2,\infty } \le \frac{\xi }{4} \end{aligned}$$

with a probability at least \(1 - 2 \exp \{- d\log (p_n) \}\) for some constant d. The mixed \(\ell _{2,\infty }\) norm is defined in (1.1).
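
As a quick numerical illustration of how this lower bound scales with n, K, and \(p_n\), the following R sketch evaluates the right-hand side of the condition above. The constants \({\mathcal {M}}_*\), \(\alpha\), \(\xi\), and d are unknown in practice, so the default values below are purely illustrative assumptions.

# Lower bound on the penalty parameter lambda_n from Corollary 3.
# M_star, alpha, xi, and d are unknown constants; the defaults are assumed
# values used only to show the scaling in n, K, and p_n.
lambda_lower_bound <- function(n, K, p_n, M_star = 1, alpha = 1, xi = 0.5, d = 2) {
  (4 * M_star / xi) * (sqrt(K / n) + sqrt(2 * (d + 1) * K * log(p_n) / (alpha * n)))
}

lambda_lower_bound(n = 500, K = 3, p_n = 1000)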

1.2 B.2: Proof of Lemma 2.2

Proof

Based on Lemma C.2, the Hessian matrix of the composite quasi-likelihood is given by

$$\begin{aligned} \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta ) =&\frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \{ f _1(\eta _{ki}) - (y_{ki} - g^{-1}_k(\eta _{ki}^*)) f _2(\eta _{ki}) \} x_{ki}x_{ki}^T, \end{aligned}$$

and there exist some positive constants \(\alpha _0, \alpha _1,\) and \(\alpha _2\), such that \(\alpha _0< f _1(\eta _{ki} ) < \alpha _1\) and \(\vert f _2(\eta _{ki} )\vert < \alpha _2.\)

When the parameters are partitioned into subsets corresponding to different tasks, the Hessian matrix is block diagonal. We show that the minimum and maximum eigenvalues of the Hessian matrix are given by

$$\begin{aligned} \min \Lambda \left( \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta ) \right)&= \inf _k \bigg \{ u^T \frac{1}{n} \sum _{i=1}^n \big \{ f _1(\eta _{ki}) - ( y_{ki} - g_k^{-1}(\eta _{ki}^*) ) f _2(\eta _{ki}) \big \} x_{ki}x_{ki}^T u\bigg \} ;\\ \max \Lambda \left( \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta )\right)&= \sup _k \bigg \{ u^T \frac{1}{n} \sum _{i=1}^n \big \{ f _1(\eta _{ki}) - ( y_{ki} - g_k^{-1}(\eta _{ki}^*) ) f _2(\eta _{ki}) \big \} x_{ki}x_{ki}^T u\bigg \} . \end{aligned}$$

We have

$$\begin{aligned} u^T \bigg \{ \frac{1}{n} \sum _{i=1}^n { f _1(\eta _{ki})} x_{ki}x_{ki}^T \bigg \} u =&\frac{1}{n} \sum _{i=1}^n { f _1(\eta _{ki})} u^T x_{ki}x_{ki}^T u \ge {\alpha _0 \rho _{-}} . \end{aligned}$$

We apply Hölder’s inequality and get

$$\begin{aligned} u^T \bigg \{\frac{1}{n} \sum _{i=1}^n( y_{ki} -&g_k^{-1}(\eta _{ki}^*) ) { f _2(\eta _{ki})} x_{ki}x_{ki}^T \bigg \} u \le \max _{k,i}\{ ( x_{ki}^T u)^2 \} \frac{1}{n} \sum _{i=1}^n \vert ( y_{ki} - g_k^{-1}(\eta _{ki}^*) ) { f _2(\eta _{ki})} \vert . \end{aligned}$$

Based on Assumption 2.5, \(\Vert x_{ki}\Vert _\infty \le L\) across all tasks. Since \(\Vert u_{{\mathcal {J}}^c}\Vert _1 \le \gamma \Vert u_{\mathcal {J}}\Vert _1\) with \(\vert {\mathcal {J}}\vert \le m = c_0Ks\), we have

$$\begin{aligned} x_{ki}^T u \le \Vert x_{ki} \Vert _\infty \Vert u \Vert _1 {\le } (1+\gamma ) \Vert x_{ki} \Vert _\infty \Vert u_{\mathcal {J}} \Vert _1 \le (1+\gamma ) \sqrt{\vert {\mathcal {J}} \vert } L. \end{aligned}$$

In addition, the variables \(y_{ki} - g_k^{-1}(\eta _{ki}^*)\) follow sub-exponential distributions based on Assumption 2.5. We obtain that with a probability at least \(1 - 2\exp \{-c\log (p_n)\}\) for some constant \(c = (\alpha _2 {\mathcal {M}})^{-2} >0\),

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \vert y_{ki} - g_k^{-1}(\eta _{ki}^*) \vert { f _2(\eta _{ki})} \le \sqrt{\frac{2 \log p_n}{n}}. \end{aligned}$$

Therefore, there exists some \(\kappa _{-} < \alpha _0\rho _{-}.\) If the sample size is sufficiently large

$$\begin{aligned} n \ge \bigg ( \frac{ c_0(1+\gamma )^2 L^2 K}{ \alpha _0 \rho _{-} - \kappa _{-} }\bigg )^2 2s^2 \log p_n , \end{aligned}$$

then we obtain the lower bound for the minimum eigenvalue of the Hessian matrix

$$\begin{aligned} u^T \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta ) u \ge {\alpha _0 \rho _{-}} - c_0(1+\gamma )^2 L^2 K s \sqrt{\frac{2\log (p_n)}{n}} \ge \kappa _{-} > 0. \end{aligned}$$
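
For completeness, the sample-size condition above implies \(s\sqrt{2\log (p_n)/n} \le (\alpha _0 \rho _{-} - \kappa _{-})/\{c_0(1+\gamma )^2 L^2 K\}\), and hence

$$\begin{aligned} c_0(1+\gamma )^2 L^2 K s \sqrt{\frac{2\log (p_n)}{n}} \le \alpha _0 \rho _{-} - \kappa _{-}, \end{aligned}$$

which is exactly the margin needed in the lower bound above.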

The upper bound can be obtained using a similar approach

$$\begin{aligned} u^T \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta ) u \le {\alpha _1 \rho _{+}} + c_0(1+\gamma )^2L^2 K s \sqrt{\frac{2\log (p_n)}{n} } \le \kappa _{+} < \infty . \end{aligned}$$

Combining the results above, the random Hessian matrix satisfies the restricted eigenvalue condition with high probability.

\(\square\)

1.3 B.3: Proof of Lemma 2.3

Proof

The proof of Lemma 2.3 is analogous to that in Ravikumar et al. (2010). For simplicity, write the sub-matrix of the random Hessian as \({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} = n^{-1}\nabla ^2 {\mathcal {L}}(\theta ^*)_{\mathcal{S}\mathcal{S}}\), and denote the difference of the matrices by \(\Delta H_{\mathcal{S}\mathcal{S}}^* = {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - H(\theta ^*)_{\mathcal{S}\mathcal{S}}\). Because the sub-matrices of the random Hessian are diagonal block matrices, we show that

$$\begin{aligned} H_{\mathcal{S}\mathcal{S}}^* = \text {diag}( _k H_{\mathcal{S}\mathcal{S}}^*)_{k=1}^K, \end{aligned}$$

where the sub-matrix \(_k H_{\mathcal{S}\mathcal{S}}^* \in {\mathbb {R}}^{s\times s}\) represents the kth block in \(H_{\mathcal{S}\mathcal{S}}^*\). The difference between sub-matrices is denoted as \(\Delta _k H_{\mathcal{S}\mathcal{S}}^* = [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} - _kH(\theta ^*)_{\mathcal{S}\mathcal{S}}]\).

We need the concentration result for the inverse matrix difference \([{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1}\). Based on Lemma C.6, for the diagonal block matrix we have \({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } = \sup _k {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [_k{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }\), so that

$$\begin{aligned} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [_k{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }&= {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [ _k H^*_\mathcal{S}\mathcal{S}]^{-1} \Delta _k H^*_\mathcal{S}\mathcal{S} [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \\&\overset{(i)}{\le }\ \sqrt{s} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [ _k H^*_\mathcal{S}\mathcal{S}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H^*_\mathcal{S}\mathcal{S} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{2} \\&\le \frac{\sqrt{s}}{\kappa _{-}} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H^*_\mathcal{S}\mathcal{S} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{2}. \end{aligned}$$

In step (i), we apply the inequality between matrix norms and the Cauchy–Schwarz inequality. We have

$$\begin{aligned} P({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [ {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [ {H}^*_{\mathcal{S}\mathcal{S}}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \ge \varepsilon )&\le K \sup _k P( \frac{\sqrt{s}}{\kappa _{-}} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H^*_\mathcal{S}\mathcal{S} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| [_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} ]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{2} \ge \varepsilon ) \\&\overset{(i)}{\le }\ K \sup _k P( \{ {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H^*_\mathcal{S}\mathcal{S} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 \ge \frac{\varepsilon \kappa _{-}^2}{\sqrt{s}} \} \cup \{ {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H_{\mathcal{S}\mathcal{S}}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 > \varepsilon \} ) \\&\le 2 K\exp \big \{-\frac{\alpha \kappa _{-}^4\varepsilon ^2}{{\mathcal {M}}_*^2 s^3} n + 2 \log (s) \big \} . \end{aligned}$$

Step (i) can be obtained based on derivations C1 and C3. This probability is exponentially small when \(n > c s^3 \log (p_n)\) for some constant c.

We combine all the concentration results and obtain

$$\begin{aligned} {\mathcal {H}}^*_{{\mathcal {S}}^c{\mathcal {S}}}({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}})^{-1}&= [{H}^*_{{\mathcal {S}}^c{\mathcal {S}}} + \Delta H^*_{{\mathcal {S}}^c{\mathcal {S}}}] [[{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} + [{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [{H}^*_{\mathcal{S}\mathcal{S}}]^{-1} ] \\&= \underbrace{{H}^*_{{\mathcal {S}}^c{\mathcal {S}}} (H^*_{\mathcal{S}\mathcal{S}})^{-1}}_{{\mathcal {I}}_1} + \underbrace{{H}^*_{{\mathcal {S}}^c{\mathcal {S}}} ([ {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [ {H}^*_{\mathcal{S}\mathcal{S}}]^{-1})}_{{\mathcal {I}}_2} \\&\quad + \underbrace{\Delta H^*_{{\mathcal {S}}^c{\mathcal {S}}}({H}^*_{\mathcal{S}\mathcal{S}})^{-1}}_{{\mathcal {I}}_3} + \underbrace{\Delta H^*_{{\mathcal {S}}^c{\mathcal {S}}} ([ {\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1} - [ {H}^*_{\mathcal{S}\mathcal{S}}]^{-1})}_{{\mathcal {I}}_4}. \end{aligned}$$

We have the component \(\sqrt{K}{\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_1 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \le (1 - \xi )\) based on Assumption 2.7. For the second component \({\mathcal {I}}_2\), we apply Lemma C.3 to obtain that with a probability at least \(1 - 2 K \exp \{-\frac{\alpha \kappa _{-}^2\varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \}\),

$$\begin{aligned} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_2 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }&\le {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {H}^*_{{\mathcal {S}}^c{\mathcal {S}}}(H^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta H^*_\mathcal{S}\mathcal{S} ({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \\&\le {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {H}^*_{{\mathcal {S}}^c{\mathcal {S}}}(H^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \sup _k \{ {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H^*_\mathcal{S}\mathcal{S} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| (_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \} \\&< \frac{1 - \xi }{\sqrt{K}} \times \bigg \{ \frac{ \kappa _{-}\varepsilon }{\sqrt{s}} \bigg \} \times \bigg \{ \frac{\sqrt{s}}{\kappa _{-}} \bigg \}= \frac{1 - \xi }{\sqrt{K}} \times \varepsilon ^{'}. \end{aligned}$$

Based on Lemmas C.3 and C.4, the concentration results for the components \({\mathcal {I}}_3\) and \({\mathcal {I}}_4\) can be obtained with a probability at least \(1 - 2K \exp \big \{- \frac{\alpha \kappa _{-}^4{\varepsilon ^{'}}^2}{{\mathcal {M}}_*^2\,s^3} n + 2 \log (s) \big \} - 2K \exp \big \{- \frac{\alpha \kappa _{-}^2 \varepsilon ^2}{{\mathcal {M}}_*^2\,s^3} n + \log (s) + \log (p_n -s) \big \}\),

$$\begin{aligned} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_3 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }&\le {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta H^*_{{\mathcal {S}}^c{\mathcal {S}}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \sup _k \{ \sqrt{s}{\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| (_k{H}^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2 \} \le \bigg \{\frac{\varepsilon \kappa _{-}}{\sqrt{s}} \bigg \} \bigg \{ \frac{\sqrt{s}}{\kappa _{-}}\bigg \} = \varepsilon \\ {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {I}}_4 \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty }&\le {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta H^*_{{\mathcal {S}}^c{\mathcal {S}}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta [ H^*_\mathcal{S}\mathcal{S}]^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \le \varepsilon \times \varepsilon ^{'} . \end{aligned}$$

We set \(\varepsilon \le \xi /(4\sqrt{K})\) and \(\varepsilon ^{'} \le \xi\), which leads to

$$\begin{aligned} \sqrt{K} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| {\mathcal {H}}^*_{{\mathcal {S}}^c{\mathcal {S}}} ({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}})^{-1} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty&< 1 - \xi + \sqrt{K}\varepsilon \times \varepsilon ^{'} + (1 - \xi )\varepsilon + \sqrt{K}\varepsilon \\&< (1 -\xi ) + \frac{\xi ^2}{4} + \frac{\xi - \xi ^2}{4} + \frac{\xi }{4} < 1- \frac{1}{2}\xi \end{aligned}$$

with a probability \(1 - 4 K\exp \big \{- C_0 \xi ^2 n/s^3 + 2 \log (p_n) \big \}\) for a universal constant \(C_0 > 0.\) \(\square\)
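
For intuition, the decomposition of \({\mathcal {H}}^*_{{\mathcal {S}}^c{\mathcal {S}}}({\mathcal {H}}^*_{\mathcal{S}\mathcal{S}})^{-1}\) into \({\mathcal {I}}_1\) through \({\mathcal {I}}_4\) is an exact algebraic identity rather than an approximation. The following minimal R sketch (toy dimensions and random matrices of our own choosing, not objects from the paper) verifies it numerically:

```r
## Numerical check that cH_ScS %*% solve(cH_SS) = I1 + I2 + I3 + I4 exactly,
## where cH_* = H_* + D_* plays the role of the sample Hessian blocks.
set.seed(1)
s <- 4; p <- 10                                           # toy |S| and p_n
H_SS   <- crossprod(matrix(rnorm(s * s), s)) + diag(s)    # "population" S x S block
H_ScS  <- matrix(rnorm((p - s) * s), p - s, s)            # "population" S^c x S block
D_SS   <- 0.05 * matrix(rnorm(s * s), s)                  # perturbation Delta H_SS
D_ScS  <- 0.05 * matrix(rnorm((p - s) * s), p - s, s)     # perturbation Delta H_ScS
cH_SS  <- H_SS + D_SS                                     # "sample" blocks
cH_ScS <- H_ScS + D_ScS

lhs <- cH_ScS %*% solve(cH_SS)
I1  <- H_ScS %*% solve(H_SS)
I2  <- H_ScS %*% (solve(cH_SS) - solve(H_SS))
I3  <- D_ScS %*% solve(H_SS)
I4  <- D_ScS %*% (solve(cH_SS) - solve(H_SS))
max(abs(lhs - (I1 + I2 + I3 + I4)))    # effectively zero: the identity is exact
```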

1.4 B.4: Proof of Lemma 2.4

Proof

The first-order partial derivative of the objective function can be expanded by applying the mean value theorem,

$$\begin{aligned} {{\textbf {0}}} = \nabla Q({\hat{\theta }}) = \nabla {\mathcal {L}}(\theta ^*) + \nabla ^2 {\mathcal {L}}({\tilde{\theta }})({\hat{\theta }} - \theta ^*) + n \lambda _n {\hat{z}}, \end{aligned}$$

where \({\tilde{\theta }} = \alpha \theta ^* + (1-\alpha ){\hat{\theta }}\) for some \(\alpha \in (0,1).\) This entails

$$\begin{aligned} \nabla Q({\hat{\theta }})^T({\hat{\theta }} - \theta ^*) = (\nabla {\mathcal {L}}(\theta ^*) + n \lambda _n {\hat{z}})^T ({\hat{\theta }} - \theta ^*) + ({\hat{\theta }} - \theta ^*)^T \nabla ^2 {\mathcal {L}}({\tilde{\theta }})({\hat{\theta }} - \theta ^*) . \end{aligned}$$
(B1)

Based on Lemma C.2, we can show that with a probability tending to 1,

$$\begin{aligned} ({\hat{\theta }} - \theta ^*)^T \frac{1}{n} \nabla ^2 {\mathcal {L}}({\tilde{\theta }} ) ({\hat{\theta }} - \theta ^*) \ge 0. \end{aligned}$$

Thus, we can construct the following inequality from (B1),

$$\begin{aligned} \underbrace{\nabla Q({\hat{\theta }})^T({\hat{\theta }} - \theta ^*)}_{{\mathcal {I}}_1} - \underbrace{\nabla {\mathcal {L}}(\theta ^*)^T({\hat{\theta }} - \theta ^*)}_{{\mathcal {I}}_2} - n\lambda _n \underbrace{{\hat{z}}^T ({\hat{\theta }} - \theta ^*) } _{{\mathcal {I}}_3}\ge 0. \end{aligned}$$
(B2)

For the exact solution \({\hat{\theta }}\), all elements of \(\nabla Q({\hat{\theta }})\) are zero, so the component \({\mathcal {I}}_1 =0\). The elements of the vector \(({\hat{\theta }} - \theta ^*) \in {\mathbb {R}}^{Kp_n}\) can be decomposed into two subsets indexed by \({{\mathcal {E}}}\) and \({{\mathcal {E}}^c}\). By applying Hölder’s inequality in Lemma C.5, the component \({{\mathcal {I}}_2}\) in (B2) can be bounded above as follows

$$\begin{aligned} {{\mathcal {I}}_2} : - \nabla {\mathcal {L}}(\theta ^*)^T ({\hat{\theta }} - \theta ^*) \le&\Vert \nabla {\mathcal {L}}(\theta ^*) \Vert _{2,\infty } \Vert {\hat{\theta }} - \theta ^* \Vert _{2,1} \nonumber \\ =&\Vert \nabla {\mathcal {L}}(\theta ^*) \Vert _{2,\infty } (\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} + \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1}). \end{aligned}$$
(B3)

By definition, if \({\hat{\theta }}^{(p)} \ne {{\textbf {0}}}\), then \({\hat{z}}^{(p)}= {\hat{\theta }}^{(p)} / \Vert {\hat{\theta }}^{(p)}\Vert _2\) and \(\Vert {\hat{z}}^{(p)}\Vert _2 = 1\); if \({\hat{\theta }}^{(p)} = {{\textbf {0}}}\), then \(\Vert {\hat{z}}^{(p)} \Vert _2 < 1\). Since \({\mathcal {S}} \cap {\mathcal {E}}^c = \emptyset\), we have \(\theta _{ {\mathcal {E}}^c }^* = {{\textbf {0}}}\). First, we decompose the term \({\mathcal {I}}_3\) over the two subsets. On the subset \({\mathcal {E}}\),

$$\begin{aligned} -{\hat{z}}^T_{{\mathcal {E}}} ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \le \Vert {\hat{z}}^T_{{\mathcal {E}}} \Vert _{2,\infty } \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1}. \end{aligned}$$

In the complement set \({\mathcal {E}}^c\),

$$\begin{aligned} {\hat{z}}^T_{{\mathcal {E}}^c} ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} = {\hat{z}}^T_{{\mathcal {E}}^c} {\hat{\theta }}_{{\mathcal {E}}^c} \overset{(i)}{=}\ {}&\sum _{\begin{array}{c} {\hat{\theta }}^{(p)}\ne {{\textbf {0}}};\\ p \subseteq {\mathcal {E}}^c \end{array} } \frac{ \Vert {\hat{\theta }}^{(p)} \Vert ^2_2}{ \Vert {\hat{\theta }}^{(p)}\Vert _2} + \sum _{\begin{array}{c} {\hat{\theta }}^{(p)}= {{\textbf {0}}};\\ p \subseteq {\mathcal {E}}^c \end{array}}( {\hat{z}}^{(p)})^T {\hat{\theta }}^{(p)} \\ \overset{(ii)}{=}\&\sum _{\begin{array}{c} {\hat{\theta }}^{(p)}\ne {{\textbf {0}}};\\ p \subseteq {\mathcal {E}}^c \end{array}} \Vert {\hat{\theta }}^{(p)} \Vert _{2} + \sum _{\begin{array}{c} {\hat{\theta }}^{(p)}= {{\textbf {0}}};\\ p \subseteq {\mathcal {E}}^c \end{array}} 0 = \Vert ({\hat{\theta }}- \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1} . \end{aligned}$$

In step (i), we divide the estimator \({\hat{\theta }}_{{\mathcal {E}}^c}\) into its nonzero and zero groups. The expression in step (ii) is therefore identical to the definition of the mixed \(\ell _{2,1}\) norm of \({\hat{\theta }}_{{\mathcal {E}}^c}\).
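
To make step (ii) concrete, the short R sketch below (toy dimensions of our own choosing) constructs the group subgradient \({\hat{z}}\) and confirms that \({\hat{z}}^T {\hat{\theta }}\) recovers the mixed \(\ell _{2,1}\) norm of \({\hat{\theta }}\), with every group of \({\hat{z}}\) having \(\ell _2\) norm at most one:

```r
## Toy check of the group subgradient identity used in step (ii).
set.seed(2)
K <- 3; p_n <- 5
theta_hat <- matrix(rnorm(K * p_n), K, p_n)    # column p holds the group theta^(p) across tasks
theta_hat[, 4:5] <- 0                          # two zero groups

group_norms <- sqrt(colSums(theta_hat^2))
l21 <- sum(group_norms)                        # mixed l_{2,1} norm

z  <- matrix(0, K, p_n)                        # subgradient: theta^(p)/||theta^(p)||_2 on nonzero groups
nz <- group_norms > 0
z[, nz] <- sweep(theta_hat[, nz, drop = FALSE], 2, group_norms[nz], "/")

c(sum(z * theta_hat), l21)                     # the two values coincide
max(sqrt(colSums(z^2)))                        # every group of z has norm <= 1
```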

From the above derivations, the inequality (B2) can be expanded as

$$\begin{aligned} (\lambda _n + \Vert \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty } ) \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \ge (\lambda _n - \Vert \frac{1}{n}\nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty }) \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1} . \end{aligned}$$

Because \(\Vert n^{-1} \nabla {\mathcal {L}}({\theta }^*)\Vert _{2,\infty } \le \lambda _n/2\) with high probability, we have

$$\begin{aligned} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1} \le \frac{\lambda _n + \Vert n^{-1} \nabla {\mathcal {L}}({\theta }^*) \Vert _{2,\infty } }{\lambda _n - \Vert n^{-1} \nabla {\mathcal {L}}({\theta }^*) \Vert _{2,\infty } } \Vert ( {\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1}. \end{aligned}$$

In addition, plugging in the maximum value \(\lambda _n/2\) of \(\Vert n^{-1} \nabla {\mathcal {L}}(\theta ^*)\Vert _{2,\infty }\), we obtain

$$\begin{aligned} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1} \le 3\Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1}. \end{aligned}$$

Based on the relation between the \(\ell _1\) norm and the \(\ell _{2,1}\) norm, we can show that

$$\begin{aligned} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{1} \le \sqrt{K} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}^c} \Vert _{2,1} \le 3\sqrt{K} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{2,1} \le 3\sqrt{K} \Vert ({\hat{\theta }} - \theta ^*)_{{\mathcal {E}}} \Vert _{1}. \end{aligned}$$

\(\square\)

Appendix C: Technical lemma

Lemma C.1

Based on Assumptions 2.3–2.5, the individual score function satisfies the sub-exponential condition such that, for some universal constant \({\mathcal {M}}_*\),

$$\begin{aligned} \Vert \frac{1}{\sqrt{n}} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \Vert _{\psi _1} \le {\mathcal {M}}_* . \end{aligned}$$

for any \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).

Proof

For each task, the quasi log-likelihood score function is given by

$$\begin{aligned} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} = \sum _{i=1}^n \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} = \sum _{i=1}^n \underbrace{(y_{ki} - g_k^{-1}(\eta _{ki}^*) )}_{{\mathcal {I}}_1} \underbrace{\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}^*))} \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}}_{{\mathcal {I}}_2}\underbrace{\frac{ \partial \eta _{ki} }{\partial \theta _{kp}}}_{{\mathcal {I}}_3} , \end{aligned}$$

for \(k = 1, 2, \cdots , K\) and \(p = 1, 2, \cdots , p_n\).

From Assumptions 2.3–2.5, since the linear predictor satisfies \(\eta _{ki} < K_0\), the variance function \(V(\eta _{ki})\) and the link function \(g_k(\eta _{ki} )\) are well defined and bounded. Thus, the second component \({\mathcal {I}}_2\) is bounded by some constant. In addition, the derivatives of the linear predictor are \({\partial \eta _{ki}/\partial \theta _{kp} } = x_{kpi}\), and \(\sup _{k,p,i}\{x_{kpi}\} \le L < \infty\). Thus, the component \({\mathcal {I}}_3\) is bounded by L.

Based on Assumption 2.5, \({\mathcal {I}}_1 = y_{ki} - g_k^{-1}(\eta _{ki}^*)\) is from a sub-exponential distribution with zero mean and \(\psi _1\) norm bounded above by \({\mathcal {M}}\). Let \({\mathcal {K}}_{ki} = {\mathcal {I}}_2\times {\mathcal {I}}_3.\) The individual score function is given by

$$\begin{aligned} \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} = { ( y_{ki} - g_k^{-1}(\eta _{ki}^*) )} {\mathcal {K}}_{ki}, \end{aligned}$$

where \({\mathcal {K}}_{ki}< {\mathcal {K}} < \infty\) for some universal constant \({\mathcal {K}}\) common to all tasks. We obtain that the \(\psi _1\) norm of the individual score function is bounded as follows

$$\begin{aligned} \Vert \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \Vert _{\psi _1}&= \sup _{m \ge 1} \frac{1}{m}\left( E \vert \frac{\partial \ell _{ki}(\theta ^*_k; y_{ki}) }{\partial \theta _{kp}} \vert ^m \right) ^{1/m} \\&\le \sup _{m \ge 1} {\mathcal {K}}_{ki} \frac{1}{m}\left( E \vert g_k^{-1}(\eta _{ki}^*) - y_{ki} \vert ^m \right) ^{1/m} \\&\le \sup _{p} {\mathcal {K}} \Vert g_k^{-1}(\eta _{ki}) - y_{ki} \Vert _{\psi _1} \le \mathcal{K}\mathcal{M}. \end{aligned}$$

Based on the properties of sub-exponential distributions (Wainwright, 2019), the \(\psi _1\) norm of \(n^{-1/2} {\partial \ell _{k}(\theta ^*_k; Y_k) }/{\partial \theta _{kp}}\) can be bounded by some constant \({\mathcal {M}}_*\ge \mathcal{K}\mathcal{M}\), such that

$$\begin{aligned} \Vert \frac{1}{\sqrt{n}} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \Vert _{\psi _1} = \sup _{m\ge 1} \frac{1}{m} \left( E \left[ \left( \frac{1}{\sqrt{n}} \frac{\partial \ell _{k}(\theta ^*_k; Y_k) }{\partial \theta _{kp}} \right) ^{m} \right] \right) ^{1/m} \le {\mathcal {M}}_* . \end{aligned}$$

\(\square\)
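
As an illustration of the decomposition \({\mathcal {I}}_1 \times {\mathcal {I}}_2 \times {\mathcal {I}}_3\), consider a single logistic task with the canonical logit link and \(\phi _k = 1\) (an illustrative special case of our own choosing, not the general setting). There \({\mathcal {I}}_2\) collapses to one, the score reduces to \(\sum _i (y_{ki} - g_k^{-1}(\eta _{ki})) x_{kpi}\), and a finite-difference derivative of the log-likelihood confirms it:

```r
## Toy single-task score check for logistic regression with the canonical link.
set.seed(3)
n <- 50; x <- rnorm(n); theta <- 0.7
eta <- x * theta
mu  <- plogis(eta)                   # g^{-1}(eta)
y   <- rbinom(n, 1, mu)

I1 <- y - mu                         # y - g^{-1}(eta*)
I2 <- 1                              # (d g^{-1}/d eta) / (phi V(mu)) = mu(1-mu)/(mu(1-mu)) = 1 here
I3 <- x                              # d eta / d theta
score <- sum(I1 * I2 * I3)

loglik <- function(b) sum(dbinom(y, 1, plogis(x * b), log = TRUE))
c(score, (loglik(theta + 1e-6) - loglik(theta - 1e-6)) / 2e-6)   # agree up to finite-difference error
```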

Lemma C.2

Based on Assumptions 2.3 and 2.5, let \(w_k = 1\). Then there exists some \({\tilde{r}}\) such that, for any \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\), the observed Hessian can be formulated as follows,

$$\begin{aligned} \frac{1}{n} \nabla ^2 {\mathcal {L}} (\theta ) = \frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \{ f _1(\eta _{ki}) - (y_{ki} - g^{-1}_k(\eta _{ki}^*)) f _2(\eta _{ki}) \} x_{ki}x_{ki}^T , \end{aligned}$$

with \(\eta _{ki} = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}\) and \(\eta _{ki}^* = \sum _{p=1}^{p_n} x_{kpi}\theta _{kp}^*\), where the functions of the linear predictor \(f _1(\eta _{ki})\) and \(f _2(\eta _{ki})\) are both bounded. Furthermore, the function \(f _1(\eta _{ki}) > 0\) for \(\theta \in {\mathbb {B}}_{{\tilde{r}}}(\theta ^*)\).

Proof

First, the observed Hessian can be constructed as follows

$$\begin{aligned} \frac{1}{n}\nabla ^2 {\mathcal {L}}(\theta ) =&\frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \bigg \{ {\bigg (\frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}\bigg )^2} \\&- (g_k^{-1}(\eta _{ki}^*)-g_k^{-1}(\eta _{ki}))\times \bigg ( \frac{\partial ^2 g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}^2} - \frac{V^{'}(g_k^{-1}(\eta _{ki}))}{V(g_k^{-1}(\eta _{ki}))} \bigg ( \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \bigg )^2 \bigg ) \bigg \} x_{ki}x_{ki}^T \\&- \frac{1}{n} \sum _{k=1}^K \sum _{i=1}^n \frac{ y_{ki} - g_k^{-1}(\eta _{ki}^*)}{\phi _k V(g_k^{-1}(\eta _{ki}))} \bigg ( \frac{\partial ^2 g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}^2} - \frac{V^{'}(g_k^{-1}(\eta _{ki}))}{V(g_k^{-1}(\eta _{ki}))} \bigg ( \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \bigg )^2 \bigg ) x_{ki}x_{ki}^T . \end{aligned}$$

Therefore, we can set

$$\begin{aligned} f _1(\eta _{ki}) =&\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \bigg \{ {\bigg (\frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}\bigg )^2} \\&- (g_k^{-1}(\eta _{ki}^*)-g_k^{-1}(\eta _{ki}))\times \bigg ( \frac{\partial ^2 g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}^2} - \frac{V^{'}(g_k^{-1}(\eta _{ki}))}{V(g_k^{-1}(\eta _{ki}))} \bigg ( \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \bigg )^2 \bigg ) \bigg \} \end{aligned}$$

and by applying the first-order approximation

$$\begin{aligned} g_k^{-1}(\eta _{ki}^*)-g_k^{-1}(\eta _{ki}) \approx \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}( \eta _{ki}^*- \eta _{ki}) , \end{aligned}$$

We can further show that

$$\begin{aligned} f _1(\eta _{ki}) =&\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \bigg \{ {\bigg (\frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}\bigg )^2} \\&- \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}}( \eta _{ki}^*- \eta _{ki})\times \bigg ( \frac{\partial ^2 g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}^2} - \frac{V^{'}(g_k^{-1}(\eta _{ki}))}{V(g_k^{-1}(\eta _{ki}))} \bigg ( \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \bigg )^2 \bigg ) \bigg \} . \end{aligned}$$

Based on Assumption 2.3, there exist some positive constants \(K_1\), \(K_2\), \(K_3\), \(K_4\), \(K_5\), and \(K_6\) such that

$$\begin{aligned} K_1 \le \max _{k,i} \bigg \vert \frac{\partial g_k^{-1}(\eta )}{ \partial \eta } \bigg \vert _{\eta = \eta _{ki}} \le K_2, \text { and } \max _{k,i} \bigg \vert \frac{\partial ^2 g_k^{-1}(\eta )}{ \partial \eta ^2 } \bigg \vert _{\eta = \eta _{ki}} \le K_3. \end{aligned}$$

Since the variance function has a polynomial form in the mean, we also have

$$\begin{aligned} K_4 \le V_k( g_k^{-1}(\eta _{ki} )) \le K_5 \text { and } V_k^\prime ( g_k^{-1}(\eta _{ki} )) \le K_6. \end{aligned}$$

Therefore, the function \(f _1(\eta _{ki})\) is bounded below by

$$\begin{aligned} f _1(\eta _{ki}) \ge&\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \left( K_1^2 - K_2( K_3 + K_2^2 K_6/K_4 ) \vert \eta _{ki}^*- \eta _{ki} \vert \right) \\ \ge&\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \left( K_1^2 - K_2( K_3 + K_2^2 K_6/K_4 ) L \Vert \theta - \theta ^*\Vert _1 \right) . \end{aligned}$$

Therefore, as \(\Vert \theta - \theta ^*\Vert _1 \le r\) and \(\Vert x_{ki}\Vert _\infty \le L\), we can set \({\tilde{r}} = \min \{r, K^\prime K_1^2 \}\) with the constant \(K^\prime = 1/(L K_2( K_3 + K_2^2 K_6/K_4 ))\), and we can show that \(0< f _1(\eta _{ki}) < \infty\). In addition, we can also set

$$\begin{aligned} f _2(\eta _{ki}) =&\frac{1}{\phi _k V(g_k^{-1}(\eta _{ki}))} \left( \frac{\partial ^2 g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}^2} - \frac{V^{'}(g_k^{-1}(\eta _{ki}))}{V(g_k^{-1}(\eta _{ki}))} \left( \frac{\partial g_k^{-1}(\eta _{ki})}{ \partial \eta _{ki}} \right) ^2 \right) , \end{aligned}$$

which is a bounded function based on Assumption 2.3. \(\square\)
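
For a concrete special case, take a single logistic task with the canonical logit link and \({\mathcal {L}}\) the negative log-likelihood (an illustrative choice of ours, consistent with the positivity of \(f _1\) used above). There the term defining \(f _2\) vanishes, so the observed Hessian reduces to the usual logistic information matrix \(n^{-1}\sum _i \mu _i(1-\mu _i) x_i x_i^T\); the R sketch below checks this against a finite-difference second derivative:

```r
## Toy check of the Hessian decomposition for logistic regression with the canonical link.
set.seed(4)
n <- 200; x <- rnorm(n); theta <- 0.5
y <- rbinom(n, 1, plogis(x * theta))

negloglik <- function(b) -sum(dbinom(y, 1, plogis(x * b), log = TRUE))

mu <- plogis(x * theta)
hess_formula <- sum(mu * (1 - mu) * x^2) / n     # (1/n) sum f_1(eta_i) x_i^2 with f_1 = mu(1-mu)

h <- 1e-4                                        # central second difference
hess_numeric <- (negloglik(theta + h) - 2 * negloglik(theta) + negloglik(theta - h)) / h^2 / n
c(hess_formula, hess_numeric)                    # agree to numerical precision
```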

Lemma C.3

Under Assumptions 2.3–2.5, for some positive constants \(\alpha\) and \(\varepsilon\),

$$\begin{aligned} P\bigg ({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta )_{\mathcal{S}\mathcal{S}} - H(\theta ^*)_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \le \varepsilon \bigg ) \ge&1 - 2K \exp \bigg \{- \alpha \frac{\varepsilon ^2}{({\mathcal {M}}_*s)^2} n + 2 \log (s) \bigg \}, \ \\ P\bigg ({\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta )_{\mathcal{S}\mathcal{S}^c} - H(\theta ^*)_{\mathcal{S}\mathcal{S}^c} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_{\infty } \le \varepsilon \bigg ) \ge&1 - 2 K\exp \bigg \{- \alpha \frac{\varepsilon ^2}{({\mathcal {M}}_*s)^2} n + \log (s(p_n - s )) \bigg \} . \end{aligned}$$

Proof

With the same notation as in the proof of Lemma 2.3, we can show that \(\Delta H_{\mathcal{S}\mathcal{S}}^* = \text {diag}( \Delta _k H_{\mathcal{S}\mathcal{S}}^*)_{k=1}^K\). For any \(\varepsilon > 0\),

$$\begin{aligned} P\left( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta H_{\mathcal{S}\mathcal{S}}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty> \varepsilon \right)&\overset{(i)}{=}\ P\left( \sup _k {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H_{\mathcal{S}\mathcal{S}}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty> \varepsilon \right) \le Ks^2 \sup _{k, p, p^{'} } P\left( \vert \Delta _k H_{[p,p^{'}]}^*\vert > \frac{\varepsilon }{s}\right) \\&\overset{(ii)}{\le } 2K \exp \left\{ -\alpha \min \left\{ \frac{\varepsilon ^2}{({\mathcal {M}}_* s)^2}, \frac{\varepsilon }{{\mathcal {M}}_* s} \right\} n + 2 \log (s) \right\} . \end{aligned}$$

In step (i), we apply the result in Lemma C.6. In step (ii), we apply the concentration result for the Hessian matrix based on Lemma 2.1. Using the same method, we derive that

$$\begin{aligned} P\left( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta H_{\mathcal{S}\mathcal{S}^c}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty> \varepsilon \right)&\le K \sup _k P\left( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H_{\mathcal{S}\mathcal{S}^c}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty > \varepsilon \right) \\&\le 2K \exp \left\{ -\alpha \min \left\{ \frac{\varepsilon ^2}{({\mathcal {M}}_* s)^2}, \frac{\varepsilon }{{\mathcal {M}}_* s} \right\} n + \log (p_n - s) + \log (s) \right\}. \end{aligned}$$

\(\square\)
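
The qualitative content of Lemma C.3 can also be seen in a small Monte Carlo experiment; the sketch below (a simplified single-task logistic example of our own construction, not the multi-task setting) shows the entrywise deviation of the sample Hessian from its population counterpart shrinking as n grows:

```r
## Monte Carlo illustration of entrywise Hessian concentration (single logistic task).
set.seed(7)
theta <- c(0.5, -0.3)
pop_hessian <- local({                  # approximate the population Hessian with a large sample
  N <- 2e5
  X  <- cbind(rnorm(N), rnorm(N))
  mu <- as.vector(plogis(X %*% theta))
  crossprod(X, (mu * (1 - mu)) * X) / N
})
max_dev <- function(n, reps = 200) {
  devs <- replicate(reps, {
    X  <- cbind(rnorm(n), rnorm(n))
    mu <- as.vector(plogis(X %*% theta))
    H  <- crossprod(X, (mu * (1 - mu)) * X) / n
    max(abs(H - pop_hessian))
  })
  mean(devs)
}
sapply(c(100, 1000, 10000), max_dev)    # average entrywise deviation decreases with n
```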

Lemma C.4

Under Assumptions 2.3–2.6, there exist some positive constants \(\alpha\) and \(\varepsilon\) with \(\varepsilon < \kappa _{-}\),

$$\begin{aligned} P\bigg ( \min \Lambda \bigg ( \frac{1}{n} \nabla ^2 {\mathcal {L}}(\theta ^*)_{\mathcal{S}\mathcal{S}} \bigg ) \ge \kappa _{-} - \varepsilon \bigg ) \ge 1 - 2 K \exp \left\{ - \frac{\alpha \varepsilon ^2}{({\mathcal {M}}_*s)^2}n + 2 \log (s) \right\} . \end{aligned}$$

Proof

With the same notation as in the proofs of Lemmas 2.3 and C.3, we denote the task-specific sub-matrix of the Hessian by \(_kH^*_\mathcal{S}\mathcal{S}\). Lemma 2.2 shows that, with high probability, the eigenvalues of \(H(\theta ^*)\) are bounded and positive. Therefore, for any sub-matrix of the Hessian, we have

$$\begin{aligned} \kappa _{-} \le \min \Lambda (_kH^*_\mathcal{S}\mathcal{S}). \end{aligned}$$
(C1)

Based on the Courant–Fischer variational representation (Ravikumar et al., 2010), we have

$$\begin{aligned} \min \Lambda (_kH^*_\mathcal{S}\mathcal{S})&= \min \Lambda (_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}} + {_k H}^*_{\mathcal{S}\mathcal{S}} - {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}}) \\&= \min _{\Vert x\Vert _2 =1} x^T({_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}} + {_k H}^*_{\mathcal{S}\mathcal{S}} - {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}}) x \\&\le y^T {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}} y + y^T(_kH^*_{\mathcal{S}\mathcal{S}} - {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}})y, \end{aligned}$$

where y is the unit-norm eigenvector of \(_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}\) associated with its smallest eigenvalue. Using condition (C1), we can show that

$$\begin{aligned} y^T {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}} y \ge \min \Lambda ({_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}} ) \ge&\min \Lambda ({_kH}^*_\mathcal{S}\mathcal{S}) - y^T(_kH^*_{\mathcal{S}\mathcal{S}} - {_k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}})y\\ \ge&\kappa _{-} - {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| _kH^*_{\mathcal{S}\mathcal{S}} - { _k{\mathcal {H}}}^*_{\mathcal{S}\mathcal{S}} \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2. \end{aligned}$$

Next, we have

$$\begin{aligned} P( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H_{\mathcal{S}\mathcal{S}}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_2> \varepsilon )&\le P( {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| \Delta _k H_{\mathcal{S}\mathcal{S}}^* \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_F> \varepsilon ) \le s^2 \sup _{k, p, p^{'}} P ( \vert \Delta _k H_{[pp^{'}]}^*\vert > {\varepsilon }/{s} ) \\&\le 2 \exp \left\{ -\alpha \min \left\{\frac{\varepsilon ^2}{({\mathcal {M}}_* s)^2}, \frac{\varepsilon }{{\mathcal {M}}_* s} \right\}n + 2 \log (s) \right\} . \end{aligned}$$

As a result, we can show that with a probability at least \(1 - 2 K \exp \left\{ - \frac{\alpha \varepsilon ^2}{({\mathcal {M}}_*s)^2}n + 2 \log (s) \right\}\),

$$\begin{aligned} \Lambda (_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}) \ge \kappa _{-} - \varepsilon . \end{aligned}$$
(C2)

Furthermore, for \(\varepsilon < \kappa _{-}\) in C2, we set the constant \(\delta = \kappa _{-} - \varepsilon > 0\), such that

$$\begin{aligned} P( \Lambda (_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}) \le \delta ) = P( \Lambda ([_k{\mathcal {H}}^*_{\mathcal{S}\mathcal{S}}]^{-1}) \ge \delta ^{-1} ) . \end{aligned}$$
(C3)

\(\square\)
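
The eigenvalue comparison above is essentially Weyl's inequality for the smallest eigenvalue; a quick R spot-check on random symmetric matrices (toy sizes, our own construction) illustrates it:

```r
## Spot-check: the smallest eigenvalue of a perturbed symmetric matrix drops by at most
## the spectral norm of the perturbation.
set.seed(8)
s <- 5
A <- crossprod(matrix(rnorm(s * s), s)) + diag(s)         # plays the role of _k H*_SS
E <- matrix(rnorm(s * s), s); E <- 0.1 * (E + t(E)) / 2   # symmetric perturbation Delta_k H*_SS
min(eigen(A + E, symmetric = TRUE)$values) >=
  min(eigen(A, symmetric = TRUE)$values) - norm(E, type = "2")   # TRUE
```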

Lemma C.5

Consider vectors \(u, v \in {\mathbb {R}}^{Kp_n}\) double-indexed as \(u = (u_{11}, \cdots , u_{kp}, \cdots , u_{Kp_n})\) and \(v = (v_{11}, \cdots , v_{kp}, \cdots , v_{Kp_n})\) for \(k = 1,2, \cdots , K\) and \(p = 1,2, \cdots , p_n\). Then

$$\begin{aligned} u^T v \le \Vert u\Vert _{2,1}\Vert v\Vert _{2,\infty }. \end{aligned}$$

Proof

We apply the Cauchy–Schwarz inequality within each group and Hölder’s inequality across groups to show that

$$\begin{aligned} u^T v =&\sum _{p=1}^{p_n}\sum _{k=1}^K u_{kp} v_{kp} \le \sum _{p=1}^{p_n} \Vert u^{(p)}\Vert _2\Vert v^{(p)}\Vert _2 \le \Vert u\Vert _{2,1}\Vert v\Vert _{2,\infty }. \end{aligned}$$

\(\square\)
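
A numerical spot-check of the bound (random toy vectors, our own construction):

```r
## Spot-check of u^T v <= ||u||_{2,1} ||v||_{2,inf} with p_n groups of size K.
set.seed(5)
K <- 4; p_n <- 6
u <- matrix(rnorm(K * p_n), K, p_n)    # column p is the group u^(p)
v <- matrix(rnorm(K * p_n), K, p_n)
lhs <- sum(u * v)                                            # u^T v
rhs <- sum(sqrt(colSums(u^2))) * max(sqrt(colSums(v^2)))     # ||u||_{2,1} * ||v||_{2,inf}
lhs <= rhs                                                   # TRUE
```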

Lemma C.6

Suppose a matrix \(A \in {\mathbb {R}}^{Kd\times Kd}\) is block-diagonal, \(A = \text {diag}(A_k)_{k=1}^K\), where every block has the same dimension, \(A_k \in {\mathbb {R}}^{d\times d}\). Then,

$$\begin{aligned} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_1 \le \sup _{k} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A_k \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_1 \text { and } {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \le \sup _{k} {\left| \hspace{-1.0625pt}\left| \hspace{-1.0625pt}\left| A_k \right| \hspace{-1.0625pt}\right| \hspace{-1.0625pt}\right| }_\infty \end{aligned}$$
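
For block-diagonal matrices the two induced norms are in fact attained by the worst block, so the stated bounds hold with equality; a short R check (toy sizes, our own construction):

```r
## Spot-check of Lemma C.6 on a random block-diagonal matrix.
set.seed(6)
K <- 3; d <- 4
blocks <- replicate(K, matrix(rnorm(d * d), d), simplify = FALSE)
A <- matrix(0, K * d, K * d)
for (k in seq_len(K)) {
  idx <- ((k - 1) * d + 1):(k * d)
  A[idx, idx] <- blocks[[k]]
}
norm1   <- function(M) max(colSums(abs(M)))   # induced l_1 norm (max absolute column sum)
norminf <- function(M) max(rowSums(abs(M)))   # induced l_infinity norm (max absolute row sum)
c(norm1(A),   max(sapply(blocks, norm1)))     # equal
c(norminf(A), max(sapply(blocks, norminf)))   # equal
```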

