Reduced-rank multi-label classification

Abstract

Multi-label classification is a natural generalization of classical binary classification in which each instance can be assigned multiple class labels. It differs from multi-class classification in that the multiple class labels are not mutually exclusive. The key challenge is to improve the classification accuracy by incorporating the intrinsic dependency structure among the multiple class labels. In this article we propose to model the dependency structure via a reduced-rank multi-label classification model, and to impose a group lasso regularization for sparse estimation. An alternating optimization scheme is developed to facilitate the computation, in which a constrained manifold optimization technique and a gradient descent algorithm are alternated to maximize the resultant regularized log-likelihood. Various simulated examples and two real applications are presented to demonstrate the effectiveness of the proposed method. More importantly, its asymptotic behavior is quantified in terms of the estimation and variable selection consistencies, as well as the model selection consistency via the Bayesian information criterion.
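
To make the setup concrete, the following is a minimal numerical sketch, not the authors' implementation, of the kind of estimator described above: a per-label Bernoulli (logistic) model with a reduced-rank coefficient matrix \(\mathbf{C}=\mathbf{B}\mathbf{A}\), \(\mathbf{B}\) constrained to have orthonormal columns, and an adaptive group-lasso penalty on the rows of \(\mathbf{B}\). The Bernoulli likelihood, the plain gradient steps, the QR retraction, and all function and variable names are illustrative assumptions; the paper alternates a constrained manifold optimization technique with gradient descent, which the simple retracted step below only mimics.

```python
# A sketch of reduced-rank multi-label classification with a group-lasso
# penalty: q binary labels, coefficient matrix C = B A with B (p x r)
# orthonormal and A (r x q) unconstrained, plus intercepts c0 (length q).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_negloglik(X, Y, B, A, c0, lam, w):
    """Bernoulli negative log-likelihood plus an adaptive group-lasso
    penalty on the rows of B (weights w play the role of 1/||B0^k||)."""
    P = sigmoid(c0 + X @ (B @ A))                       # n x q success probabilities
    nll = -np.sum(Y * np.log(P + 1e-12) + (1.0 - Y) * np.log(1.0 - P + 1e-12))
    return nll + lam * np.sum(w * np.linalg.norm(B, axis=1))

def fit_alternating(X, Y, r, lam, n_iter=200, step=1e-3, seed=0):
    """Alternate gradient steps for (A, c0) with a QR-retracted gradient
    step for B on the Stiefel manifold {B : B^T B = I_r}.  Sketch only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    B, _ = np.linalg.qr(rng.standard_normal((p, r)))    # random orthonormal start
    A = 0.01 * rng.standard_normal((r, q))
    c0 = np.zeros(q)
    # Adaptive weights; in the paper they come from a consistent initial
    # estimate B0, here the starting value is used purely for illustration.
    w = 1.0 / np.maximum(np.linalg.norm(B, axis=1), 1e-8)
    for _ in range(n_iter):
        R = sigmoid(c0 + X @ (B @ A)) - Y               # gradient of the nll w.r.t. the linear predictor
        A -= step * (B.T @ X.T @ R)                     # gradient step for A
        c0 -= step * R.sum(axis=0)                      # gradient step for the intercepts
        # Euclidean gradient for B with a smoothed group-lasso subgradient.
        row_norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-8)
        G = X.T @ R @ A.T + lam * w[:, None] * (B / row_norms)
        # Descent step followed by a QR retraction back to the manifold,
        # with column signs fixed so the factor stays close to the old B.
        Q, Rq = np.linalg.qr(B - step * G)
        B = Q * np.where(np.diag(Rq) < 0.0, -1.0, 1.0)
    return B, A, c0
```

For an n x p design matrix X and an n x q binary label matrix Y, a call such as B, A, c0 = fit_alternating(X, Y, r=2, lam=1.0) returns the estimated factors; rows of B with near-zero norm correspond to predictors screened out by the group-lasso penalty.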


References

  • Barker, M., Rayens, W.: Partial least squares for discrimination. J. Chemom. 17, 166–173 (2003)

  • Barutcuoglu, Z., Schapire, R., Troyanskaya, O.: Hierarchical multi-label prediction of gene function. Bioinformatics 22, 830–836 (2006)

  • Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognit. 37, 1757–1771 (2004)

  • Breiman, L., Friedman, J.H.: Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B 59, 3–54 (1997)

  • Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X. Z., Raich, R., Hadley, S. J. K., Hadley, A. S., Betts, M. G.: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. IEEE International Workshop on Machine Learning for Signal Processing (2012)

  • Chen, L.S., Huang, J.H.: Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. 107, 1533–1545 (2012)

  • Clare, A., King, R.: Knowledge discovery in multi-label phenotype data. 5th European Conference on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Artificial Intelligence, vol. 2168, pp. 42–53 (2001)

  • Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148–155 (1998)

  • Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–353 (1998)

  • Elisseeff, A., Weston, J.: A kernel method for multi-labeled classification. Adv. Neural Inf. Process. Syst. 14, 681–687 (2002)

  • Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical Report, National Taiwan University (2007)

  • Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5, 248–264 (1975)

  • Luaces, O., Díez, J., Barranquero, J., del Coz, J.J., Bahamonde, A.: Binary relevance efficacy for multilabel classification. Prog. Artif. Intell. 4, 303–313 (2012)

  • Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognit. 45, 3084–3104 (2012)

  • Nocedal, J., Yuan, Y.X.: Combining trust region and line search techniques. Adv. Nonlinear Program. 260, 153–175 (1998)

  • Peters, S., Jacob, Y., Denoyer, L., Gallinari, P.: Iterative multi-label multi-relational classification algorithm for complex social networks. Soc. Netw. Anal. Min. 2, 17–29 (2012)

  • Ravikumar, P., Wainwright, M.J., Raskutti, G., Yu, B.: High-dimensional covariance estimation by minimizing \(\ell _1\)-penalized log-determinant divergence. Electron. J. Stat. 5, 935–980 (2011)

  • Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Springer (2009)

  • Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873 (2012)

  • Rothman, A., Bickel, P., Levina, E., Zhu, J.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)

  • Shao, J.: Mathematical Statistics, 2nd edn. Springer, New York (2003)

  • Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)

  • Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 3, 1–13 (2007)

  • Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089 (2011)

  • Wang, H.L.: A note on adaptive group lasso. Comput. Stat. Data Anal. 52, 5277–5286 (2008)

  • Wang, J., Wang, L.: Sparse supervised dimension reduction in high dimensional classification. Electron. J. Stat. 4, 914–931 (2010)

  • Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)

  • Yu, H.F., Jain, P., Kar, P., Dhillon, I.S.: Large-scale multi-label learning with missing labels. Proceedings of the 31st International Conference on Machine Learning (2014)

  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67 (2006)

  • Zhang, M.L., Zhou, Z.H.: A lazy learning approach to multi-label learning. Pattern Recognit. 40, 2038–2048 (2007)

  • Zhou, Z.H., Zhang, M.L.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18, 1338–1351 (2006)

Acknowledgments

JW’s research is partly supported by HK GRF Grant 11302615, CityU SRG Grant 7004244 and CityU Startup Grant 7200380. The authors would like to thank the associate editor and two anonymous referees for their constructive suggestions and comments.

Author information

Corresponding author

Correspondence to Ting Yuan.

Appendix: Technical proofs

Proof of Theorem 1

Since the factorization of \(({\mathbf{B}}^*,{\mathbf{A}}^*)\) in (2) is not unique, as \({\mathbf{B}}^* {\mathbf{A}}^*=({\mathbf{B}}^* \varvec{\Lambda })( \varvec{\Lambda }^T {\mathbf{A}}^*)\) for any orthogonal matrix \(\varvec{\Lambda }\), we denote by \(\mathcal {T}_{{\mathbf{C}}^*}\) the collection of all such \(({\mathbf{B}}^*,{\mathbf{A}}^*)\)’s. In the sequel, \(({\mathbf{B}}^*, {\mathbf{A}}^*)\) refers to any given pair in \(\mathcal {T}_{{\mathbf{C}}^*}\). Let \(\mathbf{\Gamma } = \hbox {vec}(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), \(\mathbf{\Gamma }^* = \hbox {vec}({\mathbf{B}}^*,{\mathbf{A}}^*,{\mathbf{c}}_0^*)\), \(T(\mathbf{\Gamma } ) = l(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), and \(T_p(\mathbf{\Gamma }) = l_p(\mathbf{B},\mathbf{A},\mathbf{c}_0)\). A Taylor expansion of \(T(\mathbf{\Gamma })\) at \(\mathbf{\Gamma }^*\) gives
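
Here \(\mathbf{\Gamma }\) stacks the pr entries of \(\mathbf{B}\), the qr entries of \(\mathbf{A}\) and the q intercepts in \(\mathbf{c}_0\), so that

$$\begin{aligned} \dim (\mathbf{\Gamma }) = pr + qr + q = (p+q)r + q, \end{aligned}$$

which, evaluated at \(r = r^*\), is the parameter count appearing in the probability bound below.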

$$\begin{aligned} T(\mathbf{\Gamma }) =&\, T(\mathbf{\Gamma }^*) + \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*)\\&+ (\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T \frac{1}{2} H(\tilde{\mathbf{\Gamma }}) (\mathbf \Gamma - \mathbf \Gamma ^*), \end{aligned}$$

where \(H({\tilde{\mathbf{\Gamma }}})\) is the Hessian matrix and \(\tilde{\mathbf{\Gamma }}\) lies between \(\mathbf{\Gamma }\) and \(\mathbf{\Gamma }^*\).

The proof proceeds as follows. We first construct a neighborhood of \(\mathbf{\Gamma }^*\) as \(N_n(\gamma , \mathbf{\Gamma }^*) = B_n(\gamma , \mathbf{\Gamma }^*) \bigcap \mathcal {M}_{\mathbf{B}}\), where \(B_n(\gamma , \mathbf{\Gamma }^*) = \{\mathbf{\Gamma }: \Vert I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \Vert _2 \le \gamma /\sqrt{n} \}\) and \(\mathcal {M}_{\mathbf{B}} = \{ \mathbf{\Gamma }: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\). Note that \(B_n(\gamma , \mathbf{\Gamma }^*)\) is a closed and connected ellipsoid, and \(\mathcal {M}_{\mathbf{B}}\) is the product of the Stiefel manifold \(\{\mathbf{B}: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\) and \(\mathcal {R}^{q r +q}\); therefore \(N_n(\gamma , \mathbf{\Gamma }^*)\) is a closed and connected set. Then we show that \(T_p(\mathbf{\Gamma }^*)\) is smaller than \(T_p(\mathbf{\Gamma })\) for any \(\mathbf{\Gamma }\) on the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*)\), which implies that there exists a local minimizer within \(N_n(\gamma , \mathbf{\Gamma }^*)\). Finally, the desired result follows from the fact that \(\mathbf{\Gamma }^* \in N_n(\gamma , \mathbf{\Gamma }^*)\), so that the distance between the local minimizer and \(\mathbf{\Gamma }^*\) is bounded above by \(\gamma /\sqrt{n}\).
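
For reference, the penalized criterion being compared on \(N_n(\gamma , \mathbf{\Gamma }^*)\), as can be read off from the difference displayed below and from the penalty gradient used in the proof of Theorem 2, is

$$\begin{aligned} T_p(\mathbf{\Gamma }) = l_p(\mathbf{B},\mathbf{A},\mathbf{c}_0) = l(\mathbf{B},\mathbf{A},\mathbf{c}_0) + \lambda \sum _{k=1}^{p} \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert {\mathbf{B}}^k\Vert _2, \end{aligned}$$

where \({\mathbf{B}}^k\) and \({\mathbf{B}}_0^k\) denote the k-th rows of \(\mathbf{B}\) and of the initial estimate \({\mathbf{B}}_0\), respectively.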

Let \(\bar{N}_n(\gamma , \mathbf{\Gamma }^*) \) denote the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*) \); then for any \(\mathbf{\Gamma } \in \bar{N}_n(\gamma , \mathbf{\Gamma }^*)\),

$$\begin{aligned} T_p(\mathbf{\Gamma }) - T_p(\mathbf{\Gamma }^*)&= T(\mathbf{\Gamma } ) - T(\mathbf{\Gamma }^*)\\&\quad + \sum _{k=1}^p \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 - \Vert {\mathbf{B}}^{*k}\Vert _2 \right) . \end{aligned}$$

It follows from the fact that \(\Vert {\mathbf{B}}^{*k}\Vert _2 =0\) for \(k>p_0\) and from the triangle inequality that

$$\begin{aligned}&\sum _{k=1}^p \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 - \Vert {\mathbf{B}}^{*k}\Vert _2 \right) \\&\quad \ge \sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 -\Vert {\mathbf{B}}^{*k}\Vert _2 \right) \\&\quad \ge -\sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert {\mathbf{B}}^k-{\mathbf{B}}^{*k}\Vert _2. \end{aligned}$$

Then we have

$$\begin{aligned}&T_p(\mathbf{\Gamma } ) - T_p(\mathbf{\Gamma }^*) \\&\quad \ge T(\mathbf{\Gamma } ) - T(\mathbf{\Gamma }^*) - \sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert \mathbf{\Gamma } - \mathbf{\Gamma }^*\Vert _2 \\&\quad \ge \frac{ \partial { T(\mathbf{\Gamma }^*)}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) + \frac{1}{2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T H(\tilde{\mathbf{\Gamma }}) (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \\&\qquad -\frac{\lambda p_0 }{\sqrt{n}\, \min _{1\le k \le p_0} \Vert {\mathbf{B}}_0^k \Vert _2 } \Vert I_1(\mathbf{\Gamma }^*)^{-1/2}\Vert _2 \, \Vert \sqrt{n}\, I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\Vert _2. \end{aligned}$$

Next we bound each term separately. The first term can be bounded as

$$\begin{aligned}&\frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) \\&\quad = \left( \sqrt{n} I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \right) ^T \left( {n} I_1(\mathbf{\Gamma }^*) \right) ^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \\&\quad \ge -\gamma \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2. \end{aligned}$$

By Markov’s inequality,

$$\begin{aligned}&P\left( \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2 \le \frac{\gamma }{2} \right) \\&\quad \ge 1 - \frac{4}{\gamma ^2} E \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2^2 \\&\quad = 1 - \frac{4}{\gamma ^2} E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T} n^{-1} I_1(\mathbf{\Gamma }^*)^{-1} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right) \\&\quad = 1 - \frac{4}{\gamma ^2} \dim (\mathbf{\Gamma ^*}) = 1-\frac{4(p+q)r^*+4q}{\gamma ^2}, \end{aligned}$$

where the last equality follows from the fact that \( I_1(\mathbf{\Gamma }^*)\) is the Fisher information matrix and \(T(\mathbf{\Gamma })\) is the log-likelihood; this step is spelled out in the display below. It then follows that \(P\left( \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) > -\frac{\gamma ^2}{2} \right) \ge 1-\frac{4(p+q)r^*+4q}{\gamma ^2}\).
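
Spelling out the last equality: the score vector \(\frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }}\) has mean zero and covariance \(n I_1(\mathbf{\Gamma }^*)\), so writing the quadratic form as a trace gives

$$\begin{aligned} E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}\, n^{-1} I_1(\mathbf{\Gamma }^*)^{-1}\, \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right) = \hbox {tr}\left( n^{-1} I_1(\mathbf{\Gamma }^*)^{-1}\, E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }}\, \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T} \right) \right) = \hbox {tr}\left( {\mathbf{I}}_{\dim (\mathbf{\Gamma }^*)} \right) = \dim (\mathbf{\Gamma }^*). \end{aligned}$$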

Since \(\frac{1}{n} H(\tilde{\mathbf{\Gamma }}) \overset{P}{\rightarrow }I_1(\mathbf{\Gamma }^*)\) as \(n\rightarrow \infty \), the second term can be bounded as

$$\begin{aligned}&\frac{1}{2}(\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T H(\tilde{\mathbf{\Gamma }}) (\mathbf \Gamma - \mathbf \Gamma ^*) \\&=\frac{1}{2} \left( \sqrt{n}I_1(\mathbf{\Gamma ^*})^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\right) ^T I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{1}{n} H(\tilde{\mathbf{\Gamma }}) \\&\qquad {\times } I_1(\mathbf{\Gamma }^*)^{-1/2} \left( \sqrt{n}I_1(\mathbf \Gamma ^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \right) \overset{P}{\rightarrow }\frac{\gamma ^2}{2}. \end{aligned}$$

The last term can be bounded as follows. Since \({\mathbf{B}}_0\) is a consistent estimate of some \({\mathbf{B}}^*\), \(\min _{1\le k\le p_0 }\Vert {\mathbf{B}}_0^k \Vert _2 \ge c_3 \) for some \(c_3 >0\). By Assumption C3 there exists \(c_4 > 0\) such that \(\Vert I_1(\mathbf{\Gamma }^*)^{-1/2} \Vert _2 \le c_4\). Along with \(\lambda /\sqrt{n} \rightarrow 0\) as \(n\rightarrow \infty \),

$$\begin{aligned}&\frac{\lambda p_0 }{\sqrt{n} \min _{1\le k \le p_0} \Vert {\mathbf{B}}_0^k \Vert _2 } \Vert I_1(\mathbf{\Gamma }^*)^{-1/2}\Vert _2 \Vert \sqrt{n} I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\Vert _2\\&\quad \le c_4 p_0 \gamma \lambda ({\sqrt{n}} c_3)^{-1} \overset{P}{\rightarrow }0. \end{aligned}$$

Combining the above bounds, for any \(\eta >0\), we can select \(\gamma \) sufficiently large such that for any \(\mathbf{\Gamma } \in \bar{N}_n( \gamma , \mathbf{\Gamma ^*})\), \(P\left( T_p(\mathbf{\Gamma }) - T_p(\mathbf{\Gamma }^*) >0 \right) > 1-\eta \). Therefore there exists at least one local minimizer \(\widehat{\mathbf{\Gamma }}\) of \(T_p(\cdot )\) inside \({N}_n( \gamma , \mathbf{\Gamma ^*})\), and it follows that \(\Vert \widehat{\mathbf{\Gamma }} - \mathbf{\Gamma }^* \Vert _2 \le O(\gamma /\sqrt{n})\), \(\Vert \widehat{\mathbf{c}}_0 - {\mathbf{c}}_0^* \Vert _2 \le O(\gamma /\sqrt{n}) \), \(\Vert \widehat{\mathbf{A}} - \mathbf{A}^* \Vert _F \le O(\gamma /\sqrt{n})\), as well as \(\Vert \widehat{\mathbf{B}} - \mathbf{B}^* \Vert _F \le O(\gamma /\sqrt{n})\). This completes the proof of Theorem 1. \(\square \)

Proof of Theorem 2

First we note that the active set induced by \(\widehat{\mathbf{C}}\) is the same as that induced by \(\widehat{\mathbf{B}}\), in the sense that \(\Vert \widehat{\mathbf{C}}^k\Vert = 0\) if and only if \(\Vert \widehat{\mathbf{B}}^k\Vert = 0\). We now prove this theorem by contradiction. Suppose that there exists some \(k>p_0\) such that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 > 0\). Denote \({\mathbf{G}} = \frac{\partial l_p(\cdot )}{\partial \mathbf{B}}\); then the first-order Karush–Kuhn–Tucker condition on \(\widehat{\mathbf{B}}\in \mathcal {M}_{r^*}^p\) yields \( \widehat{\mathbf{G}} \widehat{\mathbf{B}}^T =\widehat{\mathbf{B}} \widehat{\mathbf{G}}^T\) (Wen and Yin 2013), which, given that \(\widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = {\mathbf{I}}_{r^*}\), leads to \(\widehat{\mathbf{G}} = \widehat{\mathbf{B}}\widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\), as spelled out in the display below. That is, for any k, \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\), where \(\widehat{\mathbf{G}}^k\) and \(\widehat{\mathbf{B}}^k\) are the k-th rows of \(\widehat{\mathbf{G}}\) and \(\widehat{\mathbf{B}}\), respectively. We will then show that \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) and \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) are of different magnitudes, leading to a contradiction.
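
To spell out this step: right-multiplying \(\widehat{\mathbf{G}} \widehat{\mathbf{B}}^T =\widehat{\mathbf{B}} \widehat{\mathbf{G}}^T\) by \(\widehat{\mathbf{B}}\) and using \(\widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = {\mathbf{I}}_{r^*}\) gives

$$\begin{aligned} \widehat{\mathbf{G}} = \widehat{\mathbf{G}}\, \widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = \big (\widehat{\mathbf{G}}\, \widehat{\mathbf{B}}^T\big ) \widehat{\mathbf{B}} = \widehat{\mathbf{B}}\, \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}, \end{aligned}$$

and reading off the k-th row yields \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\).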

On one hand, we have

$$\begin{aligned} \widehat{\mathbf{G}}^k = \frac{ \partial l_p(\widehat{\mathbf{B}},\widehat{\mathbf{A}}, \widehat{\mathbf{c}}_0 )}{\partial {\mathbf{B}}^k}&= \frac{ \partial l (\widehat{\mathbf{B}},\widehat{\mathbf{A}}, \widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k} \\&= \frac{ \partial l ({\mathbf{B}}^*,\widehat{\mathbf{A}},\widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } + \big ( \widehat{\mathbf{B}}^k - {\mathbf{B}}^{*k} \big ) H(\tilde{\mathbf{B}}^k) \\&\quad + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k}, \end{aligned}$$

where the k-th row \(\frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k} = \Vert {\mathbf{B}}_0^k \Vert _2^{-1} \frac{{\widehat{\mathbf{B}}}^k}{\Vert {\widehat{\mathbf{B}}}^k \Vert _2}\), and \(\tilde{\mathbf{B}}^{k}\) is between \({\widehat{\mathbf{B}}}^k\) and \({\mathbf{B}}^{*k}\). By Theorem 1, \(\widehat{\mathbf{B}}\) and \(\widehat{\mathbf{A}}\) are \(\sqrt{n}\)-consistent estimates of some \(\mathbf{B}^*\) and \(\mathbf{A}^*\) in \(\mathcal {T}_{{\mathbf{C}}^*}\), and \(\widehat{\mathbf{c}}_0\) is a \(\sqrt{n}\)-consistent estimate of \({\mathbf{c}}_0^*\); then \(n^{-1} H(\tilde{\mathbf{B}}^{k}) = I_1({\mathbf{B}}^{*k}) + O_p(1/\sqrt{n})\), and \(n^{-1} \frac{ \partial l ({\mathbf{B}}^*,\widehat{\mathbf{A}},\widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } - S_1({\mathbf{B}}^{*k}) = O_p(1/\sqrt{n})\), where the score function \(S_1({\mathbf{B}}^{*k}) = \frac{1}{n} \frac{ \partial }{\partial {\mathbf{B}}^k} E \left( l ({\mathbf{B}}^*,{\mathbf{A}}^*, {\mathbf{c}}^*_0 ) \right) =0\). Consequently, we have

$$\begin{aligned} \widehat{\mathbf{G}}^k \!= \!O_p(\sqrt{n}) \!+\! {\widehat{\mathbf{B}}}^k \left( n I_1({\mathbf{B}}^{*k}) \!+ \!O_p(\sqrt{n}) \right) \!+ \!\frac{ \lambda {\widehat{\mathbf{B}}}^k }{\Vert {\mathbf{B}}_0^k \Vert _2 \Vert {\widehat{\mathbf{B}}}^k\Vert _2}. \end{aligned}$$

Furthermore, as \(\Vert \mathbf{B}_0^k\Vert _2 =O_p(n^{-1/2})\) and \(I_1({\mathbf{B}}^*)\) is positive definite, \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) is of the same order as \(O_p(n) \Vert \widehat{\mathbf{B}}^k\Vert _2\).

On the other hand,

$$\begin{aligned} \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}&= \widehat{\mathbf{B}}^k \left( O_p(\sqrt{n}) \!+ \!\left( n I_1({\mathbf{B}}^{*}) \!+ \!O_p(\sqrt{n}) \right) ({\widehat{\mathbf{B}}} - {{\mathbf{B}}^*} )^T\right. \\&\qquad \qquad \left. + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^T} \right) \widehat{\mathbf{B}}. \end{aligned}$$

Then \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le \Vert \widehat{\mathbf{B}}^k\Vert _2 \big \Vert \big ( O_p(\sqrt{n}) + \big ( n I_1({\mathbf{B}}^{*})+ O_p(\sqrt{n}) \big ) ({\widehat{\mathbf{B}}} - {{\mathbf{B}}^*} )^T+ \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^T} \big ) \widehat{\mathbf{B}} \big \Vert _F\). By Theorem 1, we have \(\Vert \widehat{\mathbf{B}} - {\mathbf{B}}^* \Vert _F = O_p(1/\sqrt{n})\), and thus \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le O_p(\lambda \sqrt{n}) \Vert \widehat{\mathbf{B}}^k\Vert _2\). Since \(\Vert \widehat{\mathbf{B}}^k\Vert _2 >0\) and \(\lambda /\sqrt{n} \rightarrow 0\), it can be concluded that \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) is of smaller magnitude than \(\Vert \widehat{\mathbf{G}}^k\Vert _2\), which contradicts the fact that \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\). This implies that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 = 0\) for all \(k>p_0\) and completes the proof. \(\square \)

Proof of Lemma 1

First we have

$$\begin{aligned} \hbox {BIC}_{\lambda , r} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}&= \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) }{n} - \frac{l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \nonumber \\&\qquad + \frac{\log n}{n}\left( \hbox {df}_{\widehat{\mathbf{C}}_{\lambda , r}} -\hbox {df}_{\widehat{\mathbf{C}}^*} \right) . \end{aligned}$$
(11)
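
Here, as can be read off from (11) and up to an additive constant common to all models, the criterion under comparison takes the form

$$\begin{aligned} \hbox {BIC}_{\lambda , r} = \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) }{n} + \frac{\log n}{n}\, \hbox {df}_{\widehat{\mathbf{C}}_{\lambda , r}}, \end{aligned}$$

with \(\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}\) defined analogously with \(({\mathbf{C}}^*, {\mathbf{c}}_0^*)\) in place of the estimates.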

Since \(\lambda = \lambda _n = \log n\) and \(r = r^*\) satisfy the conditions in Theorems 1 and 2, \((\widehat{\mathbf{c}}_0)_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{c}}_0^*\) and \(\widehat{\mathbf{C}}_{\lambda _n, r^*} = \widehat{\mathbf{B}}_{\lambda _n, r^*}\widehat{\mathbf{A}}_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{C}}^*\). Therefore \(P\left( \hbox {df}_{\widehat{\mathbf{C}}_{\lambda _n, r^*}} =\hbox {df}_{\widehat{\mathbf{C}}^* } \right) \rightarrow 1\) and \( \frac{l \left( \widehat{\mathbf{C}}_{\lambda _n, r^*}, (\widehat{\mathbf{c}}_0)_{\lambda _n, r^*} \right) }{n} - \frac{l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \overset{P}{\rightarrow }0 \), which completes the proof. \(\square \)

Proof of Lemma 2

The proof proceeds by cases:

  1. (i)

    For \((\lambda ,r)\in \Omega _{+}\) such that \(\hbox {df}_{\widehat{\mathbf{C}}_{ {\lambda }, {r}}} > \hbox {df}_{{\mathbf{C}}^*}\), from (11) we have

    $$\begin{aligned}&\hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \ge \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) - l\left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \\&\quad +\frac{\log n}{n}\ge \frac{l({\mathbf{C}}_{m}, {\mathbf{c}}_{0m} )- l({\mathbf{C}}^*,{\mathbf{c}}^*_0)}{n} +\frac{\log n}{n}, \end{aligned}$$

    where \(\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) \) denotes the minimizer of \(l(\cdot )\). By classical asymptotic theory, since p and q are fixed, as \(n\rightarrow \infty \), \(- 2l\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) + 2l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) \overset{D}{\rightarrow }\chi ^2_{(p+1)q}\), so that \(n^{-1}\left( l({\mathbf{C}}_{m}, {\mathbf{c}}_{0m} )- l({\mathbf{C}}^*,{\mathbf{c}}^*_0)\right) = O_p(n^{-1})\) while \(\log n \rightarrow \infty \); it then follows that

    $$\begin{aligned} P \left( \inf _{(\lambda ,r)\in \Omega _{+}} ~ \hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} > 0 \right) \rightarrow 1. \end{aligned}$$
  2. (ii)

    For \((\lambda ,r)\in \Omega _{-}\), we denote \({\mathcal {C}}_{-}= \{ (\mathbf{C},\mathbf{c}_0) : \mathcal {A}_{{\mathbf{C}}} \nsupseteq \mathcal {A}_{{\mathbf{C}}^*}, ~\text {or}~ \mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{{\mathbf{C}}^*} ~\text {and}~ r< r^*\}\); then \(({\widehat{\mathbf{C}}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} ) \in {\mathcal {C}}_{-}\). In (11), for any \((\mathbf{C},\mathbf{c}_0) \in {\mathcal {C}}_{-}\), since the degrees-of-freedom terms are finite, as \(n \rightarrow \infty \),

    $$\begin{aligned}&\hbox {BIC}_{\mathbf{C},\mathbf{c}_0} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \overset{P}{\rightarrow }E(l_1(\mathbf{C},{\mathbf{c}}_0,\mathbf{x},\mathbf{y})) \\&\quad - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) = \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*)^T\\&\qquad \times \frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}~ \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*), \end{aligned}$$

    where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Next we will show \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \) by considering the following two cases.

    1. (a)

      For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}\), we have \(\Vert \mathbf{C}^* - \mathbf{C}\Vert _F \ge \Vert {\mathbf{C}}^{*k} - {\mathbf{C}}^{k}\Vert _2 = \Vert {\mathbf{C}}^{*k} \Vert _2 >0\) for any k such that \(\mathbf{x}_{(k)} \in \mathcal {A}^c_{\mathbf{C}} \bigcap \mathcal {A}_{\mathbf{C}^*}\); hence \(\inf _{\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}} \Vert \mathbf{C}- {\mathbf{C}}^* \Vert _F >0\).

    2. (b)

      For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*}\) and \(\hbox {rank}({\mathbf{C}}) < r^*\), the \(r^*\)-th singular value of \(\mathbf{C}\) vanishes while that of \({\mathbf{C}}^*\) does not, so by Weyl’s inequality for singular values \(\Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _F \ge \Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _2 \ge \sigma _{r^*}({\mathbf{C}}^*) >0\), and hence \(\inf _{\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*},\hbox {rank}({\mathbf{C}}) < r^* } \Vert \mathbf{C}- \mathbf{C}^* \Vert _F > 0\).

Combining both cases, we have \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \), which, together with the fact that \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2} \) is positive definite, bounds the limit of \(\hbox {BIC}_{\mathbf{C},\mathbf{c}_0} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}\) away from zero uniformly over \(\mathcal {C}_{-}\) and implies the desired result. \(\square \)

Proof of Theorem 3

We just need to show that for any \(\epsilon > 0\), \(P \left( E(l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})) - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) \le \epsilon \right) \rightarrow 1\). Then, by the proof of Lemma 2,

$$\begin{aligned}&P\left( \Vert \widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} - {\mathbf{C}}^* \Vert _F+ \Vert (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}} - {\mathbf{c}}_0^*\Vert _2 \le \phi _{\min }\right. \\&\quad \left. \left( \frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2} \right) ^{-1} \epsilon ^{1/2} \right) \rightarrow 1, \end{aligned}$$

where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Since \(\epsilon \) is arbitrary and \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}\) is positive definite with finite eigenvalues, this completes the proof. \(\square \)

It remains to verify the probability statement above. From Lemma 1, for any \(\epsilon >0\), as \(n\rightarrow \infty \), \(P\big (\hbox {BIC}_{\hat{\lambda },\hat{r}} \le \hbox {BIC}_{{\lambda _n},r^*} \le \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} + \epsilon /3 \big ) \rightarrow 1\). Furthermore,

$$\begin{aligned}&E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) \\&= E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} },(\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}},\mathbf{X},\mathbf{Y})\right) - \hbox {BIC}_{{\hat{\lambda },\hat{r}} } + \hbox {BIC}_{{\hat{\lambda },\hat{r}} }\\&\qquad -\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}+\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}-E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) . \end{aligned}$$

When n is large enough, with probability tending to 1, \(E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} }, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - \hbox {BIC}_{{\hat{\lambda },\hat{r}} } \le \epsilon /3\) and \(\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}-E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}^*_0, \mathbf{X},\mathbf{Y})\right) \le \epsilon /3\), and combining these with the bound from Lemma 1 implies the result, as shown below.
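
Indeed, putting the three \(\epsilon /3\) bounds together, with probability tending to one,

$$\begin{aligned} E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) \le \frac{\epsilon }{3}+\frac{\epsilon }{3}+\frac{\epsilon }{3} = \epsilon , \end{aligned}$$

which is exactly the probability statement required at the beginning of the proof.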

Cite this article

Yuan, T., Wang, J. Reduced-rank multi-label classification. Stat Comput 27, 181–191 (2017). https://doi.org/10.1007/s11222-015-9615-0
