Abstract
Multi-label classification is a natural generalization of classical binary classification to problems in which each instance may be assigned multiple class labels. It differs from multi-class classification in that the multiple class labels are not mutually exclusive. The key challenge is to improve classification accuracy by incorporating the intrinsic dependency structure among the multiple class labels. In this article we propose to model the dependency structure via a reduced-rank multi-label classification model, and to enforce a group lasso regularization for sparse estimation. An alternating optimization scheme is developed to facilitate the computation, in which a constrained manifold optimization technique and a gradient descent algorithm are alternated to maximize the resultant regularized log-likelihood. Various simulated examples and two real applications demonstrate the effectiveness of the proposed method. More importantly, its asymptotic behavior is quantified in terms of estimation and variable selection consistency, as well as model selection consistency via the Bayesian information criterion.
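To fix ideas, the following is a minimal sketch of the reduced-rank model and its penalized objective. It assumes a logistic link and conditionally independent labels given the low-rank predictor purely for illustration; the paper's exact likelihood, penalty scaling, and alternating (manifold/gradient) updates are given in the main text, and all names below are illustrative placeholders.

```python
import numpy as np

def penalized_negloglik(X, Y, B, A, c0, lam, B0):
    """Sketch of a reduced-rank multi-label objective.

    X: n x p features, Y: n x q binary labels.
    C = B @ A has rank r, with B (p x r) satisfying B^T B = I_r,
    A (r x q), and intercepts c0 (q,).  An adaptive group-lasso
    penalty acts on the rows of B, with weights 1/||B0^k||_2 taken
    from an initial estimate B0.  The logistic link is an assumption
    made here for illustration only.
    """
    logits = X @ (B @ A) + c0                       # n x q linear predictors
    loglik = np.sum(Y * logits - np.logaddexp(0.0, logits))
    weights = 1.0 / np.linalg.norm(B0, axis=1)      # 1 / ||B0^k||_2
    penalty = lam * np.sum(weights * np.linalg.norm(B, axis=1))
    # The paper alternates a constrained (Stiefel-manifold) update of B
    # with gradient steps for (A, c0); that scheme is not reproduced here.
    return (-loglik + penalty) / X.shape[0]
```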

References
Barker, M., Rayens, W.: Partial least squares for discrimination. J. Chemom. 17, 166–173 (2003)
Barutcuoglu, Z., Schapire, R., Troyanskaya, O.: Hierarchical multi-label prediction of gene function. Bioinformatics 22, 830–836 (2006)
Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognit. 37, 1757–1771 (2004)
Breiman, L., Friedman, J.H.: Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B 59, 3–54 (1997)
Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X. Z., Raich, R., Hadley, S. J. K., Hadley, A. S., Betts, M. G.: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. IEEE International Workshop on Machine Learning for Signal Processing (2012)
Chen, L.S., Huang, J.Z.: Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. 107, 1533–1545 (2012)
Clare, A., King, R.: Knowledge discovery in multi-label phenotype data. 5th European Conference on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Artificial Intelligence, 2168, pp. 42–53, (2001)
Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148–155, (1998)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–353 (1998)
Elisseeff, A., Weston, J.: A kernel method for multi-labeled classification. Adv. Neural Inf. Process. Syst. 14, 681–687 (2002)
Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical Report, National Taiwan University, (2007)
Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5, 248–264 (1975)
Luaces, O., Díez, J., Barranquero, J., José del Coz, J., Bahamonde, A.: Binary relevance efficacy for multilabel classification. Prog. Artif. Intell. 4, 303–313 (2012)
Madjarov, G., Kocev, D., Gjorgjevikj, D., Dzeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognit. 45, 3084–3104 (2012)
Nocedal, J., Yuan, Y.X.: Combining trust region and line search techniques. Adv. Nonlinear Program. 260, 153–175 (1998)
Peters, S., Jacob, Y., Denoyer, L., Gallinari, P.: Iterative multi-label multi-relational classification algorithm for complex social networks. Soc. Netw. Anal. Min. 2, 17–29 (2012)
Ravikumar, P., Wainwright, M.J., Raskutti, G., Yu, B.: High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Stat. 5, 935–980 (2011)
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Springer (2009)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 (2012)
Rothman, A., Bickel, P., Levina, E., Zhu, J.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)
Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Shao, J.: Mathematical Statistics, 2nd edn. Springer, New York (2003)
Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 3, 1–13 (2007)
Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089 (2008)
Wang, H.L.: A note on adaptive group lasso. Comput. Stat. Data Anal. 52, 5277–5286 (2008)
Wang, J., Wang, L.: Sparse supervised dimension reduction in high dimensional classification. Electron. J. Stat. 4, 914–931 (2010)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)
Yu, H. F., Jain, P., Kar P., Dhillon I. S.: Large-scale multi-label learning with missing labels. Proceedings of the 31st International Conference on Machine Learning (2014)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. 68, 49–67 (2006)
Zhang, M.L., Zhou, Z.H.: A lazy learning approach to multi-label learning. Pattern Recognit. 40, 2038–2048 (2007)
Zhou, Z.H., Zhang, M.L.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18, 1338–1351 (2006)
Acknowledgments
JW’s research is partly supported by HK GRF Grant 11302615, CityU SRG Grant 7004244 and CityU Startup Grant 7200380. The authors would like to thank the associate editor and two anonymous referees for their constructive suggestions and comments.
Appendix: Technical proofs
Proof of Theorem 1
Since the factorization of \(({\mathbf{B}}^*,{\mathbf{A}}^*)\) in (2) is not unique as \({\mathbf{B}}^* {\mathbf{A}}^*={\mathbf{B}}^* \varvec{\Lambda } \varvec{\Lambda }^T {\mathbf{A}}^*\) for any orthogonal matrix \(\varvec{\Lambda }\), we denote by \(\mathcal {T}_{{\mathbf{C}}^*}\) the collection of all such \(({\mathbf{B}}^*,{\mathbf{A}}^*)\)’s. In the sequel, \(({\mathbf{B}}^*, {\mathbf{A}}^*)\) refers to any given pair in \(\mathcal {T}_{{\mathbf{C}}^*}\). Let \(\mathbf{\Gamma } = \hbox {vec}(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), \(\mathbf{\Gamma }^* = \hbox {vec}({\mathbf{B}}^*,{\mathbf{A}}^*,{\mathbf{c}}_0^*)\), \(T(\mathbf{\Gamma } ) = l(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), and \(T_p(\mathbf{\Gamma }) = l_p(\mathbf{B},\mathbf{A},\mathbf{c}_0)\). The Taylor expansion of \(T(\mathbf{\Gamma })\) at \(\mathbf{\Gamma }^*\) implies
$$\begin{aligned} T(\mathbf{\Gamma }) = T(\mathbf{\Gamma }^*) + \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) + \frac{1}{2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T H({\tilde{\mathbf{\Gamma }}}) (\mathbf{\Gamma } - \mathbf{\Gamma }^*), \end{aligned}$$
where \(H({\tilde{\mathbf{\Gamma }}})\) is the Hessian matrix and \(\tilde{\mathbf{\Gamma }}\) lies between \(\mathbf{\Gamma }\) and \(\mathbf{\Gamma }^*\).
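As a quick numerical illustration of the rotational non-uniqueness of the factorization noted above (a sanity check with random matrices; it plays no role in the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, q = 6, 2, 4
B, _ = np.linalg.qr(rng.standard_normal((p, r)))    # B with B^T B = I_r
A = rng.standard_normal((r, q))
Lam, _ = np.linalg.qr(rng.standard_normal((r, r)))  # orthogonal r x r rotation

B2, A2 = B @ Lam, Lam.T @ A                         # rotated factor pair
print(np.allclose(B @ A, B2 @ A2))                  # True: same C = B A
print(np.allclose(B2.T @ B2, np.eye(r)))            # True: still orthonormal columns
```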
The proof proceeds as follows. We first construct a neighborhood of \(\mathbf{\Gamma }^*\) as \(N_n(\gamma , \mathbf{\Gamma }^*) = B_n(\gamma , \mathbf{\Gamma }^*) \bigcap \mathcal {M}_{\mathbf{B}}\), where \(B_n(\gamma , \mathbf{\Gamma }^*) = \{\mathbf{\Gamma }: \Vert I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \Vert _2 \le \gamma /\sqrt{n} \}\) and \(\mathcal {M}_{\mathbf{B}} = \{ \mathbf{\Gamma }: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\). Note that \(B_n(\gamma , \mathbf{\Gamma }^*)\) is a connected and closed ellipsoid, and \(\mathcal {M}_{\mathbf{B}}\) is the product of the Stiefel manifold \(\{\mathbf{B}: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\) and \(\mathcal {R}^{q r +q}\), so \(N_n(\gamma , \mathbf{\Gamma }^*)\) is a closed and connected set. Then we show that \(T_p(\mathbf{\Gamma }^*)\) is smaller than \(T_p(\mathbf{\Gamma })\) for any \(\mathbf \Gamma \) on the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*) \), which implies that there exists a local minimizer within \(N_n(\gamma , \mathbf{\Gamma }^*)\). Finally, the desired result follows from the fact that \(\mathbf{\Gamma }^* \in N_n(\gamma , \mathbf{\Gamma }^*) \), so that the distance between the local minimizer and \(\mathbf{\Gamma }^*\) is upper bounded by \(\gamma /\sqrt{n}\).
Let \(\bar{N}_n(\gamma , \mathbf{\Gamma }^*) \) be the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*) \), then for any \(\mathbf{\Gamma } \in \bar{N}_n(\gamma , \mathbf{\Gamma }^*)\),
It follows from the fact that \(\Vert {\mathbf{B}}^{*k}\Vert _2 =0\) for \(k>p_0\) and the Cauchy-Schwarz inequality that
Then we have
Next we bound each term separately. The first term can be bounded as
By Markov’s inequality,
where the last equality follows from the fact that \( I_1(\mathbf{\Gamma }^*)\) is the Fisher information matrix and \(T(\mathbf{\Gamma })\) is the log-likelihood. Then it follows that \(P\left( \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) > -\frac{\gamma ^2}{2} \right) \ge 1-\frac{4(p+q)r^*+4q}{\gamma ^2}\).
Since \(\frac{1}{n} H(\tilde{\mathbf{\Gamma }}) \overset{P}{\rightarrow }I_1(\mathbf{\Gamma }^*)\) as \(n\rightarrow \infty \), the second term can be bounded as
The last term can be bounded as follows. Since \({\mathbf{B}}_0\) is a consistent estimate of some \({\mathbf{B}}^*\), \(\min _{1\le k\le p_0 }\Vert {\mathbf{B}}_0^k \Vert _2 \ge c_3 \) for some \(c_3 >0\). By Assumption C3 there exists \(c_4 > 0\) such that \(\Vert I_1(\mathbf{\Gamma }^*)^{-1/2} \Vert _2 \le c_4\). Along with \(\lambda /\sqrt{n} \rightarrow 0\) as \(n\rightarrow \infty \),
Combining the above bounds, for any \(\eta >0\), we can select \(\gamma \) sufficiently large such that, for any \(\mathbf{\Gamma } \in \bar{N}_n( \gamma , \mathbf{\Gamma ^*})\), \(P\left( T_p(\mathbf{\Gamma }) - T_p(\mathbf{\Gamma }^*) >0 \right) > 1-\eta \); therefore there exists at least one local minimizer \(\widehat{\mathbf{\Gamma }}\) of \(T_p(\cdot )\) inside \({N}_n( \gamma , \mathbf{\Gamma ^*})\), and it follows that \(\Vert \widehat{\mathbf{\Gamma }} - \mathbf{\Gamma }^* \Vert _2 \le O(\gamma /\sqrt{n})\), \(\Vert \widehat{\mathbf{c}}_0 - {\mathbf{c}}_0^* \Vert \le O(\gamma /\sqrt{n}) \), \(\Vert \widehat{\mathbf{A}} - \mathbf{A}^* \Vert _F \le O(\gamma /\sqrt{n})\), as well as \(\Vert \widehat{\mathbf{B}} - \mathbf{B}^* \Vert _F \le O(\gamma /\sqrt{n})\). This completes the proof of Theorem 1. \(\square \)
Proof of Theorem 2
First we note that the active set induced by \(\widehat{\mathbf{C}}\) is the same as that induced by \(\widehat{\mathbf{B}}\), in the sense that \(\Vert \widehat{\mathbf{C}}^k\Vert = 0\) if and only if \(\Vert \widehat{\mathbf{B}}^k\Vert = 0\). We now prove this theorem by contradiction. Suppose that there exists some \(k>p_0\) such that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 > 0\). Denote \({\mathbf{G}} = \frac{\partial l_p(\cdot )}{\partial \mathbf{B}}\); then the first-order Karush-Kuhn-Tucker condition on \(\widehat{\mathbf{B}}\in \mathcal {M}_{r^*}^p\) yields \( \widehat{\mathbf{G}} \widehat{\mathbf{B}}^T =\widehat{\mathbf{B}} \widehat{\mathbf{G}}^T\) (Wen and Yin 2013), leading to \(\widehat{\mathbf{G}} = \widehat{\mathbf{B}}\widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\) given that \(\widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = {\mathbf{I}}_{r^*}\). That is, for any k, \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\), where \(\widehat{\mathbf{G}}^k\) and \(\widehat{\mathbf{B}}^k\) are the k-th rows of \(\widehat{\mathbf{G}}\) and \(\widehat{\mathbf{B}}\), respectively. We will then show that \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) and \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) are of different magnitudes, leading to a contradiction.
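The algebraic step above can be checked numerically by constructing a pair satisfying the stated symmetry condition; the snippet below is only an illustrative sanity check of the identity, not part of the argument:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 6, 2
B, _ = np.linalg.qr(rng.standard_normal((p, r)))  # B^T B = I_r
M = rng.standard_normal((r, r)); M = M + M.T      # symmetric r x r
G = B @ M                                         # one G satisfying G B^T = B G^T

print(np.allclose(G @ B.T, B @ G.T))              # True: symmetry condition holds
print(np.allclose(G, B @ (G.T @ B)))              # True: hence G = B G^T B
```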
On one hand, we have
where the k-th row \(\frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k} = \Vert {\mathbf{B}}_0^k \Vert _2^{-1} \frac{{\widehat{\mathbf{B}}}^k}{\Vert {\widehat{\mathbf{B}}}^k \Vert _2}\), and \(\tilde{\mathbf{B}}^{k}\) is between \({\widehat{\mathbf{B}}}^k\) and \({\mathbf{B}}^{*k}\). By Theorem 1, \(\widehat{\mathbf{B}}\) and \(\widehat{\mathbf{A}}\) are \(\sqrt{n}\)-consistent estimates of some \(\mathbf{B}^*\) and \(\mathbf{A}^*\) in \(\mathcal {T}_{{\mathbf{C}}^*}\), and \(\widehat{\mathbf{c}}_0\) is a \(\sqrt{n}\)-consistent estimate of \({\mathbf{c}}_0^*\); therefore \(n^{-1} H(\tilde{\mathbf{B}}^{k}) = I_1({\mathbf{B}}^{*k}) + O_p(1/\sqrt{n})\), and \(n^{-1} \frac{ \partial l ({\mathbf{B}}^*,\widehat{\mathbf{A}},\widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } - S_1({\mathbf{B}}^{*k}) = O_p(1/\sqrt{n})\), where the score function \(S_1({\mathbf{B}}^{*k}) = \frac{1}{n} \frac{ \partial }{\partial {\mathbf{B}}^k} E \left( l ({\mathbf{B}}^*,{\mathbf{A}}^*, {\mathbf{c}}^*_0 ) \right) =0\). Consequently, we have
Furthermore, as \(\Vert \mathbf{B}_0^k\Vert =O_p(n^{-1/2})\) and \(I_1({\mathbf{B}}^*)\) is positive definite, \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) is of the same order as \(O_p(n) \Vert \widehat{\mathbf{B}}^k\Vert _2\).
On the other hand,
Then \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le \Vert \widehat{\mathbf{B}}^k\Vert _2 \big \Vert \big ( O_p(\sqrt{n}) + \big ( n I_1({\mathbf{B}}^{*})+ O_p(\sqrt{n}) \big ) ({\widehat{\mathbf{B}}} - {{\mathbf{B}}^*} )^T+ \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^T} \big ) \widehat{\mathbf{B}} \big \Vert _F\). By Theorem 1, we have \(\Vert \widehat{\mathbf{B}} - {\mathbf{B}}^* \Vert _F = O_p(1/\sqrt{n})\), and thus \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le O_p(\lambda \sqrt{n}) \Vert \widehat{\mathbf{B}}^k\Vert _2\). Since \(\Vert \widehat{\mathbf{B}}^k\Vert _2 >0\) and \(\lambda /\sqrt{n} \rightarrow 0\), it can be concluded that \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) is of smaller magnitude than \(\Vert \widehat{\mathbf{G}}^k\Vert _2\), which contradicts the fact that \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\). This implies that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 = 0\) for all \(k>p_0\) and completes the proof. \(\square \)
Proof of Lemma 1
First we have
Since \(\lambda = \lambda _n = \log n\) and \(r = r^*\) satisfy the conditions in Theorems 1 and 2, \((\widehat{\mathbf{c}}_0)_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{c}}_0^*\) and \(\widehat{\mathbf{C}}_{\lambda _n, r^*} = \widehat{\mathbf{B}}_{\lambda _n, r^*}\widehat{\mathbf{A}}_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{C}}^*\). Therefore \(P\left( \hbox {df}_{\widehat{\mathbf{C}}_{\lambda _n, r^*}} =\hbox {df}_{{\mathbf{C}}^* } \right) \rightarrow 1\), as well as \( \frac{l \left( \widehat{\mathbf{C}}_{\lambda _n, r^*}, (\widehat{\mathbf{c}}_0)_{\lambda _n, r^*} \right) }{n} - \frac{l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \overset{P}{\rightarrow }0 \), which completes the proof. \(\square \)
Proof of Lemma 2
The proof proceeds by cases:
(i) For \((\lambda ,r)\in \Omega _{+}\) such that \(\hbox {df}_{\widehat{\mathbf{C}}_{ {\lambda }, {r}}} > \hbox {df}_{{\mathbf{C}}^*}\), from (11) we have
$$\begin{aligned}&\hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \ge \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) - l\left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \\&\quad +\frac{\log n}{n}\ge \frac{l({\mathbf{C}}_{m}, {\mathbf{c}}_{0m} )- l({\mathbf{C}}^*,{\mathbf{c}}^*_0)}{n} +\frac{\log n}{n}, \end{aligned}$$
where \(\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) \) denotes the minimizer of \(l(\cdot )\). By classical asymptotic theory, since p and q are fixed, \(- 2l\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) + 2l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) \overset{D}{\rightarrow }\chi ^2_{(p+1)q}\) as \(n\rightarrow \infty \); since this likelihood-ratio term is thus \(O_p(1)\) while \(\log n \rightarrow \infty \), it follows that
$$\begin{aligned} P \left( \inf _{(\lambda ,r)\in \Omega _{+}} ~ \hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} > 0 \right) \rightarrow 1. \end{aligned}$$
(ii) For \((\lambda ,r)\in \Omega _{-}\), we denote \({\mathcal {C}}_{-}= \{ (\mathbf{C},\mathbf{c}_0) : \mathcal {A}_{{\mathbf{C}}} \nsupseteq \mathcal {A}_{{\mathbf{C}}^*}, ~\text {or} ~ \mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{{\mathbf{C}}^*} ~ \text {and} ~ r< r^*\}\), so that \(({\widehat{\mathbf{C}}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} ) \in {\mathcal {C}}_{-}\). In (11), for any \((\mathbf{C},\mathbf{c}_0) \in {\mathcal {C}}_{-}\), since the degrees-of-freedom terms are finite, as \(n \rightarrow \infty \),
$$\begin{aligned}&\hbox {BIC}_{\mathbf{C},\mathbf{c}_0} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \overset{P}{\rightarrow }E(l_1(\mathbf{C},{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})) \\&\quad - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) = \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*)^T\\&\qquad \times \frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}~ \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*), \end{aligned}$$
where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Next we will show that \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \) by considering the following two cases.
(a) For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}\), we have \(\Vert \mathbf{C}^* - \mathbf{C}\Vert _F \ge \Vert {\mathbf{C}}^{*k} - {\mathbf{C}}^{k}\Vert _2 = \Vert {\mathbf{C}}^{*k} \Vert _2 >0\) for any k such that \(\mathbf{x}_{(k)} \in \mathcal {A}^c_{\mathbf{C}} \bigcap \mathcal {A}_{\mathbf{C}^*}\); hence \(\inf _{\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}} \Vert \mathbf{C}- {\mathbf{C}}^* \Vert _F >0\).
(b) For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*}\) and \(\hbox {rank}({\mathbf{C}}) < r^*\), Weyl's inequality gives \(\Vert {\mathbf{C}}^* - {\mathbf{C}}\Vert _2 \ge \sigma _{r^*}({\mathbf{C}}^*) >0\), where \(\sigma _{r^*}({\mathbf{C}}^*)\) denotes the \(r^*\)-th largest singular value of \({\mathbf{C}}^*\); hence \(\Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _F \ge \Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _2 >0\) and \(\inf _{\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*},\hbox {rank}({\mathbf{C}}) < r^* } \Vert \mathbf{C}- \mathbf{C}^* \Vert _F > 0\).
Combining cases (a) and (b), we have \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \), which, together with the fact that \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2} \) is positive definite, implies the desired result. \(\square \)
Proof of Theorem 3
We only need to show that, for any \(\epsilon > 0\), \(P \left( E(l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})) - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) \le \epsilon \right) \rightarrow 1\). Then by the proof of Lemma 2,
where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Since \(\epsilon \) is arbitrary and \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}\) is positive definite with finite eigenvalues, this completes the proof. \(\square \)
From Lemma 1, for any \(\epsilon >0\), as \(n\rightarrow \infty \), \(P\big (\hbox {BIC}_{\hat{\lambda },\hat{r}} \le \hbox {BIC}_{{\lambda _n},r^*} \le \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} + \epsilon /3 \big ) \rightarrow 1\). Furthermore,
When n is large enough, with probability tending to 1, \(E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} }, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - \hbox {BIC}_{{\hat{\lambda },\hat{r}} } \le \epsilon /3\) and \(\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}-E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}^*_0, \mathbf{X},\mathbf{Y})\right) \le \epsilon /3\), which implies the result.
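In practice, the BIC-based tuning analyzed in Lemmas 1-2 and Theorem 3 amounts to a grid search over \((\lambda , r)\). The sketch below assumes a generic criterion of the form "negative log-likelihood over n plus degrees of freedom times \(\log n / n\)"; the paper's exact criterion is its Eq. (11), and `fit_reduced_rank` is a hypothetical solver standing in for the alternating algorithm:

```python
import numpy as np

def select_by_bic(X, Y, lambdas, ranks, fit_reduced_rank):
    """Schematic (lambda, r) selection by BIC.

    fit_reduced_rank(X, Y, lam, r) is assumed to return the maximized
    log-likelihood and the degrees of freedom of the fitted model; both
    this signature and the BIC form below are illustrative placeholders.
    """
    n = X.shape[0]
    best_pair, best_bic = None, np.inf
    for r in ranks:
        for lam in lambdas:
            loglik, df = fit_reduced_rank(X, Y, lam, r)
            bic = -loglik / n + df * np.log(n) / n   # generic BIC form
            if bic < best_bic:
                best_pair, best_bic = (lam, r), bic
    return best_pair
```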
About this article
Cite this article
Yuan, T., Wang, J. Reduced-rank multi-label classification. Stat Comput 27, 181–191 (2017). https://doi.org/10.1007/s11222-015-9615-0