Reduced-rank multi-label classification

Abstract

Multi-label classification is a natural generalization of classical binary classification in which each instance can be assigned multiple class labels. It differs from multi-class classification in that the multiple class labels are not mutually exclusive. The key challenge is to improve the classification accuracy by incorporating the intrinsic dependency structure among the multiple class labels. In this article we propose to model the dependency structure via a reduced-rank multi-label classification model, and to impose a group lasso regularization for sparse estimation. An alternating optimization scheme is developed to facilitate the computation, in which a constrained manifold optimization technique and a gradient descent algorithm are alternated to maximize the resultant regularized log-likelihood. Various simulated examples and two real applications are presented to demonstrate the effectiveness of the proposed method. More importantly, its asymptotic behavior is quantified in terms of the estimation and variable selection consistencies, as well as the model selection consistency via the Bayesian information criterion.
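
To make the setup concrete, the following is a minimal numerical sketch, not the authors' implementation, of the kind of estimator described above: a per-label Bernoulli (logistic) model with a reduced-rank coefficient matrix \(\mathbf{C}=\mathbf{B}\mathbf{A}\), \(\mathbf{B}\) constrained to have orthonormal columns, and an adaptive group-lasso penalty on the rows of \(\mathbf{B}\). The Bernoulli likelihood, the plain gradient steps, the QR retraction, and all function and variable names are illustrative assumptions; the paper alternates a constrained manifold optimization technique with gradient descent, which the simple retracted step below only mimics.

```python
# A sketch of reduced-rank multi-label classification with a group-lasso
# penalty: q binary labels, coefficient matrix C = B A with B (p x r)
# orthonormal and A (r x q) unconstrained, plus intercepts c0 (length q).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_negloglik(X, Y, B, A, c0, lam, w):
    """Bernoulli negative log-likelihood plus an adaptive group-lasso
    penalty on the rows of B (weights w play the role of 1/||B0^k||)."""
    P = sigmoid(c0 + X @ (B @ A))                       # n x q success probabilities
    nll = -np.sum(Y * np.log(P + 1e-12) + (1.0 - Y) * np.log(1.0 - P + 1e-12))
    return nll + lam * np.sum(w * np.linalg.norm(B, axis=1))

def fit_alternating(X, Y, r, lam, n_iter=200, step=1e-3, seed=0):
    """Alternate gradient steps for (A, c0) with a QR-retracted gradient
    step for B on the Stiefel manifold {B : B^T B = I_r}.  Sketch only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    B, _ = np.linalg.qr(rng.standard_normal((p, r)))    # random orthonormal start
    A = 0.01 * rng.standard_normal((r, q))
    c0 = np.zeros(q)
    # Adaptive weights; in the paper they come from a consistent initial
    # estimate B0, here the starting value is used purely for illustration.
    w = 1.0 / np.maximum(np.linalg.norm(B, axis=1), 1e-8)
    for _ in range(n_iter):
        R = sigmoid(c0 + X @ (B @ A)) - Y               # gradient of the nll w.r.t. the linear predictor
        A -= step * (B.T @ X.T @ R)                     # gradient step for A
        c0 -= step * R.sum(axis=0)                      # gradient step for the intercepts
        # Euclidean gradient for B with a smoothed group-lasso subgradient.
        row_norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-8)
        G = X.T @ R @ A.T + lam * w[:, None] * (B / row_norms)
        # Descent step followed by a QR retraction back to the manifold,
        # with column signs fixed so the factor stays close to the old B.
        Q, Rq = np.linalg.qr(B - step * G)
        B = Q * np.where(np.diag(Rq) < 0.0, -1.0, 1.0)
    return B, A, c0
```

For an n x p design matrix X and an n x q binary label matrix Y, a call such as B, A, c0 = fit_alternating(X, Y, r=2, lam=1.0) returns the estimated factors; rows of B with near-zero norm correspond to predictors screened out by the group-lasso penalty.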


References

  • Barker, M., Rayens, W.: Partial least squares for discrimination. J. Chemom. 17, 166–173 (2003)

  • Barutcuoglu, Z., Schapire, R., Troyanskaya, O.: Hierarchical multi-label prediction of gene function. Bioinformatics 22, 830–836 (2006)

  • Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognit. 37, 1757–1771 (2004)

  • Breiman, L., Friedman, J.H.: Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc. Ser. B 59, 3–54 (1997)

  • Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X. Z., Raich, R., Hadley, S. J. K., Hadley, A. S., Betts, M. G.: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. IEEE International Workshop on Machine Learning for Signal Processing (2012)

  • Chen, L.S., Huang, J.H.: Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Am. Stat. Assoc. 107, 1533–1545 (2012)

  • Clare, A., King, R.: Knowledge discovery in multi-label phenotype data. 5th European Conference on Principles of Data Mining and Knowledge Discovery. Lecture Notes in Artificial Intelligence, vol. 2168, pp. 42–53 (2001)

  • Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148–155 (1998)

  • Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–353 (1998)

  • Elisseeff, A., Weston, J.: A kernel method for multi-labeled classification. Adv. Neural Inf. Process. Syst. 14, 681–687 (2002)

  • Fan, R.E., Lin, C.J.: A study on threshold selection for multi-label classification. Technical Report, National Taiwan University (2007)

  • Izenman, A.J.: Reduced-rank regression for the multivariate linear model. J. Multivar. Anal. 5, 248–264 (1975)

  • Luaces, O., Díez, J., Barranquero, J., del Coz, J.J., Bahamonde, A.: Binary relevance efficacy for multilabel classification. Prog. Artif. Intell. 4, 303–313 (2012)

  • Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognit. 45, 3084–3104 (2012)

  • Nocedal, J., Yuan, Y.X.: Combining trust region and line search techniques. Adv. Nonlinear Program. 260, 153–175 (1998)

  • Peters, S., Jacob, Y., Denoyer, L., Gallinari, P.: Iterative multi-label multi-relational classification algorithm for complex social networks. Soc. Netw. Anal. Min. 2, 17–29 (2012)

  • Ravikumar, P., Wainwright, M.J., Raskutti, G., Yu, B.: High-dimensional covariance estimation by minimizing \(\ell _1\)-penalized log-determinant divergence. Electron. J. Stat. 5, 935–980 (2011)

  • Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Springer (2009)

  • Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873 (2012)

  • Rothman, A., Bickel, P., Levina, E., Zhu, J.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)

  • Shao, J.: Mathematical Statistics, 2nd edn. Springer, New York (2003)

  • Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)

  • Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 3, 1–13 (2007)

  • Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089 (2011)

  • Wang, H.L.: A note on adaptive group lasso. Comput. Stat. Data Anal. 52, 5277–5286 (2008)

  • Wang, J., Wang, L.: Sparse supervised dimension reduction in high dimensional classification. Electron. J. Stat. 4, 914–931 (2010)

  • Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)

  • Yu, H.F., Jain, P., Kar, P., Dhillon, I.S.: Large-scale multi-label learning with missing labels. Proceedings of the 31st International Conference on Machine Learning (2014)

  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67 (2006)

  • Zhang, M.L., Zhou, Z.H.: A lazy learning approach to multi-label learning. Pattern Recognit. 40, 2038–2048 (2007)

  • Zhou, Z.H., Zhang, M.L.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18, 1338–1351 (2006)

Acknowledgments

JW’s research is partly supported by HK GRF Grant 11302615, CityU SRG Grant 7004244 and CityU Startup Grant 7200380. The authors would like to thank the associate editor and two anonymous referees for their constructive suggestions and comments.

Author information

Corresponding author

Correspondence to Ting Yuan.

Appendix: Technical proofs

Proof of Theorem 1

Since the factorization of \(({\mathbf{B}}^*,{\mathbf{A}}^*)\) in (2) is not unique, as \({\mathbf{B}}^* {\mathbf{A}}^*=({\mathbf{B}}^* \varvec{\Lambda })( \varvec{\Lambda }^T {\mathbf{A}}^*)\) for any orthogonal matrix \(\varvec{\Lambda }\), we denote by \(\mathcal {T}_{{\mathbf{C}}^*}\) the collection of all such \(({\mathbf{B}}^*,{\mathbf{A}}^*)\)’s. In the sequel, \(({\mathbf{B}}^*, {\mathbf{A}}^*)\) refers to any given pair in \(\mathcal {T}_{{\mathbf{C}}^*}\). Let \(\mathbf{\Gamma } = \hbox {vec}(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), \(\mathbf{\Gamma }^* = \hbox {vec}({\mathbf{B}}^*,{\mathbf{A}}^*,{\mathbf{c}}_0^*)\), \(T(\mathbf{\Gamma } ) = l(\mathbf{B},\mathbf{A},\mathbf{c}_0)\), and \(T_p(\mathbf{\Gamma }) = l_p(\mathbf{B},\mathbf{A},\mathbf{c}_0)\). A Taylor expansion of \(T(\mathbf{\Gamma })\) at \(\mathbf{\Gamma }^*\) gives
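
Here \(\mathbf{\Gamma }\) stacks the pr entries of \(\mathbf{B}\), the qr entries of \(\mathbf{A}\) and the q intercepts in \(\mathbf{c}_0\), so that

$$\begin{aligned} \dim (\mathbf{\Gamma }) = pr + qr + q = (p+q)r + q, \end{aligned}$$

which, evaluated at \(r = r^*\), is the parameter count appearing in the probability bound below.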

$$\begin{aligned} T(\mathbf{\Gamma }) =&\, T(\mathbf{\Gamma }^*) + \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*)\\&+ (\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T \frac{1}{2} H(\tilde{\mathbf{\Gamma }}) (\mathbf \Gamma - \mathbf \Gamma ^*), \end{aligned}$$

where \(H({\tilde{\mathbf{\Gamma }}})\) is the Hessian matrix and \(\tilde{\mathbf{\Gamma }}\) lies between \(\mathbf{\Gamma }\) and \(\mathbf{\Gamma }^*\).

The proof proceeds as follows. We first construct a neighborhood of \(\mathbf{\Gamma }^*\) as \(N_n(\gamma , \mathbf{\Gamma }^*) = B_n(\gamma , \mathbf{\Gamma }^*) \bigcap \mathcal {M}_{\mathbf{B}}\), where \(B_n(\gamma , \mathbf{\Gamma }^*) = \{\mathbf{\Gamma }: \Vert I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \Vert _2 \le \gamma /\sqrt{n} \}\) and \(\mathcal {M}_{\mathbf{B}} = \{ \mathbf{\Gamma }: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\). Note that \(B_n(\gamma , \mathbf{\Gamma }^*)\) is a closed and connected ellipsoid, and \(\mathcal {M}_{\mathbf{B}}\) is the product of the Stiefel manifold \(\{\mathbf{B}: {\mathbf{B}}^T {\mathbf{B}} = {\mathbf{I}}_{r^*}\}\) and \(\mathcal {R}^{q r +q}\); therefore \(N_n(\gamma , \mathbf{\Gamma }^*)\) is a closed and connected set. Then we show that \(T_p(\mathbf{\Gamma }^*)\) is smaller than \(T_p(\mathbf{\Gamma })\) for any \(\mathbf{\Gamma }\) on the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*)\), which implies that there exists a local minimizer within \(N_n(\gamma , \mathbf{\Gamma }^*)\). Finally, the desired result follows from the fact that \(\mathbf{\Gamma }^* \in N_n(\gamma , \mathbf{\Gamma }^*)\), so that the distance between the local minimizer and \(\mathbf{\Gamma }^*\) is bounded above by \(\gamma /\sqrt{n}\).
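
For reference, the penalized criterion being compared on \(N_n(\gamma , \mathbf{\Gamma }^*)\), as can be read off from the difference displayed below and from the penalty gradient used in the proof of Theorem 2, is

$$\begin{aligned} T_p(\mathbf{\Gamma }) = l_p(\mathbf{B},\mathbf{A},\mathbf{c}_0) = l(\mathbf{B},\mathbf{A},\mathbf{c}_0) + \lambda \sum _{k=1}^{p} \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert {\mathbf{B}}^k\Vert _2, \end{aligned}$$

where \({\mathbf{B}}^k\) and \({\mathbf{B}}_0^k\) denote the k-th rows of \(\mathbf{B}\) and of the initial estimate \({\mathbf{B}}_0\), respectively.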

Let \(\bar{N}_n(\gamma , \mathbf{\Gamma }^*) \) denote the boundary of \(N_n(\gamma , \mathbf{\Gamma }^*) \); then for any \(\mathbf{\Gamma } \in \bar{N}_n(\gamma , \mathbf{\Gamma }^*)\),

$$\begin{aligned} T_p(\mathbf{\Gamma }) - T_p(\mathbf{\Gamma }^*)&= T(\mathbf{\Gamma } ) - T(\mathbf{\Gamma }^*)\\&\quad + \sum _{k=1}^p \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 - \Vert {\mathbf{B}}^{*k}\Vert _2 \right) . \end{aligned}$$

It follows from the fact that \(\Vert {\mathbf{B}}^{*k}\Vert _2 =0\) for \(k>p_0\) and from the triangle inequality that

$$\begin{aligned}&\sum _{k=1}^p \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 - \Vert {\mathbf{B}}^{*k}\Vert _2 \right) \\&\quad \ge \sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \left( \Vert {\mathbf{B}}^k\Vert _2 -\Vert {\mathbf{B}}^{*k}\Vert _2 \right) \\&\quad \ge -\sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert {\mathbf{B}}^k-{\mathbf{B}}^{*k}\Vert _2. \end{aligned}$$

Then we have

$$\begin{aligned}&T_p(\mathbf{\Gamma } ) - T_p(\mathbf{\Gamma }^*) \\&\quad \ge T(\mathbf{\Gamma } ) - T(\mathbf{\Gamma }^*) - \sum _{k=1}^{p_0} \lambda \Vert {\mathbf{B}}_0^k \Vert ^{-1}_2 \Vert \mathbf{\Gamma } - \mathbf{\Gamma }^*\Vert _2 \\&\quad \ge \frac{ \partial { T(\mathbf{\Gamma }^*)}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) + \frac{1}{2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T H(\tilde{\mathbf{\Gamma }}) (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \\&\qquad -\frac{\lambda p_0 }{\sqrt{n}\, \min _{1\le k \le p_0} \Vert {\mathbf{B}}_0^k \Vert _2 } \Vert I_1(\mathbf{\Gamma }^*)^{-1/2}\Vert _2 \, \Vert \sqrt{n}\, I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\Vert _2. \end{aligned}$$

Next we bound each term separately. The first term can be bounded as

$$\begin{aligned}&\frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) \\&\quad = \left( \sqrt{n} I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \right) ^T \left( {n} I_1(\mathbf{\Gamma }^*) \right) ^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \\&\quad \ge -\gamma \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2. \end{aligned}$$

By Markov’s inequality,

$$\begin{aligned}&P\left( \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2 \le \frac{\gamma }{2} \right) \\&\quad \ge 1 - \frac{4}{\gamma ^2} E \left\| n^{-1/2} I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right\| _2^2 \\&\quad = 1 - \frac{4}{\gamma ^2} E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T} n^{-1} I_1(\mathbf{\Gamma }^*)^{-1} \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right) \\&\quad = 1 - \frac{4}{\gamma ^2} \dim (\mathbf{\Gamma ^*}) = 1-\frac{4(p+q)r^*+4q}{\gamma ^2}, \end{aligned}$$

where the last equality follows from the fact that \( I_1(\mathbf{\Gamma }^*)\) is the Fisher information matrix and \(T(\mathbf{\Gamma })\) is the log-likelihood; this step is spelled out in the display below. It then follows that \(P\left( \frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}(\mathbf{\Gamma } - \mathbf{\Gamma }^*) > -\frac{\gamma ^2}{2} \right) \ge 1-\frac{4(p+q)r^*+4q}{\gamma ^2}\).
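
Spelling out the last equality: the score vector \(\frac{ \partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }}\) has mean zero and covariance \(n I_1(\mathbf{\Gamma }^*)\), so writing the quadratic form as a trace gives

$$\begin{aligned} E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T}\, n^{-1} I_1(\mathbf{\Gamma }^*)^{-1}\, \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }} \right) = \hbox {tr}\left( n^{-1} I_1(\mathbf{\Gamma }^*)^{-1}\, E\left( \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }}\, \frac{\partial { T(\mathbf{\Gamma ^*})}}{\partial \mathbf{\Gamma }^T} \right) \right) = \hbox {tr}\left( {\mathbf{I}}_{\dim (\mathbf{\Gamma }^*)} \right) = \dim (\mathbf{\Gamma }^*). \end{aligned}$$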

Since \(\frac{1}{n} H(\tilde{\mathbf{\Gamma }}) \overset{P}{\rightarrow }I_1(\mathbf{\Gamma }^*)\) as \(n\rightarrow \infty \), the second term can be bounded as

$$\begin{aligned}&\frac{1}{2}(\mathbf{\Gamma } - \mathbf{\Gamma }^*)^T H(\tilde{\mathbf{\Gamma }}) (\mathbf \Gamma - \mathbf \Gamma ^*) \\&=\frac{1}{2} \left( \sqrt{n}I_1(\mathbf{\Gamma ^*})^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\right) ^T I_1(\mathbf{\Gamma }^*)^{-1/2} \frac{1}{n} H(\tilde{\mathbf{\Gamma }}) \\&\qquad {\times } I_1(\mathbf{\Gamma }^*)^{-1/2} \left( \sqrt{n}I_1(\mathbf \Gamma ^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*) \right) \overset{P}{\rightarrow }\frac{\gamma ^2}{2}. \end{aligned}$$

The last term can be bounded as follows. Since \({\mathbf{B}}_0\) is a consistent estimate of some \({\mathbf{B}}^*\), \(\min _{1\le k\le p_0 }\Vert {\mathbf{B}}_0^k \Vert _2 \ge c_3 \) for some \(c_3 >0\). By Assumption C3 there exists \(c_4 > 0\) such that \(\Vert I_1(\mathbf{\Gamma }^*)^{-1/2} \Vert _2 \le c_4\). Along with \(\lambda /\sqrt{n} \rightarrow 0\) as \(n\rightarrow \infty \),

$$\begin{aligned}&\frac{\lambda p_0 }{\sqrt{n} \min _{1\le k \le p_0} \Vert {\mathbf{B}}_0^k \Vert _2 } \Vert I_1(\mathbf{\Gamma }^*)^{-1/2}\Vert _2 \Vert \sqrt{n} I_1(\mathbf{\Gamma }^*)^{1/2} (\mathbf{\Gamma } - \mathbf{\Gamma }^*)\Vert _2\\&\quad \le c_4 p_0 \gamma \lambda ({\sqrt{n}} c_3)^{-1} \overset{P}{\rightarrow }0. \end{aligned}$$

Combining the above bounds, for any \(\eta >0\), we can select \(\gamma \) sufficiently large such that for any \(\mathbf{\Gamma } \in \bar{N}_n( \gamma , \mathbf{\Gamma ^*})\), \(P\left( T_p(\mathbf{\Gamma }) - T_p(\mathbf{\Gamma }^*) >0 \right) > 1-\eta \). Therefore there exists at least one local minimizer \(\widehat{\mathbf{\Gamma }}\) of \(T_p(\cdot )\) inside \({N}_n( \gamma , \mathbf{\Gamma ^*})\), and it follows that \(\Vert \widehat{\mathbf{\Gamma }} - \mathbf{\Gamma }^* \Vert _2 \le O(\gamma /\sqrt{n})\), \(\Vert \widehat{\mathbf{c}}_0 - {\mathbf{c}}_0^* \Vert _2 \le O(\gamma /\sqrt{n}) \), \(\Vert \widehat{\mathbf{A}} - \mathbf{A}^* \Vert _F \le O(\gamma /\sqrt{n})\), as well as \(\Vert \widehat{\mathbf{B}} - \mathbf{B}^* \Vert _F \le O(\gamma /\sqrt{n})\). This completes the proof of Theorem 1. \(\square \)

Proof of Theorem 2

First we note that the active set induced by \(\widehat{\mathbf{C}}\) is the same as that induced by \(\widehat{\mathbf{B}}\), in the sense that \(\Vert \widehat{\mathbf{C}}^k\Vert = 0\) if and only if \(\Vert \widehat{\mathbf{B}}^k\Vert = 0\). We now prove this theorem by contradiction. Suppose that there exists some \(k>p_0\) such that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 > 0\). Denote \({\mathbf{G}} = \frac{\partial l_p(\cdot )}{\partial \mathbf{B}}\); then the first-order Karush–Kuhn–Tucker condition on \(\widehat{\mathbf{B}}\in \mathcal {M}_{r^*}^p\) yields \( \widehat{\mathbf{G}} \widehat{\mathbf{B}}^T =\widehat{\mathbf{B}} \widehat{\mathbf{G}}^T\) (Wen and Yin 2013), which, given that \(\widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = {\mathbf{I}}_{r^*}\), leads to \(\widehat{\mathbf{G}} = \widehat{\mathbf{B}}\widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\), as spelled out in the display below. That is, for any k, \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\), where \(\widehat{\mathbf{G}}^k\) and \(\widehat{\mathbf{B}}^k\) are the k-th rows of \(\widehat{\mathbf{G}}\) and \(\widehat{\mathbf{B}}\), respectively. We will then show that \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) and \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) are of different magnitudes, leading to a contradiction.
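
To spell out this step: right-multiplying \(\widehat{\mathbf{G}} \widehat{\mathbf{B}}^T =\widehat{\mathbf{B}} \widehat{\mathbf{G}}^T\) by \(\widehat{\mathbf{B}}\) and using \(\widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = {\mathbf{I}}_{r^*}\) gives

$$\begin{aligned} \widehat{\mathbf{G}} = \widehat{\mathbf{G}}\, \widehat{\mathbf{B}}^T \widehat{\mathbf{B}} = \big (\widehat{\mathbf{G}}\, \widehat{\mathbf{B}}^T\big ) \widehat{\mathbf{B}} = \widehat{\mathbf{B}}\, \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}, \end{aligned}$$

and reading off the k-th row yields \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\).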

On one hand, we have

$$\begin{aligned} \widehat{\mathbf{G}}^k = \frac{ \partial l_p(\widehat{\mathbf{B}},\widehat{\mathbf{A}}, \widehat{\mathbf{c}}_0 )}{\partial {\mathbf{B}}^k}&= \frac{ \partial l (\widehat{\mathbf{B}},\widehat{\mathbf{A}}, \widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k} \\&= \frac{ \partial l ({\mathbf{B}}^*,\widehat{\mathbf{A}},\widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } + \big ( \widehat{\mathbf{B}}^k - {\mathbf{B}}^{*k} \big ) H(\tilde{\mathbf{B}}^k) \\&\quad + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k}, \end{aligned}$$

where the k-th row \(\frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^k} = \Vert {\mathbf{B}}_0^k \Vert _2^{-1} \frac{{\widehat{\mathbf{B}}}^k}{\Vert {\widehat{\mathbf{B}}}^k \Vert _2}\), and \(\tilde{\mathbf{B}}^{k}\) is between \({\widehat{\mathbf{B}}}^k\) and \({\mathbf{B}}^{*k}\). By Theorem 1, \(\widehat{\mathbf{B}}\) and \(\widehat{\mathbf{A}}\) are \(\sqrt{n}\)-consistent estimates of some \(\mathbf{B}^*\) and \(\mathbf{A}^*\) in \(\mathcal {T}_{{\mathbf{C}}^*}\), and \(\widehat{\mathbf{c}}_0\) is a \(\sqrt{n}\)-consistent estimate of \({\mathbf{c}}_0^*\); then \(n^{-1} H(\tilde{\mathbf{B}}^{k}) = I_1({\mathbf{B}}^{*k}) + O_p(1/\sqrt{n})\), and \(n^{-1} \frac{ \partial l ({\mathbf{B}}^*,\widehat{\mathbf{A}},\widehat{\mathbf{c}}_0)}{\partial {\mathbf{B}}^k } - S_1({\mathbf{B}}^{*k}) = O_p(1/\sqrt{n})\), where the score function \(S_1({\mathbf{B}}^{*k}) = \frac{1}{n} \frac{ \partial }{\partial {\mathbf{B}}^k} E \left( l ({\mathbf{B}}^*,{\mathbf{A}}^*, {\mathbf{c}}^*_0 ) \right) =0\). Consequently, we have

$$\begin{aligned} \widehat{\mathbf{G}}^k \!= \!O_p(\sqrt{n}) \!+\! {\widehat{\mathbf{B}}}^k \left( n I_1({\mathbf{B}}^{*k}) \!+ \!O_p(\sqrt{n}) \right) \!+ \!\frac{ \lambda {\widehat{\mathbf{B}}}^k }{\Vert {\mathbf{B}}_0^k \Vert _2 \Vert {\widehat{\mathbf{B}}}^k\Vert _2}. \end{aligned}$$

Furthermore, as \(\Vert \mathbf{B}_0^k\Vert _2 =O_p(n^{-1/2})\) and \(I_1({\mathbf{B}}^*)\) is positive definite, \(\Vert \widehat{\mathbf{G}}^k\Vert _2\) is of the same order as \(O_p(n) \Vert \widehat{\mathbf{B}}^k\Vert _2\).

On the other hand,

$$\begin{aligned} \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}&= \widehat{\mathbf{B}}^k \left( O_p(\sqrt{n}) \!+ \!\left( n I_1({\mathbf{B}}^{*}) \!+ \!O_p(\sqrt{n}) \right) ({\widehat{\mathbf{B}}} - {{\mathbf{B}}^*} )^T\right. \\&\qquad \qquad \left. + \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^T} \right) \widehat{\mathbf{B}}. \end{aligned}$$

Then \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le \Vert \widehat{\mathbf{B}}^k\Vert _2 \big \Vert \big ( O_p(\sqrt{n}) + \big ( n I_1({\mathbf{B}}^{*})+ O_p(\sqrt{n}) \big ) ({\widehat{\mathbf{B}}} - {{\mathbf{B}}^*} )^T+ \lambda \frac{\partial J(\widehat{\mathbf{B}})}{\partial {\mathbf{B}}^T} \big ) \widehat{\mathbf{B}} \big \Vert _F\). By Theorem 1, we have \(\Vert \widehat{\mathbf{B}} - {\mathbf{B}}^* \Vert _F = O_p(1/\sqrt{n})\), and thus \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2 \le O_p(\lambda \sqrt{n}) \Vert \widehat{\mathbf{B}}^k\Vert _2\). Since \(\Vert \widehat{\mathbf{B}}^k\Vert _2 >0\) and \(\lambda /\sqrt{n} \rightarrow 0\), it can be concluded that \(\Vert \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\Vert _2\) is of smaller magnitude than \(\Vert \widehat{\mathbf{G}}^k\Vert _2\), which contradicts the fact that \(\widehat{\mathbf{G}}^k = \widehat{\mathbf{B}}^k \widehat{\mathbf{G}}^T \widehat{\mathbf{B}}\). This implies that \(\Vert \widehat{\mathbf{B}}^k \Vert _2 = 0\) for all \(k>p_0\) and completes the proof. \(\square \)

Proof of Lemma 1

First we have

$$\begin{aligned} \hbox {BIC}_{\lambda , r} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}&= \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) }{n} - \frac{l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \nonumber \\&\qquad + \frac{\log n}{n}\left( \hbox {df}_{\widehat{\mathbf{C}}_{\lambda , r}} -\hbox {df}_{\widehat{\mathbf{C}}^*} \right) . \end{aligned}$$
(11)
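
Here, as can be read off from (11) and up to an additive constant common to all models, the criterion under comparison takes the form

$$\begin{aligned} \hbox {BIC}_{\lambda , r} = \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) }{n} + \frac{\log n}{n}\, \hbox {df}_{\widehat{\mathbf{C}}_{\lambda , r}}, \end{aligned}$$

with \(\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}\) defined analogously with \(({\mathbf{C}}^*, {\mathbf{c}}_0^*)\) in place of the estimates.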

Since \(\lambda = \lambda _n = \log n\) and \(r = r^*\) satisfy the conditions in Theorems 1 and 2, \((\widehat{\mathbf{c}}_0)_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{c}}_0^*\) and \(\widehat{\mathbf{C}}_{\lambda _n, r^*} = \widehat{\mathbf{B}}_{\lambda _n, r^*}\widehat{\mathbf{A}}_{\lambda _n, r^*}\) is a consistent estimate of \({\mathbf{C}}^*\). Therefore \(P\left( \hbox {df}_{\widehat{\mathbf{C}}_{\lambda _n, r^*}} =\hbox {df}_{\widehat{\mathbf{C}}^* } \right) \rightarrow 1\) and \( \frac{l \left( \widehat{\mathbf{C}}_{\lambda _n, r^*}, (\widehat{\mathbf{c}}_0)_{\lambda _n, r^*} \right) }{n} - \frac{l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \overset{P}{\rightarrow }0 \), which completes the proof. \(\square \)

Proof of Lemma 2

The proof proceeds by cases:

  1. (i)

    For \((\lambda ,r)\in \Omega _{+}\) such that \(\hbox {df}_{\widehat{\mathbf{C}}_{ {\lambda }, {r}}} > \hbox {df}_{{\mathbf{C}}^*}\), from (11) we have

    $$\begin{aligned}&\hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \ge \frac{l \left( \widehat{\mathbf{C}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} \right) - l\left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) }{n} \\&\quad +\frac{\log n}{n}\ge \frac{l({\mathbf{C}}_{m}, {\mathbf{c}}_{0m} )- l({\mathbf{C}}^*,{\mathbf{c}}^*_0)}{n} +\frac{\log n}{n}, \end{aligned}$$

    where \(\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) \) denotes the minimizer of \(l(\cdot )\). By classical asymptotic theory, since p and q are fixed, as \(n\rightarrow \infty \), \(- 2l\left( {\mathbf{C}}_m, {\mathbf{c}}_{0m}\right) + 2l \left( {\mathbf{C}}^*, {\mathbf{c}}_0^* \right) \overset{D}{\rightarrow }\chi ^2_{(p+1)q}\), so that \(n^{-1}\left( l({\mathbf{C}}_{m}, {\mathbf{c}}_{0m} )- l({\mathbf{C}}^*,{\mathbf{c}}^*_0)\right) = O_p(n^{-1})\) while \(\log n \rightarrow \infty \); it then follows that

    $$\begin{aligned} P \left( \inf _{(\lambda ,r)\in \Omega _{+}} ~ \hbox {BIC}_{{\lambda , r}} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} > 0 \right) \rightarrow 1. \end{aligned}$$
  2. (ii)

    For \((\lambda ,r)\in \Omega _{-}\), we denote \({\mathcal {C}}_{-}= \{ (\mathbf{C},\mathbf{c}_0) : \mathcal {A}_{{\mathbf{C}}} \nsupseteq \mathcal {A}_{{\mathbf{C}}^*}, ~\text {or}~ \mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{{\mathbf{C}}^*} ~\text {and}~ r< r^*\}\); then \(({\widehat{\mathbf{C}}}_{\lambda , r}, (\widehat{\mathbf{c}}_0)_{\lambda , r} ) \in {\mathcal {C}}_{-}\). In (11), for any \((\mathbf{C},\mathbf{c}_0) \in {\mathcal {C}}_{-}\), since the degrees-of-freedom terms are finite, as \(n \rightarrow \infty \),

    $$\begin{aligned}&\hbox {BIC}_{\mathbf{C},\mathbf{c}_0} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} \overset{P}{\rightarrow }E(l_1(\mathbf{C},{\mathbf{c}}_0,\mathbf{x},\mathbf{y})) \\&\quad - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) = \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*)^T\\&\qquad \times \frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}~ \hbox {vec}(\mathbf{C}-{\mathbf{C}}^*,{\mathbf{c}}_0-{\mathbf{c}}_0^*), \end{aligned}$$

    where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Next we will show \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \) by considering the following two cases.

    1. (a)

      For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}\), we have \(\Vert \mathbf{C}^* - \mathbf{C}\Vert _F \ge \Vert {\mathbf{C}}^{*k} - {\mathbf{C}}^{k}\Vert _2 = \Vert {\mathbf{C}}^{*k} \Vert _2 >0\) for any k such that \(\mathbf{x}_{(k)} \in \mathcal {A}^c_{\mathbf{C}} \bigcap \mathcal {A}_{\mathbf{C}^*}\); hence \(\inf _{\mathcal {A}_{\mathbf{C}} \nsupseteq \mathcal {A}_{\mathbf{C}^*}} \Vert \mathbf{C}- {\mathbf{C}}^* \Vert _F >0\).

    2. (b)

      For those \((\mathbf{C},\mathbf{c}_0) \in \mathcal {C}_{-}\) with \(\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*}\) and \(\hbox {rank}({\mathbf{C}}) < r^*\), the \(r^*\)-th singular value of \(\mathbf{C}\) vanishes while that of \({\mathbf{C}}^*\) does not, so by Weyl’s inequality for singular values \(\Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _F \ge \Vert {\mathbf{C}} - {\mathbf{C}}^* \Vert _2 \ge \sigma _{r^*}({\mathbf{C}}^*) >0\), and hence \(\inf _{\mathcal {A}_{\mathbf{C}} \supseteq \mathcal {A}_{\mathbf{C}^*},\hbox {rank}({\mathbf{C}}) < r^* } \Vert \mathbf{C}- \mathbf{C}^* \Vert _F > 0\).

Combining both cases, we have \(\inf _{(\mathbf{C},\mathbf{c}_0)\in \mathcal {C}_{-}} \Vert \mathbf{C}- \mathbf{C}^* \Vert _F >0 \), which, together with the fact that \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2} \) is positive definite, bounds the limit of \(\hbox {BIC}_{\mathbf{C},\mathbf{c}_0} - \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}\) away from zero uniformly over \(\mathcal {C}_{-}\) and implies the desired result. \(\square \)

Proof of Theorem 3

We just need to show that for any \(\epsilon > 0\), \(P \left( E(l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})) - E(l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})) \le \epsilon \right) \rightarrow 1\). Then, by the proof of Lemma 2,

$$\begin{aligned}&P\left( \Vert \widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} - {\mathbf{C}}^* \Vert _F+ \Vert (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}} - {\mathbf{c}}_0^*\Vert _2 \le \phi _{\min }\right. \\&\quad \left. \left( \frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2} \right) ^{-1} \epsilon ^{1/2} \right) \rightarrow 1, \end{aligned}$$

where \((\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0)\) is between \((\mathbf{C},\mathbf{c}_0)\) and \(({\mathbf{C}}^*,{\mathbf{c}}^*_0)\). Since \(\epsilon \) is arbitrary and \(\frac{\partial ^2 E\left( l_1(\tilde{\mathbf{C}},\tilde{\mathbf{c}}_0,\mathbf{X},\mathbf{Y})\right) }{\partial {(\mathbf{C},\mathbf{c}_0)}^2}\) is positive definite with finite eigenvalues, this completes the proof. \(\square \)

It remains to verify the probability statement above. From Lemma 1, for any \(\epsilon >0\), as \(n\rightarrow \infty \), \(P\big (\hbox {BIC}_{\hat{\lambda },\hat{r}} \le \hbox {BIC}_{{\lambda _n},r^*} \le \hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*} + \epsilon /3 \big ) \rightarrow 1\). Furthermore,

$$\begin{aligned}&E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) \\&= E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} },(\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}},\mathbf{X},\mathbf{Y})\right) - \hbox {BIC}_{{\hat{\lambda },\hat{r}} } + \hbox {BIC}_{{\hat{\lambda },\hat{r}} }\\&\qquad -\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}+\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}-E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) . \end{aligned}$$

When n is large enough, with probability tending to 1, \(E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}} }, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - \hbox {BIC}_{{\hat{\lambda },\hat{r}} } \le \epsilon /3\) and \(\hbox {BIC}_{{\mathbf{C}}^*,{\mathbf{c}}_0^*}-E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}^*_0, \mathbf{X},\mathbf{Y})\right) \le \epsilon /3\), and combining these with the bound from Lemma 1 implies the result, as shown below.
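
Indeed, putting the three \(\epsilon /3\) bounds together, with probability tending to one,

$$\begin{aligned} E\left( l_1({\widehat{\mathbf{C}}_{\hat{\lambda },\hat{r}}}, (\widehat{\mathbf{c}}_0)_{\hat{\lambda },\hat{r}}, \mathbf{X},\mathbf{Y})\right) - E\left( l_1({\mathbf{C}}^*,{\mathbf{c}}_0^*,\mathbf{X},\mathbf{Y})\right) \le \frac{\epsilon }{3}+\frac{\epsilon }{3}+\frac{\epsilon }{3} = \epsilon , \end{aligned}$$

which is exactly the probability statement required at the beginning of the proof.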

Cite this article

Yuan, T., Wang, J. Reduced-rank multi-label classification. Stat Comput 27, 181–191 (2017). https://doi.org/10.1007/s11222-015-9615-0
