Abstract
Developing methods for designing good classifiers from labeled samples whose distribution differs from that of the test samples is an important and challenging research issue in machine learning and its applications. This paper focuses on designing semi-supervised classifiers with a high generalization ability by using unlabeled samples drawn from the same distribution as the test samples, and presents a semi-supervised learning method based on a hybrid discriminative and generative model. Although JESS-CM is one of the most successful semi-supervised classifier design frameworks based on a hybrid approach, it suffers from overfitting in the task setting that we consider in this paper. We propose an objective function that utilizes both labeled and unlabeled samples for the discriminative training of hybrid classifiers, and we expect this objective function to mitigate the overfitting problem. We show the effect of the objective function through theoretical analysis and empirical evaluation. Our experimental results for text classification on four typical benchmark test collections confirmed that, in our task setting, the proposed method outperformed the JESS-CM framework in most cases. We also confirmed experimentally that the proposed method was useful for obtaining better performance when classifying data samples into either known or unknown classes, i.e., classes that are or are not included in the given labeled samples, respectively.






Notes
Although the JESS-CM framework was applied to structured-output labeling tasks, such as sequence labeling and dependency parsing, in the original papers, we review the JESS-CM framework for multi-class, single-label problems in order to simplify the comparison between the hybrid frameworks of JESS-CM and our proposed method.
Original JESS-CM classifiers are constructed by using multiple generative models. Since the method for combining and training the discriminative function and the generative models does not depend on the number of generative models, \(J\), we present the JESS-CM framework with \(J=1\) to simplify the discussion.
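For intuition only, a common way to combine a log-linear discriminative function with a single generative model (the \(J=1\) case) in a JESS-CM-style hybrid is to treat the log-likelihood of the generative model as one additional weighted feature of the discriminative model. The following form is a generic sketch, not the paper's exact equation:
\[
P(y \mid {\varvec{{x}}}; \Lambda , \theta ) \;\propto\; \exp \Bigl( \sum _i \lambda _i f_i({\varvec{{x}}}, y) \;+\; \lambda _0 \log p({\varvec{{x}}}, y; \theta ) \Bigr),
\]
where the \(f_i\) are discriminative features with weights \(\lambda _i\), and \(p({\varvec{{x}}}, y; \theta )\) is the generative model whose log-likelihood enters as an extra feature with weight \(\lambda _0\).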
In our experiments, we employed fixed initial values computed by using labeled and unlabeled samples, as described in Sect. 5.2.
The latest version of UniverSVM can be downloaded from http://mloss.org/software/view/19/.
Under our experimental settings, where the number of labeled samples was smaller than the number of unlabeled samples (e.g., \(N=500\) vs. \(M=2500\)), the number of vocabulary words appearing in a labeled document set, \(V_l\), was usually smaller than the number appearing in an unlabeled document set, \(V_u\). Therefore, \(r_l\) was larger than \(r_u\), as shown in Table 1. This difference between \(V_l\) and \(V_u\) also meant that \(V_l+V_u-V_b\) was close to \(V_u\), and therefore \(r_a\) was close to \(r_u\).
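As a purely hypothetical illustration (these counts are not taken from Table 1): with \(V_l = 3{,}000\), \(V_u = 20{,}000\), and \(V_b = 2{,}800\),
\[
V_l + V_u - V_b = 3{,}000 + 20{,}000 - 2{,}800 = 20{,}200 \approx V_u ,
\]
so whenever \(V_l \ll V_u\), the combined vocabulary size is dominated by the unlabeled side.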
References
Agarwal A, Daumé III H (2009) Exponential family hybrid semi-supervised learning. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 974–979
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Bickel S, Brückner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. In: Proceedings of the 24th international conference on machine learning (ICML 2007), pp 81–88
Blitzer J, Foster D, Kakade S (2011) Domain adaptation with coupled subspaces. In: Proceedings of the 14th international conference on artificial intelligence and statistics (AISTATS 2011), pp 173–181
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 120–128
Bouchard G (2007) Bias-variance tradeoff in hybrid generative-discriminative models. In: Proceedings of the sixth international conference on machine learning and applications (ICMLA’07), pp 124–129
Bouchard G, Triggs B (2004) The tradeoff between generative and discriminative classifiers. In: Proceedings of the IASC international symposium on computational statistics, pp 721–728
Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge
Chen SF, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Carnegie Mellon University, Technical report
Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712
Dai W, Xue GR, Yang Q, Yu Y (2007) Transferring naive Bayes classifiers for text classification. In: Proceedings of the 22nd national conference on artificial intelligence (AAAI-07), pp 540–545
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175
Druck G, McCallum A (2010) High-performance semi-supervised learning using discriminatively constrained generative models. In: Proceedings of the 27th international conference on machine learning (ICML 2010), pp 319–326
Druck G, Pal C, Zhu X, McCallum A (2007) Semi-supervised classification with hybrid generative/discriminative methods. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07), pp 280–289
Fujino A, Ueda N, Nagata M (2010) A robust semi-supervised classification method for transfer learning. In: Proceedings of the 19th ACM international conference on information and knowledge management (CIKM’10), pp 379–388
Fujino A, Ueda N, Saito K (2008) Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Trans Pattern Anal Mach Intell (TPAMI) 30(3): 424–437
Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems 17. MIT Press, Cambridge, pp 529–536
Jiang J (2007) A literature survey on domain adaptation of statistical classifiers. http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/
Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th annual meeting of the association of computational linguistics (ACL-07), pp 264–271
Lasserre JA, Bishop CM, Minka TP (2006) Principled hybrids of generative and discriminative models. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), pp 87–94
Liang P, Jordan MI (2008) An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In: Proceedings of the 25th international conference on machine learning (ICML 2008), pp 584–591
Ling X, Dai W, Xue GR, Yang Q, Yu Y (2008) Spectral domain-transfer learning. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08), pp 488–496
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program Ser B 45(3):503–528
Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134
Pan SJ, Tsang IW, Kwok JT, Yang Q (2009) Domain adaptation via transfer component analysis. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1187–1192
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Seeger M (2001) Learning with labeled and unlabeled data. University of Edinburgh, Technical report
Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Inference 90(2):227–244
Sugiyama M, Müller KR (2005) Input-dependent estimation of generalization error under covariate shift. Stat Decis 23(4):249–279
Suzuki J, Isozaki H (2008) Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: Proceedings of the 46th annual meeting of the association of computational linguistics (ACL-08), pp 665–673
Suzuki J, Isozaki H, Carreras X, Collins M (2009) An empirical study of semi-supervised structured conditional models for dependency parsing. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 551–560
Vapnik V (1999) The nature of statistical learning theory, 2nd edn. Springer, New York
Wang Z, Song Y, Zhang C (2009) Knowledge transfer on hybrid graph. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1291–1296
Wu D, Lee WS, Ye N, Chieu HL (2009) Domain adaptive bootstrapping for named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 1523–1532
Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 114–121
Zhu X (2005) Semi-supervised learning literature survey. Technical report, University of Wisconsin
Appendix
1.1 Derivation of objective function for parameter estimation
We derive Eq. (14) from Eq. (12). By substituting Eq. (7) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for labeled samples, \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\), and substituting Eq. (13) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for unlabeled samples, \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), we can transform Eq. (8) to
By substituting Eqs. (7) and (13) for \(P(k|{\varvec{{x}}})\) in Eq. (11) for \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\) and \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), respectively, we can transform Eq. (10) to
By substituting these equations for \(J_d(W)\) and \(J_g(\Theta )\) in Eq. (12), we can obtain Eq. (14).
1.2 Proof of inequality about \(Q\)-function
We prove the inequality, \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\), described in Sect. 4.2. From Eq. (14), we can obtain the equation,
According to Eqs. (15)–(17), we can transform the above equation to
Since \(\log b \le b - 1\), \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ,\beta ) = 1\), and \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) = 1\), the desired inequality, \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\), follows.
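These three facts combine in the standard way used for EM-type bounds: they imply that a Kullback–Leibler-type sum is non-negative. As a generic sketch, write \(p^{(t)}_k\) and \(p_k\) for two distributions over the \(K\) classes (in the proof, the posteriors \(P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta )\) and \(P(k|{\varvec{{x}}}_m;\Psi ,\beta )\) play these roles); applying \(\log b \le b - 1\) with \(b = p_k / p^{(t)}_k\) gives
\[
\sum _{k=1}^{K} p^{(t)}_k \log \frac{p^{(t)}_k}{p_k}
\;=\; -\sum _{k=1}^{K} p^{(t)}_k \log \frac{p_k}{p^{(t)}_k}
\;\ge\; -\sum _{k=1}^{K} p^{(t)}_k \Bigl( \frac{p_k}{p^{(t)}_k} - 1 \Bigr)
\;=\; 1 - \sum _{k=1}^{K} p_k \;=\; 0 .
\]
Summing such per-sample non-negative terms over the unlabeled samples then gives a bound of the form \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\).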
1.3 Proof that \(g_d\) is concave
If the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) shown in Eq. (16) is negative semidefinite, \(g_d (W,\Psi ^{(t)})\) is a concave function with respect to \(W\). We prove that the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) is negative semidefinite when applying the MLR model and Gaussian prior described in Sect. 4.4.
Using the MLR model and Gaussian prior, \(P_d(y|{\varvec{{x}}};W) = \exp \left( {\varvec{{w}}}_y^T {\varvec{{x}}}\right)/\sum _{k=1}^K \exp \left( {\varvec{{w}}}_k^T {\varvec{{x}}}\right)\) and \(p(W) = \prod _{k=1}^K \exp \left(-{\varvec{{w}}}_k^T {\varvec{{w}}}_k / 2 \sigma ^2\right)\), the objective function, \(g_d (W,\Psi ^{(t)})\), shown in Eq. (16) is rewritten as
To obtain the Hessian matrix \(\left[\partial ^2 g_d/\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^{\prime }}^T\right]_{k,k^{\prime }}\) of \(g_d\), we partially differentiate \(g_d\) with respect to \({\varvec{{w}}}_k\) such that
where \(I_{y_n} (k)\) is an indicator function such that \(I_{y_n} (k) = 1\) if \(k = y_n\) and \(I_{y_n} (k) = 0\) otherwise. Then, we partially differentiate \(\partial g_d/\partial {\varvec{{w}}}_k\) with respect to \({\varvec{{w}}}_{k^{\prime }}\) such that
where \({\varvec{{I}}}_V\) is the \((V \times V)\)-dimensional identity matrix, and \(V\) is consistent with the dimension of \({\varvec{{w}}}_k\). Then, for arbitrary \(VK\)-dimensional vector \({\varvec{{u}}}=({\varvec{{u}}}_{1}^{T},\ldots ,{\varvec{{u}}}_{k}^{T},\ldots ,{\varvec{{u}}}_{K}^{T})^T\), where \({\varvec{{u}}}_k = (u_{k1},\ldots ,u_{ki},\ldots ,u_{kV})^{T}\),
because \(\sum _{k=1}^K P_d (k|{\varvec{{x}}};W) = 1\) and \(P_d (k|{\varvec{{x}}};W) \ge 0\). When \({\varvec{{u}}} \ne \mathbf{0}\), \({\varvec{{u}}}^{T} \left[\partial ^2 g_d /\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^\prime }^T \right]_{k,k^\prime } {\varvec{{u}}} < 0\) for arbitrary \(W\). This shows that the Hessian matrix of \(g_d(W,\Psi ^{(t)})\) with respect to \(W\) is negative semidefinite, and hence that \(g_d (W,\Psi ^{(t)})\) is concave in \(W\).
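As a numerical sanity check of this block structure (not of the paper's exact \(g_d\): Eq. (16) may contain additional terms, e.g. over unlabeled samples, that are omitted here), the following sketch builds the Hessian blocks of a labeled-data MLR log-likelihood plus the Gaussian prior and verifies that \({\varvec{{u}}}^{T} H {\varvec{{u}}} < 0\) for random nonzero \({\varvec{{u}}}\). The data, dimensions, and \(\sigma \) below are arbitrary assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, K, sigma = 50, 10, 4, 1.0           # samples, features, classes, prior std (arbitrary)
X = rng.normal(size=(N, V))               # feature vectors x_n
W = rng.normal(size=(K, V))               # current MLR weights w_k

# Class posteriors P_d(k | x_n; W) of the multinomial logistic regression (MLR) model
logits = X @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Hessian of sum_n log P_d(y_n | x_n; W) + log p(W) as a K*V x K*V block matrix:
#   block (k, k') = -sum_n P_nk (delta_{kk'} - P_nk') x_n x_n^T - delta_{kk'} I_V / sigma^2
H = np.zeros((K * V, K * V))
for k in range(K):
    for kp in range(K):
        block = np.zeros((V, V))
        for n in range(N):
            coeff = -P[n, k] * ((k == kp) - P[n, kp])
            block += coeff * np.outer(X[n], X[n])
        if k == kp:
            block -= np.eye(V) / sigma**2  # Gaussian prior contribution
        H[k * V:(k + 1) * V, kp * V:(kp + 1) * V] = block

# u^T H u should be strictly negative for any nonzero u
for _ in range(5):
    u = rng.normal(size=K * V)
    print(u @ H @ u)                                        # all printed values should be < 0
print("max eigenvalue:", np.linalg.eigvalsh(H).max())       # should also be < 0
```

In this sketch, the strict negativity comes from the \(-{\varvec{{I}}}_V/\sigma ^2\) prior blocks; the log-likelihood part on its own is only negative semidefinite.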
Cite this article
Fujino, A., Ueda, N. & Nagata, M. Adaptive semi-supervised learning on labeled and unlabeled data with different distributions. Knowl Inf Syst 37, 129–154 (2013). https://doi.org/10.1007/s10115-012-0576-8