Abstract
Developing methods for designing good classifiers from labeled samples whose distribution differs from that of the test samples is an important and challenging research issue in machine learning and its applications. This paper focuses on designing semi-supervised classifiers with a high generalization ability by using unlabeled samples drawn from the same distribution as the test samples, and presents a semi-supervised learning method based on a hybrid discriminative and generative model. Although JESS-CM is one of the most successful semi-supervised classifier design frameworks based on a hybrid approach, it suffers from overfitting in the task setting that we consider in this paper. We propose an objective function that utilizes both labeled and unlabeled samples for the discriminative training of hybrid classifiers, and we expect this objective function to mitigate the overfitting problem. We show the effect of the objective function through theoretical analysis and empirical evaluation. Our experimental results for text classification on four typical benchmark test collections confirmed that, in our task setting, the proposed method outperformed the JESS-CM framework in most cases. We also confirmed experimentally that the proposed method was useful for obtaining better performance when classifying data samples into either known or unknown classes, i.e., classes that are or are not included in the given labeled samples, respectively.






Notes
Although the JESS-CM framework was applied to structured-output labeling tasks, such as sequence labeling and dependency parsing, in the original papers, we review the JESS-CM framework for multi-class, single-label problems in order to simplify the comparison between the hybrid frameworks of JESS-CM and our proposed method.
Original JESS-CM classifiers are constructed by using multiple generative models. Since the method for combining and training the discriminative function and the generative models does not depend on the number of generative models, \(J\), we present the JESS-CM framework with \(J=1\) to simplify the discussion.
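For intuition only, a common way to combine a log-linear discriminative function with a single generative model (the \(J=1\) case) in a JESS-CM-style hybrid is to treat the log-likelihood of the generative model as one additional weighted feature of the discriminative model. The following form is a generic sketch, not the paper's exact equation:
\[
P(y \mid {\varvec{{x}}}; \Lambda , \theta ) \;\propto\; \exp \Bigl( \sum _i \lambda _i f_i({\varvec{{x}}}, y) \;+\; \lambda _0 \log p({\varvec{{x}}}, y; \theta ) \Bigr),
\]
where the \(f_i\) are discriminative features with weights \(\lambda _i\), and \(p({\varvec{{x}}}, y; \theta )\) is the generative model whose log-likelihood enters as an extra feature with weight \(\lambda _0\).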
In our experiments, we employed fixed initial values computed by using labeled and unlabeled samples, as described in Sect. 5.2.
The latest version of UniverSVM can be downloaded from http://mloss.org/software/view/19/.
Under our experimental settings, where the number of labeled samples was smaller than the number of unlabeled samples (e.g., \(N=500\) vs. \(M=2500\)), the number of vocabulary words appearing in a labeled document set, \(V_l\), was usually smaller than the number appearing in an unlabeled document set, \(V_u\). Therefore, \(r_l\) was larger than \(r_u\), as shown in Table 1. This difference between \(V_l\) and \(V_u\) also meant that \(V_l+V_u-V_b\) was close to \(V_u\), and therefore \(r_a\) was close to \(r_u\).
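As a purely hypothetical illustration (these counts are not taken from Table 1): with \(V_l = 3{,}000\), \(V_u = 20{,}000\), and \(V_b = 2{,}800\),
\[
V_l + V_u - V_b = 3{,}000 + 20{,}000 - 2{,}800 = 20{,}200 \approx V_u ,
\]
so whenever \(V_l \ll V_u\), the combined vocabulary size is dominated by the unlabeled side.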
References
Agarwal A, Daumé III H (2009) Exponential family hybrid semi-supervised learning. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 974–979
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Bickel S, Brückner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. In: Proceedings of the 24th international conference on machine learning (ICML 2007), pp 81–88
Blitzer J, Foster D, Kakade S (2011) Domain adaptation with coupled subspaces. In: Proceedings of the 14th international conference on artificial intelligence and statistics (AISTATS 2011), pp 173–181
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 120–128
Bouchard G (2007) Bias-variance tradeoff in hybrid generative-discriminative models. In: Proceedings of the sixth international conference on machine learning and applications (ICMLA’07), pp 124–129
Bouchard G, Triggs B (2004) The tradeoff between generative and discriminative classifiers. In: Proceedings of the IASC international symposium on computational statistics, pp 721–728
Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge
Chen SF, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Carnegie Mellon University, Technical report
Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712
Dai W, Xue GR, Yang Q, Yu Y (2007) Transferring naive Bayes classifiers for text classification. In: Proceedings of the 22nd national conference on artificial intelligence (AAAI-07), pp 540–545
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175
Druck G, McCallum A (2010) High-performance semi-supervised learning using discriminatively constrained generative models. In: Proceedings of the 27th international conference on machine learning (ICML 2010), pp 319–326
Druck G, Pal C, Zhu X, McCallum A (2007) Semi-supervised classification with hybrid generative/discriminative methods. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07), pp 280–289
Fujino A, Ueda N, Nagata M (2010) A robust semi-supervised classification method for transfer learning. In: Proceedings of the 19th ACM international conference on information and knowledge management (CIKM’10), pp 379–388
Fujino A, Ueda N, Saito K (2008) Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Trans Pattern Anal Mach Intell (TPAMI) 30(3): 424–437
Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems 17. MIT Press, Cambridge, pp 529–536
Jiang J (2007) A literature survey on domain adaptation of statistical classifiers. http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/
Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th annual meeting of the association of computational linguistics (ACL-07), pp 264–271
Lasserre JA, Bishop CM, Minka TP (2006) Principled hybrids of generative and discriminative models. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), pp 87–94
Liang P, Jordan MI (2008) An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In: Proceedings of the 25th international conference on machine learning (ICML 2008), pp 584–591
Ling X, Dai W, Xue GR, Yang Q, Yu Y (2008) Spectral domain-transfer learning. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08), pp 488–496
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program Ser B 45(3):503–528
Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134
Pan SJ, Tsang IW, Kwok JT, Yang Q (2009) Domain adaptation via transfer component analysis. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1187–1192
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Seeger M (2001) Learning with labeled and unlabeled data. University of Edinburgh, Technical report
Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Inference 90(2):227–244
Sugiyama M, Müller KR (2005) Input-dependent estimation of generalization error under covariate shift. Stat Decis 23(4):249–279
Suzuki J, Isozaki H (2008) Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: Proceedings of the 46th annual meeting of the association of computational linguistics (ACL-08), pp 665–673
Suzuki J, Isozaki H, Carreras X, Collins M (2009) An empirical study of semi-supervised structured conditional models for dependency parsing. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 551–560
Vapnik V (1999) The nature of statistical learning theory, 2nd edn. Springer, New York
Wang Z, Song Y, Zhang C (2009) Knowledge transfer on hybrid graph. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1291–1296
Wu D, Lee WS, Ye N, Chieu HL (2009) Domain adaptive bootstrapping for named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 1523–1532
Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 114–121
Zhu X (2005) Semi-supervised learning literature survey. Technical report, University of Wisconsin
Appendix
1.1 Derivation of objective function for parameter estimation
We derive Eq. (14) from Eq. (12). By substituting Eq. (7) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for labeled samples, \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\), and substituting Eq. (13) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for unlabeled samples, \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), we can transform Eq. (8) to
By substituting Eqs. (7) and (13) for \(P(k|{\varvec{{x}}})\) in Eq. (11) for \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\) and \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), respectively, we can transform Eq. (10) to
By substituting these equations for \(J_d(W)\) and \(J_g(\Theta )\) in Eq. (12), we can obtain Eq. (14).
1.2 Proof of inequality about \(Q\)-function
We prove the inequality, \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\), described in Sect. 4.2. From Eq. (14), we can obtain the equation,
According to Eqs. (15)–(17), we can transform the above equation to
Since \(\log b \le b - 1\), \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ,\beta ) = 1\), and \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) = 1\), the desired inequality, \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\), follows.
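These three facts combine in the standard way used for EM-type bounds: they imply that a Kullback–Leibler-type sum is non-negative. As a generic sketch, write \(p^{(t)}_k\) and \(p_k\) for two distributions over the \(K\) classes (in the proof, the posteriors \(P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta )\) and \(P(k|{\varvec{{x}}}_m;\Psi ,\beta )\) play these roles); applying \(\log b \le b - 1\) with \(b = p_k / p^{(t)}_k\) gives
\[
\sum _{k=1}^{K} p^{(t)}_k \log \frac{p^{(t)}_k}{p_k}
\;=\; -\sum _{k=1}^{K} p^{(t)}_k \log \frac{p_k}{p^{(t)}_k}
\;\ge\; -\sum _{k=1}^{K} p^{(t)}_k \Bigl( \frac{p_k}{p^{(t)}_k} - 1 \Bigr)
\;=\; 1 - \sum _{k=1}^{K} p_k \;=\; 0 .
\]
Summing such per-sample non-negative terms over the unlabeled samples then gives a bound of the form \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\).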
1.3 Proof that \(g_d\) is concave
If the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) shown in Eq. (16) is negative semidefinite, \(g_d (W,\Psi ^{(t)})\) is a concave function with respect to \(W\). We prove that the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) is negative semidefinite when applying the MLR model and Gaussian prior described in Sect. 4.4.
Using the MLR model and Gaussian prior, \(P_d(y|{\varvec{{x}}};W) = \exp \left( {\varvec{{w}}}_y^T {\varvec{{x}}}\right)/\sum _{k=1}^K \exp \left( {\varvec{{w}}}_k^T {\varvec{{x}}}\right)\) and \(p(W) = \prod _{k=1}^K \exp \left(-{\varvec{{w}}}_k^T {\varvec{{w}}}_k / 2 \sigma ^2\right)\), the objective function, \(g_d (W,\Psi ^{(t)})\), shown in Eq. (16) is rewritten as
To obtain the Hessian matrix \(\left[\partial ^2 g_d/\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^{\prime }}^T\right]_{k,k^{\prime }}\) of \(g_d\), we partially differentiate \(g_d\) with respect to \({\varvec{{w}}}_k\) such that
where \(I_{y_n} (k)\) is an indicator function such that \(I_{y_n} (k) = 1\) if \(k = y_n\) and \(I_{y_n} (k) = 0\) otherwise. Then, we partially differentiate \(\partial g_d/\partial {\varvec{{w}}}_k\) with respect to \({\varvec{{w}}}_{k^{\prime }}\) such that
where \({\varvec{{I}}}_V\) is the \((V \times V)\)-dimensional identity matrix, and \(V\) is consistent with the dimension of \({\varvec{{w}}}_k\). Then, for arbitrary \(VK\)-dimensional vector \({\varvec{{u}}}=({\varvec{{u}}}_{1}^{T},\ldots ,{\varvec{{u}}}_{k}^{T},\ldots ,{\varvec{{u}}}_{K}^{T})^T\), where \({\varvec{{u}}}_k = (u_{k1},\ldots ,u_{ki},\ldots ,u_{kV})^{T}\),
because \(\sum _{k=1}^K P_d (k|{\varvec{{x}}};W) = 1\) and \(P_d (k|{\varvec{{x}}};W) \ge 0\). When \({\varvec{{u}}} \ne \mathbf{0}\), \({\varvec{{u}}}^{T} \left[\partial ^2 g_d /\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^\prime }^T \right]_{k,k^\prime } {\varvec{{u}}} < 0\) for arbitrary \(W\). This shows that the Hessian matrix of \(g_d(W,\Psi ^{(t)})\) with respect to \(W\) is negative semidefinite, and hence that \(g_d (W,\Psi ^{(t)})\) is concave in \(W\).
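As a numerical sanity check of this block structure (not of the paper's exact \(g_d\): Eq. (16) may contain additional terms, e.g. over unlabeled samples, that are omitted here), the following sketch builds the Hessian blocks of a labeled-data MLR log-likelihood plus the Gaussian prior and verifies that \({\varvec{{u}}}^{T} H {\varvec{{u}}} < 0\) for random nonzero \({\varvec{{u}}}\). The data, dimensions, and \(\sigma \) below are arbitrary assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, K, sigma = 50, 10, 4, 1.0           # samples, features, classes, prior std (arbitrary)
X = rng.normal(size=(N, V))               # feature vectors x_n
W = rng.normal(size=(K, V))               # current MLR weights w_k

# Class posteriors P_d(k | x_n; W) of the multinomial logistic regression (MLR) model
logits = X @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Hessian of sum_n log P_d(y_n | x_n; W) + log p(W) as a K*V x K*V block matrix:
#   block (k, k') = -sum_n P_nk (delta_{kk'} - P_nk') x_n x_n^T - delta_{kk'} I_V / sigma^2
H = np.zeros((K * V, K * V))
for k in range(K):
    for kp in range(K):
        block = np.zeros((V, V))
        for n in range(N):
            coeff = -P[n, k] * ((k == kp) - P[n, kp])
            block += coeff * np.outer(X[n], X[n])
        if k == kp:
            block -= np.eye(V) / sigma**2  # Gaussian prior contribution
        H[k * V:(k + 1) * V, kp * V:(kp + 1) * V] = block

# u^T H u should be strictly negative for any nonzero u
for _ in range(5):
    u = rng.normal(size=K * V)
    print(u @ H @ u)                                        # all printed values should be < 0
print("max eigenvalue:", np.linalg.eigvalsh(H).max())       # should also be < 0
```

In this sketch, the strict negativity comes from the \(-{\varvec{{I}}}_V/\sigma ^2\) prior blocks; the log-likelihood part on its own is only negative semidefinite.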
Cite this article
Fujino, A., Ueda, N. & Nagata, M. Adaptive semi-supervised learning on labeled and unlabeled data with different distributions. Knowl Inf Syst 37, 129–154 (2013). https://doi.org/10.1007/s10115-012-0576-8