Skip to main content
Log in

Semisupervised learning from different information sources

  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

This paper studies the use of a semisupervised learning algorithm from different information sources. We first offer a theoretical explanation as to why minimising the disagreement between individual models could lead to the performance improvement. Based on the observation, this paper proposes a semisupervised learning approach that attempts to minimise this disagreement by employing a co-updating method and making use of both labeled and unlabeled data. Three experiments to test the effectiveness of the approach are presented in this paper: (i) webpage classification from both content and hyperlinks; (ii) functional classification of gene using gene expression data and phylogenetic data and (iii) machine self-maintaining from both sensory and image data. The results show the effectiveness and efficiency of our approach and suggest its application potentials.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Abney S (2002) Bootstrapping. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL 2002). Morgan Kaufmann, San Francisco, CA, pp 360–367

  • Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. In: Langley P (ed) Proceedings of the 17th international conference on machine learning (ICML 2000). Morgan Kaufmann, San Francisco, CA, pp 9–16

  • Becker S (1996) Mutual information maximization: models of cortical self-organization. Netw Comput Neural Syst 7:7–31

    Article  Google Scholar 

  • Bennett KP, Demiriz A, Maclin R (2002) Exploiting unlabeled data in ensemble methods. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (SIGKDD 2002). ACM Press, New York, NY, pp 289–296

  • Blum A, Chawla S (2001) Learning from labeled and unlabeled data using graph mincuts. In: Brodley CE, Danyluk AP (eds) Proceedings of the eighteenth international conference on machine learning (ICML 2001). Morgan Kaufmann, San Francisco, CA, pp 19–26

  • Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory (COLT 1998). ACM Press, New York, NY, pp 92–100

  • Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Furey TS, Manuel Ares J, Haussler D (2000) Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences 97(1):262–267

    Article  Google Scholar 

  • Buc F, Grandvalet Y, Ambroise C (2002) Semi-supervised marginboost. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (NIPS 2001). The MIT Press, Cambridge, MA, pp 553–560

  • Castelli V, Cover T (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans Inf Theory 42:2102–2117

    Article  MathSciNet  Google Scholar 

  • Collins M, Singer Y (1999) Unsupervised models for named entity classification. In: Proceedings of the 1999 joint SIGDAT conference on empirical methods in natural language processing and very large corpora. ACM Press, New York, NY, pp 100–110

  • Cozman FG, Cohen I (2002) Unlabeled data can degrade classification performance of generative classifiers. In: Proceedings of the fifteenth international Florida artificial intelligence society conference (FLAIRS 2002), pp 327–331

  • Dasgupta S, Littman ML, McAllester D (2001) PAC generalization bounds for co-training. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (NIPS 2001). The MIT Press, Cambridge, MA, pp 375–382

  • De Sa VR, Ballard D (1998) Category learning through multimodality sensing. Neural Comput 10:1097–1117

    Article  Google Scholar 

  • Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B 39(1):1–38

    MathSciNet  Google Scholar 

  • Fung G, Mangasarian O (2001) Semi-supervised support vector machines for unlabeled data classification. Optim Methods Softw 15:29–44

    Article  Google Scholar 

  • Ghahramani Z, Jordan MI (1994) Supervised learning from incomplete data via an EM approach. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6 (NIPS 1993). Morgan Kaufmann, San Francisco, CA, pp 120–127

  • Goldman S, Zhou Y (2000) Enhancing supervised learning with unlabeled data. In: Langley P (ed) Proceedings of the 17th international conference on machine learning (ICML 2000). Morgan Kaufmann, San Francisco, CA, pp 327–334

  • Li T, Zhu S, Li Q, Ogihara M (2003) Gene functional classification by semi-supervised learning from heterogeneous data. In: Proceedings of the 18th annual ACM symposium on applied computing (SAC’03). ACM Press, New York, NY, pp 78–82

  • McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748

    Article  Google Scholar 

  • Mitchell TM (1997) Machine learning. McGraw-Hill

  • Muslea I, Minton S, Knoblock CA (2000) Selective sampling with redundant views. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence (AAAI/IAAI 2000). The MIT Press, Cambridge, MA, pp 621–626

  • Neal R, Hinton G (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. The MIT Press, Cambridge, MA, pp 355–368

  • Nigam K (2001) Using unlabeled data to improve text classification. Doctoral Dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA

  • Nigam K, Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the ninth international conference on information and knowledge management (CIKM 2000). ACM Press, New York, NY, pp 86–93

  • Nigam K, McCallum AK, Thrun S, Mitchell TM (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134

    Article  Google Scholar 

  • Pavlidis P, Weston J, Cai J, Grundy WN (2001) Gene functional classification from heterogeneous data. In: Proceedings of fifth annual international conference on computational molecular biology (RECOMB 2001). ACM Press, New York, NY, pp 249–255

  • Quinlan J (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Francisco, CA

  • Roth D, Zelenko D (2000) Toward a theory of learning coherent concepts. In: Proceedings of the seventeenth national conference on artificial intelligence and twelfth conference on innovative applications of artificial intelligence (AAAI/IAAI 2000). The MIT Press, Cambridge, MA, pp 639–644

  • Seeger M (2000) Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, Edinburgh, UK

  • Vapnik VN (1998) Statistical learning theory. Wiley, New York, NY

  • Weiss G, Provost F (2001) The effect of class distribution on classifier learning: an empirical study. Technical report ML-TR 44, Department of Computer Science, Rutgers University, New Brunswick, NJ

    Google Scholar 

  • Wu L, Oviatt SL, Cohen PR (1999) Multimodal integration—a statistical view. IEEE Trans Multimedia 1:334–341

    Article  Google Scholar 

  • Zhang T, Oles F (2000) A probability analysis on the value of unlabeled data for classification problems. In: Langley P (ed) Proceedings of the 17th international conference on machine learning (ICML 2000). Morgan Kaufmann, San Francisco, CA, pp 1191–1198

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tao Li.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Li, T., Ogihara, M. Semisupervised learning from different information sources. Knowl Inf Syst 7, 289–309 (2005). https://doi.org/10.1007/s10115-004-0155-8

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-004-0155-8

Keywords

Navigation