Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Abstract

Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms that have been described in recent years are cubic or worse in the total number of labeled and unlabeled training examples. In this paper we modify the standard LLGC algorithm to improve its efficiency to the point where it can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
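To make the sparsification idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard LLGC update F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2), and it assumes that sparsification means keeping only each example's k nearest neighbours in W. The function names, the Gaussian similarity, and the parameters `k`, `sigma`, `alpha`, and `iters` are illustrative choices; the priming step mentioned in the abstract is not shown.

```python
import math


def knn_sparse_affinity(points, k, sigma=1.0):
    """Build a sparse similarity matrix as a list of {neighbour: weight} dicts.

    Assumption: sparsification keeps only each point's k nearest neighbours.
    The brute-force neighbour search below is quadratic and only suitable
    for a toy example.
    """
    n = len(points)
    w = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            a = math.exp(-d * d / (2.0 * sigma ** 2))
            w[i][j] = a
            w[j][i] = a  # symmetrise so the normalisation is well defined
    return w


def llgc_sparse(points, labels, n_classes, k=2, alpha=0.9, iters=50):
    """Label propagation on the sparse graph.

    labels[i] is a class index, or None for an unlabeled example.
    Returns one predicted class index per example.
    """
    n = len(points)
    w = knn_sparse_affinity(points, k)
    deg = [sum(row.values()) for row in w]
    # S = D^{-1/2} W D^{-1/2}, stored with the same sparsity pattern as W.
    s = [
        {j: a / math.sqrt(deg[i] * deg[j]) for j, a in row.items()}
        for i, row in enumerate(w)
    ]
    # Y holds one-hot rows for labeled examples and zero rows otherwise.
    y = [
        [1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
        for i in range(n)
    ]
    f = [row[:] for row in y]
    for _ in range(iters):
        # Each update touches only the stored neighbours of every example.
        f = [
            [
                alpha * sum(a * f[j][c] for j, a in s[i].items())
                + (1.0 - alpha) * y[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: f[i][c]) for i in range(n)]


# Toy data: two well-separated clusters with one labeled example each.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, None, None, 1, None, None]
predicted = llgc_sparse(points, labels, n_classes=2, k=2)
```

Because each row of S stores at most a few neighbours, one propagation step costs time proportional to the number of stored edges rather than to the square of the number of examples, which is the kind of saving the abstract attributes to sparsification.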



Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Pfahringer, B., Leschi, C., Reutemann, P. (2007). Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_25

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science, Computer Science (R0)
