Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant

Pfahringer, Bernhard; Leschi, Claire; Reutemann, Peter

doi:10.1007/978-3-540-71701-0_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms described in recent years are cubic or worse in the total number of labeled and unlabeled training examples. In this paper we modify the standard LLGC algorithm to improve its efficiency to the point where it can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
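
As a rough illustration of why sparsification helps, the sketch below runs the iterative form of LLGC (learning with local and global consistency), F ← αSF + (1 − α)Y, over a similarity graph that keeps only each example's k most similar neighbours. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `knn_graph` and `llgc`, the cosine similarity, the brute-force neighbour search, and the values of `k`, `alpha` and `n_iter` are all illustrative choices. The priming step is not reproduced, because the abstract does not say how it is done.

```python
import math

def knn_graph(points, k):
    """Sparse symmetric similarity graph keeping each point's k most similar
    neighbours (cosine similarity; brute-force search, for illustration only)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    n = len(points)
    graph = [dict() for _ in range(n)]  # graph[i][j] = similarity of i and j
    for i in range(n):
        ranked = sorted(((cosine(points[i], points[j]), j)
                         for j in range(n) if j != i), reverse=True)
        for s, j in ranked[:k]:
            if s > 0:
                graph[i][j] = s
                graph[j][i] = s  # symmetrise the neighbour relation
    return graph

def llgc(graph, labels, n_classes, alpha=0.9, n_iter=50):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on the sparse graph, where
    S is the degree-normalised similarity matrix. labels[i] is a class index,
    or None for an unlabeled example. Returns one predicted class per example."""
    n = len(graph)
    degree = [sum(neighbours.values()) for neighbours in graph]
    Y = [[0.0] * n_classes for _ in range(n)]
    for i, c in enumerate(labels):
        if c is not None:
            Y[i][c] = 1.0
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        new_F = []
        for i in range(n):
            row = [(1 - alpha) * y for y in Y[i]]
            # Only the stored neighbours contribute, so one sweep costs
            # O(edges * classes) rather than O(n^2 * classes).
            for j, w in graph[i].items():
                s = w / math.sqrt(degree[i] * degree[j])
                for c in range(n_classes):
                    row[c] += alpha * s * F[j][c]
            new_F.append(row)
        F = new_F
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]

# Toy data: two groups of directions, one labeled example per group.
points = [(1.0, 0.1), (1.0, 0.2), (0.9, 0.0),
          (0.1, 1.0), (0.2, 1.0), (0.0, 0.9)]
labels = [0, None, None, 1, None, None]
graph = knn_graph(points, k=2)
predictions = llgc(graph, labels, n_classes=2)
print(predictions)  # [0, 0, 0, 1, 1, 1]
```

With a dense similarity matrix, the closed-form solution requires a matrix inversion, which is where the cubic cost comes from. With at most a fixed number of neighbours per example, each propagation sweep instead scales with the number of stored edges.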

Author information

Authors and Affiliations

Department of Computer Science, University of Waikato, Hamilton, New Zealand
Bernhard Pfahringer & Peter Reutemann
INSA Lyon, France
Claire Leschi

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

About this paper

Cite this paper

Pfahringer, B., Leschi, C., Reutemann, P. (2007). Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_25

DOI: https://doi.org/10.1007/978-3-540-71701-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)
