Abstract
Domains such as text classification can easily supply large amounts of unlabeled data, but labeling that data is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms described in recent years have cubic or worse complexity in the total number of labeled and unlabeled training examples. In this paper we modify the standard LLGC (learning with local and global consistency) algorithm to improve its efficiency to the point where it can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
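To make the sparsification idea concrete, here is a minimal sketch and not the authors' implementation. It assumes cosine similarity and a k-nearest-neighbour rule for deciding which entries of the similarity matrix to keep, neither of which the abstract specifies. It then runs the standard LLGC iteration F ← αSF + (1 − α)Y on the sparse matrix. The names `knn_affinity`, `llgc`, `k` and `alpha` are hypothetical, and the priming step is omitted because the abstract does not describe it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_affinity(X, k):
    """Sparse symmetric affinity matrix stored as one dict per row.

    Assumption: each example keeps only its k most similar neighbours.
    The neighbour search here is brute force (quadratic) for brevity;
    a large-scale version would need a faster search structure.
    """
    n = len(X)
    W = [dict() for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(X[i], X[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for s, j in sims[:k]:
            if s > 0.0:
                W[i][j] = s
                W[j][i] = s  # keep the matrix symmetric
    return W


def llgc(W, labels, n_classes, alpha=0.9, iters=50):
    """Iterative LLGC on a sparse affinity matrix.

    labels[i] is a class index for labeled examples and None otherwise.
    Returns the predicted class index for every example.
    """
    n = len(W)
    deg = [sum(row.values()) for row in W]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}, kept sparse.
    S = [
        {j: w / math.sqrt(deg[i] * deg[j]) for j, w in row.items()}
        for i, row in enumerate(W)
    ]
    Y = [
        [1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
        for i in range(n)
    ]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Each update touches only the stored (non-zero) entries of S.
        F = [
            [
                alpha * sum(s * F[j][c] for j, s in S[i].items())
                + (1 - alpha) * Y[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Toy usage: two small clusters with one labeled example each.
X = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = [0, None, None, 1, None, None]
predictions = llgc(knn_affinity(X, k=2), labels, n_classes=2)
print(predictions)
```

With at most on the order of k stored entries per row, each propagation step costs time roughly proportional to n·k per class rather than n², and no dense matrix inverse is needed. This is the general reason sparsification helps such methods scale.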
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Pfahringer, B., Leschi, C., Reutemann, P. (2007). Scaling Up Semi-supervised Learning: An Efficient and Effective LLGC Variant. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_25
DOI: https://doi.org/10.1007/978-3-540-71701-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)