Abstract
Semi-supervised learning usually requires some prior knowledge to supervise the learning process, such as seeds in Seeded-Kmeans, pairwise constraints in COP-Kmeans, and labeled data for training a useful initial classifier in S3VM. Such prior knowledge is generally provided by domain experts, so it is expensive to obtain. In this paper, we propose a new automatic document labeling strategy that derives much more prior knowledge from the very limited labeled data and the whole data set. Experimental results on the 20-Newsgroup text data show that the new strategy is helpful for semi-supervised document categorization and improves learning performance.
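To make the role of seeds as prior knowledge concrete, the following is a minimal sketch of seeded k-means in the general spirit of Seeded-Kmeans: the labeled seeds are used only to initialise the centroids, after which ordinary k-means iterations run. It is not the labeling strategy proposed in this paper, which the abstract does not detail. The function names, the `seeds` dictionary format, and the toy two-dimensional points are assumptions made for illustration; real documents would be represented by high-dimensional term vectors.

```python
from math import dist


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def seeded_kmeans(points, seeds, max_iter=50):
    """Cluster `points`, starting from centroids built from labeled seeds.

    points: list of numeric tuples.
    seeds:  dict mapping a point index to its known cluster id
            (a hypothetical input format chosen for this sketch).
    Returns a list with one cluster id per point.
    """
    cluster_ids = sorted(set(seeds.values()))
    # Seeding step: each initial centroid is the mean of that cluster's seeds.
    centroids = {
        c: mean([points[i] for i, lab in seeds.items() if lab == c])
        for c in cluster_ids
    }
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignment = [
            min(cluster_ids, key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the run has converged
        assignment = new_assignment
        # Update step: recompute centroids; an empty cluster keeps its old one.
        for c in cluster_ids:
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


# Toy data: two well-separated groups, one labeled seed in each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
seeds = {0: 0, 3: 1}
labels = seeded_kmeans(points, seeds)
print(labels)
```

With only one seed per cluster, the quality of the initial centroids depends heavily on which points happen to be labeled, which is the scarcity problem the abstract refers to.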
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, Y., Jing, L., Yu, J. (2009). New Labeling Strategy for Semi-supervised Document Categorization. In: Karagiannis, D., Jin, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2009. Lecture Notes in Computer Science, vol 5914. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10488-6_16
DOI: https://doi.org/10.1007/978-3-642-10488-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10487-9
Online ISBN: 978-3-642-10488-6
eBook Packages: Computer Science (R0)