Abstract
Traditional approaches to building text classifiers usually require a large number of labeled documents, which are expensive to obtain. In this paper, we study the problem of building a text classifier from a single keyword and a set of unlabeled documents, so that no documents need to be labeled manually. First, we expand the keyword into a set of query terms and use them to retrieve documents from the unlabeled set. Second, we mine a set of positive documents from the retrieved documents. Third, with the help of these positive documents, we extract more positive documents from the unlabeled set. Finally, we train a positive-unlabeled (PU) text classifier on the positive documents and the unlabeled documents. Our experimental results on the 20 Newsgroups dataset show that the proposed approach can build effective text classifiers.
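The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the keyword is expanded, how positives are mined, or which PU learner is used. The related terms are supplied by hand, and the stopword list, the `min_hits` rule, the cosine threshold, and the Rocchio-style centroid classifier are all hypothetical stand-ins.

```python
from collections import Counter
import math

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "on", "with", "my", "during", "before"}

def tokenize(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)  # sums term counts across documents
    return total

def build_classifier(keyword, related_terms, unlabeled,
                     min_hits=2, sim_threshold=0.2):
    vecs = [Counter(tokenize(d)) for d in unlabeled]
    # Step 1: expand the keyword into query terms (hand-supplied here).
    query = {keyword} | set(related_terms)
    # Step 2: retrieve documents that contain any query term.
    retrieved = [i for i, v in enumerate(vecs) if any(t in v for t in query)]
    # Step 3: mine positives; the hypothetical rule is "at least
    # min_hits distinct query terms appear in the document".
    positives = {i for i in retrieved
                 if sum(t in vecs[i] for t in query) >= min_hits}
    if not positives:
        raise ValueError("no positive documents could be mined")
    # Step 4: extract more positives by similarity to the positive centroid.
    seed = centroid(vecs[i] for i in positives)
    positives |= {i for i, v in enumerate(vecs)
                  if i not in positives and cosine(v, seed) >= sim_threshold}
    # Step 5: train a simple stand-in for a PU learner: compare a document
    # to the positive centroid and to the centroid of everything else.
    pos_c = centroid(vecs[i] for i in positives)
    rest_c = centroid(v for i, v in enumerate(vecs) if i not in positives)

    def classify(text):
        v = Counter(tokenize(text))
        return cosine(v, pos_c) > cosine(v, rest_c)

    return classify, sorted(positives)

docs = [
    "the hockey team won the game on ice",
    "hockey players skate on ice with a puck",
    "the goalie stopped the puck during the playoff game",
    "new graphics card renders images faster",
    "the graphics driver crashed my computer",
    "install the driver before using the card",
]
classify, positives = build_classifier("hockey", ["puck", "ice"], docs)
print(positives)
print(classify("the goalie lost the puck on the ice"))
print(classify("my computer needs a new graphics driver"))
```

In this toy run the third document contains only one query term, so it is not mined in step 3, but it is picked up in step 4 because it shares vocabulary with the mined positives. A real PU learner would treat the remaining unlabeled documents more carefully than this centroid comparison does, since some of them may still be positive.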
This work is supported by Young Cadreman Supporting Program of Northwest A&F University (01140301). Corresponding author: Yang Zhang.
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Qiu, Q., Zhang, Y., Zhu, J. (2009). Building a Text Classifier by a Keyword and Unlabeled Documents. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_54
DOI: https://doi.org/10.1007/978-3-642-01307-2_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)