Building a Text Classifier by a Keyword and Unlabeled Documents

Qiu, Qiang; Zhang, Yang; Zhu, Junping

doi:10.1007/978-3-642-01307-2_54

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Abstract

Traditional approaches to building text classifiers usually require many labeled documents, which are expensive to obtain. In this paper, we study the problem of building a text classifier from a keyword and unlabeled documents, so as to avoid labeling documents manually. First, we expand the keyword into a set of query terms and use them to retrieve a set of documents from the unlabeled collection. Second, we mine a set of positive documents from the retrieved documents. Third, with the help of these positive documents, we extract more positive documents from the unlabeled documents. Finally, we train a PU (positive and unlabeled) text classifier with these positive documents and the unlabeled documents. Our experimental results on the 20 Newsgroups dataset show that the proposed approach can help build excellent text classifiers.
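The four steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the keyword is expanded, how positive documents are mined, or which PU learner is used. The sketch assumes a hand-supplied list of related terms for expansion, a "matches at least `min_hits` query terms" rule for mining positives, cosine similarity to the centroid of the positives for extracting further positives, and a naive Bayes model that treats the remaining unlabeled documents as negatives. All function names, parameters, thresholds, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def expand_keyword(keyword, related_terms):
    # Assumption: expansion terms are supplied by hand, not derived automatically.
    return {keyword.lower()} | {t.lower() for t in related_terms}

def retrieve(docs, query_terms):
    # Step 1: indices of documents that contain at least one query term.
    return [i for i, d in enumerate(docs) if query_terms & set(tokenize(d))]

def mine_positives(docs, retrieved, query_terms, min_hits=2):
    # Step 2 (assumed rule): keep documents matching >= min_hits distinct query terms.
    return [i for i in retrieved
            if len(query_terms & set(tokenize(docs[i]))) >= min_hits]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_more_positives(docs, positives, threshold=0.2):
    # Step 3 (assumed rule): add documents similar to the centroid of the positives.
    centroid = Counter()
    for i in positives:
        centroid.update(tokenize(docs[i]))
    extra = [i for i in range(len(docs))
             if i not in positives
             and cosine(Counter(tokenize(docs[i])), centroid) >= threshold]
    return sorted(set(positives) | set(extra))

def train_pu_naive_bayes(docs, positives):
    # Step 4 (stand-in PU learner): treat all non-positive documents as negatives
    # and fit multinomial naive Bayes with Laplace smoothing.
    pos = set(positives)
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for i, d in enumerate(docs):
        label = i in pos
        counts[label].update(tokenize(d))
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])

    def predict(text):
        scores = {}
        for label in (True, False):
            total = sum(counts[label].values()) + len(vocab)
            score = math.log((n_docs[label] + 1) / (len(docs) + 2))
            for tok in tokenize(text):
                if tok in vocab:  # ignore words never seen in training
                    score += math.log((counts[label][tok] + 1) / total)
            scores[label] = score
        return scores[True] > scores[False]

    return predict

# Toy unlabeled corpus (hypothetical data, for illustration only).
docs = [
    "the hockey team won the game on ice",
    "hockey players skate on ice with a puck",
    "the goalie stopped the puck on the ice rink",
    "new graphics card renders images fast",
    "the computer monitor displays graphics",
    "install the driver for the graphics card",
]
query_terms = expand_keyword("hockey", ["puck"])
retrieved = retrieve(docs, query_terms)
seed_positives = mine_positives(docs, retrieved, query_terms)
positives = extract_more_positives(docs, seed_positives)
classify = train_pu_naive_bayes(docs, positives)
```

On the toy corpus, `classify("ice hockey puck")` can be called to check whether a new document is assigned to the keyword's class. Treating unlabeled documents as negatives is only the simplest PU baseline; a real system would likely use a more careful PU learning method.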

This work is supported by the Young Cadreman Supporting Program of Northwest A&F University (01140301). Corresponding author: Yang Zhang.

Author information

Authors and Affiliations

College of Information Engineering, Northwest A&F University, Yangling, Shaanxi Province, P.R. China, 712100
Qiang Qiu, Yang Zhang & Junping Zhu

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

About this paper

Cite this paper

Qiu, Q., Zhang, Y., Zhu, J. (2009). Building a Text Classifier by a Keyword and Unlabeled Documents. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_54

DOI: https://doi.org/10.1007/978-3-642-01307-2_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)
