Building a Text Classifier by a Keyword and Unlabeled Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Abstract

Traditional approaches to building text classifiers usually require many labeled documents, which are expensive to obtain. In this paper, we study the problem of building a text classifier from a keyword and a collection of unlabeled documents, so that no documents need to be labeled manually. First, we expand the keyword into a set of query terms and use them to retrieve documents from the unlabeled collection. Second, we mine a set of positive documents from the retrieved documents. Third, with the help of these positive documents, we extract more positive documents from the unlabeled collection. Finally, we train a PU (positive and unlabeled) text classifier on the positive documents and the unlabeled documents. Experimental results on the 20 Newsgroups dataset show that the proposed approach helps to build high-quality text classifiers.
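The four steps in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation: the abstract does not say how the keyword is expanded, how positives are mined, or which PU learner is used. Here, as labeled assumptions, the keyword is expanded by co-occurrence counts, positives are documents that match at least `min_hits` query terms, further positives are found by cosine similarity to the positive centroid, and the final classifier is a simple Rocchio-style comparison of two centroids. All function names, thresholds, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts; whitespace tokenization is a simplification.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Sum of term counts; cosine similarity ignores the missing 1/n factor.
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def expand_keyword(keyword, vecs, top_k=3):
    # Assumption: expand the keyword with the terms that co-occur with it most often.
    co = Counter()
    for v in vecs:
        if keyword in v:
            co.update(t for t in v if t != keyword)
    ranked = sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))  # deterministic ties
    return [keyword] + [t for t, _ in ranked[:top_k]]

def build_pu_classifier(keyword, docs, top_k=3, min_hits=2, sim_threshold=0.4):
    vecs = [vectorize(d) for d in docs]
    # Step 1: expand the keyword and retrieve documents matching any query term.
    query = expand_keyword(keyword, vecs, top_k)
    retrieved = [i for i, v in enumerate(vecs) if any(t in v for t in query)]
    # Step 2: mine initial positives (assumed rule: at least min_hits query terms).
    positives = {i for i in retrieved if sum(t in vecs[i] for t in query) >= min_hits}
    # Step 3: extract more positives that lie close to the positive centroid.
    pos_c = centroid(vecs[i] for i in positives)
    positives |= {i for i, v in enumerate(vecs) if cosine(v, pos_c) >= sim_threshold}
    # Step 4: Rocchio-style stand-in for a PU learner: the remaining unlabeled
    # documents are treated as negatives and each class is one centroid.
    pos_c = centroid(vecs[i] for i in positives)
    neg_c = centroid(v for i, v in enumerate(vecs) if i not in positives)

    def classify(text):
        v = vectorize(text)
        return cosine(v, pos_c) > cosine(v, neg_c)

    return classify, sorted(positives)

# Hypothetical toy collection of unlabeled documents.
docs = [
    "hockey game team players ice",
    "hockey team season playoff goal",
    "ice rink players goal season",
    "graphics card driver windows",
    "windows driver install software",
    "software graphics image rendering",
]
classify, positives = build_pu_classifier("hockey", docs)
print(positives)
print(classify("players scored a goal on the ice"))
print(classify("install the graphics driver"))
```

In this toy run the third document never mentions the keyword, yet it is picked up in step 3 because it shares vocabulary with the mined positives; that is the role the abstract assigns to the "more positive documents" step.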

This work is supported by the Young Cadreman Supporting Program of Northwest A&F University (01140301). Corresponding author: Yang Zhang.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qiu, Q., Zhang, Y., Zhu, J. (2009). Building a Text Classifier by a Keyword and Unlabeled Documents. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_54

  • DOI: https://doi.org/10.1007/978-3-642-01307-2_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01306-5

  • Online ISBN: 978-3-642-01307-2

  • eBook Packages: Computer Science, Computer Science (R0)
