Using Wikipedia knowledge to improve text classification

  • Regular Paper
Knowledge and Information Systems

Abstract

Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the “Bag of Words” (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
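The idea of expanding a BOW vector with related concepts can be illustrated with a minimal sketch. It is not the authors' framework: the toy thesaurus, the relation weights, and the names `TOY_THESAURUS`, `RELATION_WEIGHTS`, `bag_of_words`, and `expand_bow` are all assumptions made up for this example, whereas the paper builds its thesaurus automatically from Wikipedia.

```python
from collections import Counter

# Hypothetical hand-written thesaurus (an assumption for illustration only).
# Each term maps to related concepts, grouped by the three relation types
# named in the abstract.
TOY_THESAURUS = {
    "puma": {
        "synonymy": ["cougar"],
        "hyponymy": ["felidae"],      # broader concept via the is-a relation
        "associative": ["predator"],
    },
    "car": {
        "synonymy": ["automobile"],
        "hyponymy": ["vehicle"],
        "associative": ["engine"],
    },
}

# Assumed weights: how strongly each relation type contributes to the
# expanded vector. These values are illustrative, not taken from the paper.
RELATION_WEIGHTS = {"synonymy": 1.0, "hyponymy": 0.5, "associative": 0.25}


def bag_of_words(text):
    """Plain BOW: lowercase, split on whitespace, count term frequency."""
    return Counter(text.lower().split())


def expand_bow(bow, thesaurus=TOY_THESAURUS, weights=RELATION_WEIGHTS):
    """Return a new vector with weighted counts added for related concepts."""
    expanded = {term: float(count) for term, count in bow.items()}
    for term, count in bow.items():
        for relation, related_terms in thesaurus.get(term, {}).items():
            for related in related_terms:
                expanded[related] = expanded.get(related, 0.0) + weights[relation] * count
    return expanded


doc = "puma hunts near the car"
expanded = expand_bow(bag_of_words(doc))
```

With this expansion, a document that mentions only "puma" and one that mentions only "cougar" share a nonzero feature, which plain term frequency would miss. A real system would also need to disambiguate terms before looking them up, which this sketch skips.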

Author information

Corresponding author

Correspondence to Pu Wang.

About this article

Cite this article

Wang, P., Hu, J., Zeng, HJ. et al. Using Wikipedia knowledge to improve text classification. Knowl Inf Syst 19, 265–281 (2009). https://doi.org/10.1007/s10115-008-0152-4

  • DOI: https://doi.org/10.1007/s10115-008-0152-4
