Using Wikipedia knowledge to improve text classification

Wang, Pu; Hu, Jian; Zeng, Hua-Jun; Chen, Zheng

doi:10.1007/s10115-008-0152-4

Regular Paper
Published: 17 September 2008

Volume 19, pages 265–281 (2009)

Knowledge and Information Systems

Pu Wang¹,
Jian Hu²,
Hua-Jun Zeng² &
Zheng Chen²

Abstract

Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the “Bag of Words” (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
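
The idea of enriching a BOW vector with related concepts can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the toy thesaurus, the relation weights, and the function names `bag_of_words` and `expand_bow` are all assumptions made for the example, and the abstract does not say how the paper weights or selects related concepts.

```python
from collections import Counter

# Hypothetical toy thesaurus (NOT built from Wikipedia): each concept maps
# relation types named in the abstract to a list of related concepts.
TOY_THESAURUS = {
    "puma": {
        "synonymy": ["cougar"],
        "hyponymy": ["felidae"],      # a broader concept that "puma" falls under
        "associative": ["predator"],
    },
}

# Assumed weights, chosen only for illustration.
RELATION_WEIGHTS = {"synonymy": 1.0, "hyponymy": 0.5, "associative": 0.25}


def bag_of_words(text):
    """Count lowercase, whitespace-separated terms."""
    return Counter(text.lower().split())


def expand_bow(bow, thesaurus, weights):
    """Return the original counts plus weighted counts for related concepts."""
    expanded = {term: float(count) for term, count in bow.items()}
    for term, count in bow.items():
        for relation, related_terms in thesaurus.get(term, {}).items():
            for related in related_terms:
                expanded[related] = expanded.get(related, 0.0) + weights[relation] * count
    return expanded


doc = "the puma is a puma"
vector = expand_bow(bag_of_words(doc), TOY_THESAURUS, RELATION_WEIGHTS)
```

With this toy setup, a document that mentions only "puma" also receives non-zero weight on "cougar", so a classifier trained on documents that say "cougar" can still match it. Plain term counts cannot bridge that gap.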



Author information

Authors and Affiliations

Department of Computer Science, George Mason University, Fairfax, VA, 22030, USA
Pu Wang
Machine Learning Group, Microsoft Research Asia, Beijing, China
Jian Hu, Hua-Jun Zeng & Zheng Chen

Corresponding author

Correspondence to Pu Wang.

About this article

Cite this article

Wang, P., Hu, J., Zeng, HJ. et al. Using Wikipedia knowledge to improve text classification. Knowl Inf Syst 19, 265–281 (2009). https://doi.org/10.1007/s10115-008-0152-4

Received: 29 October 2007
Revised: 24 December 2007
Accepted: 29 January 2008
Published: 17 September 2008
Issue Date: June 2009
DOI: https://doi.org/10.1007/s10115-008-0152-4
