
A comparative study on text representation schemes in text categorization

  • Original Article
Pattern Analysis and Applications

Abstract

It is well known that the classification effectiveness of a text categorization system is not simply a matter of learning algorithms; text representation factors are also at work. This paper examines the ways in which the effectiveness of text classifiers is linked to five text representation factors: "stop word removal", "word stemming", "indexing", "weighting", and "normalization". Statistical analyses of the experimental results show that performing "normalization" always improves the effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contrary to common sense, a simple binary indexing method can sometimes be helpful for text categorization.
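To make the five factors concrete, the following minimal Python sketch builds a single bag-of-words vector with each factor exposed as a switch. It is an illustration under our own assumptions, not the authors' experimental code: the stop word list is a toy one, the stemmer is a stand-in for a real algorithm such as Porter's, and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # toy stop word list

def stem(word):
    # Stand-in for a real stemmer; it only strips a trailing "s"
    # so that the sketch stays dependency-free.
    return word[:-1] if word.endswith("s") else word

def represent(doc, idf, remove_stop=True, do_stem=True,
              binary=False, use_idf=True, normalize=True):
    """Build one bag-of-words vector, exposing the five studied factors."""
    tokens = doc.lower().split()
    if remove_stop:                              # factor 1: stop word removal
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if do_stem:                                  # factor 2: word stemming
        tokens = [stem(t) for t in tokens]
    counts = Counter(tokens)
    vec = {t: (1.0 if binary else float(c))      # factor 3: binary vs. tf indexing
           for t, c in counts.items()}
    if use_idf:                                  # factor 4: (idf) weighting
        vec = {t: w * idf.get(t, 0.0) for t, w in vec.items()}
    if normalize:                                # factor 5: length normalization
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vec = {t: w / norm for t, w in vec.items()}
    return vec

# Toy usage: idf computed over the stemmed tokens of a two-document corpus.
docs = ["the cats sat on the mat", "the dogs sat in the grass"]
df = Counter(t for d in docs for t in {stem(w) for w in d.lower().split()})
idf = {t: math.log(len(docs) / df[t]) for t in df}
print(represent(docs[0], idf))
```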



Notes

  1. In some of the text categorization literature, Mutual Information is referred to as Information Gain.
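For reference, the information gain of a term t with respect to categories c_1, …, c_m is commonly defined in the feature selection literature as shown below; this is the standard textbook form and may differ in detail from the formulation used in the paper:

$$
IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
      + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
      + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
$$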


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable suggestions.

Author information

Corresponding author

Correspondence to Fengxi Song.

Appendix

Table 4 The micro-average break-even points of different text representations for Reuters-21578 when the top 1,000 features are selected
Table 5 The micro-average break-even points of different text representations for Reuters-21578 when the top 2,000 features are selected
Table 6 The micro-average break-even points of different text representations for Reuters-21578 when the top 3,000 features are selected
Table 7 The micro-average break-even points of different text representations for Reuters-21578 when the top 4,000 features are selected
Table 8 The micro-average break-even points of different text representations for Reuters-21578 when the top 5,000 features are selected
Table 9 The micro-average break-even points of different text representations for Reuters-21578 when the top 6,000 features are selected
Table 10 The multi-label accuracy of different text representations for 20 newsgroups when the top 5,000 features are selected
Table 11 The multi-label accuracy of different text representations for 20 newsgroups when the top 7,500 features are selected
Table 12 The multi-label accuracy of different text representations for 20 newsgroups when the top 10,000 features are selected
Table 13 The multi-label accuracy of different text representations for 20 newsgroups when the top 20,000 features are selected
Table 14 The multi-label accuracy of different text representations for 20 newsgroups when the top 30,000 features are selected
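For readers unfamiliar with the first metric: the micro-averaged break-even point is the value at which precision equals recall when decisions are pooled over all categories. The sketch below approximates it under the simplifying assumption that all (document, category) decisions are ranked by a single confidence score; the function name and the closest-approach interpolation are our choices, not the paper's.

```python
def micro_average_bep(scores, labels):
    """Approximate the break-even point of one pooled confidence ranking.

    scores -- classifier confidence for each (document, category) decision
    labels -- 1 if the decision is truly positive, 0 otherwise
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, best, best_gap = 0, 0.0, float("inf")
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        precision, recall = tp / rank, tp / total_pos
        # Precision and recall rarely coincide exactly, so report the
        # average of the two at their point of closest approach.
        if abs(precision - recall) < best_gap:
            best_gap = abs(precision - recall)
            best = (precision + recall) / 2
    return best

print(micro_average_bep([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 1]))
```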

Originality and contribution

It is well known that the effectiveness of a text categorization system is not simply a matter of learning algorithms; text representation factors are also at work. Although representation schemes other than the bag of words, such as statistical phrase-based and n-gram-based representations, have been examined previously without much success, to the authors' knowledge the variants of the bag of words and their effectiveness have not been studied systematically. Several questions therefore remain unanswered:

  • Among the possible variants of the bag-of-words scheme, which ones are likely to be the best? (A sketch of how such variants can be enumerated follows this list.)

  • Among the factors that may affect a text representation, which ones are the most important and should be treated seriously?

  • Is "stop word removal" an indispensable step in representing a text document?

  • Does indexing a text document with term frequencies always outperform indexing it with binary values?

  • Is "word stemming" harmful or beneficial for text categorization?

  • How can a text document best be represented?
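Because each of the factors above is a small set of choices, the representation variants form a full factorial design that can be enumerated exhaustively. The harness below is hypothetical: evaluate() stands in for a real train-and-test cycle (e.g., an SVM scored by the micro-averaged break-even point) and returns a dummy value so the sketch runs as-is.

```python
from itertools import product

def evaluate(remove_stop, do_stem, binary, use_idf, normalize):
    # Stand-in for training and testing a classifier on this variant.
    return 0.0

FACTORS = {
    "remove_stop": (True, False),  # stop word removal
    "do_stem":     (True, False),  # word stemming
    "binary":      (True, False),  # binary vs. term-frequency indexing
    "use_idf":     (True, False),  # idf weighting
    "normalize":   (True, False),  # length normalization
}

# Score every combination of factor settings, then rank the variants.
results = {
    combo: evaluate(**dict(zip(FACTORS, combo)))
    for combo in product(*FACTORS.values())
}
best = max(results, key=results.get)
print("best variant:", dict(zip(FACTORS, best)))
```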

Through extensive experiments on two benchmark datasets, Reuters-21578 and 20 Newsgroups, and thorough statistical analyses of the results, this paper answers all of the above questions with some confidence.
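As an illustration of the kind of paired analysis involved, the sketch below runs a paired t-test over two hypothetical sequences of break-even points measured at the same feature-set sizes. The numbers are invented for illustration and the choice of test is our assumption; the paper's own analysis may differ.

```python
from scipy import stats

# Invented break-even points for one variant with and without length
# normalization, measured at six matched feature-set sizes (cf. Tables 4-9).
with_norm    = [0.85, 0.86, 0.87, 0.87, 0.88, 0.88]
without_norm = [0.80, 0.81, 0.82, 0.83, 0.83, 0.84]

# A paired test asks whether the difference is consistent across
# matched conditions, not just large on average.
t_stat, p_value = stats.ttest_rel(with_norm, without_norm)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```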

The main contribution of this paper is that it clarifies some common misconceptions about text representation, so that texts can be represented more effectively and efficiently.


Cite this article

Song, F., Liu, S. & Yang, J. A comparative study on text representation schemes in text categorization. Pattern Anal Applic 8, 199–209 (2005). https://doi.org/10.1007/s10044-005-0256-3
