Enhancement of DTP Feature Selection Method for Text Categorization

Moyotl-Hernández, Edgar; Jiménez-Salazar, Héctor

doi:10.1007/978-3-540-30586-6_80

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper studies the structure of the vectors obtained by applying term selection methods to a high-dimensional text collection. We found that the distance to transition point (DTP) method omits commonly occurring terms, which are poor discriminators between documents but convey important information about a collection. Experimental results on the Reuters-21578 collection with the k-NN classifier show that feature selection by DTP combined with common terms slightly outperforms simple document frequency.
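
The abstract does not spell out how DTP scores are computed or how the common terms are added, so the following is only a minimal sketch under stated assumptions. It assumes the transition point formula commonly used in the literature on Zipf's law, TP = (sqrt(1 + 8 * I1) - 1) / 2, where I1 is the number of terms that occur exactly once, and it assumes that DTP ranks a term by |TP - frequency|. The function names `transition_point` and `select_terms`, the parameter `n_common`, and the choice to append the terms with the highest document frequency are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(1 + 8 * I1) - 1) / 2,
    where I1 is the number of terms with frequency exactly 1."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (sqrt(1 + 8 * i1) - 1) / 2


def select_terms(docs, k, n_common=0):
    """Pick the k terms closest to the transition point, then append
    up to n_common high-document-frequency terms not already chosen.

    docs is a list of token lists. The combination rule is an
    illustrative assumption, not the paper's stated procedure.
    """
    # Collection-wide term frequency and per-term document frequency.
    tf = Counter(tok for doc in docs for tok in doc)
    df = Counter(tok for doc in docs for tok in set(doc))

    tp = transition_point(tf)

    # Smaller |TP - tf| means closer to the transition point.
    # Ties are broken alphabetically to keep the result deterministic.
    by_dtp = sorted(tf, key=lambda t: (abs(tp - tf[t]), t))
    selected = by_dtp[:k]
    chosen = set(selected)

    # Most common terms first, skipping those already selected.
    by_df = sorted(df, key=lambda t: (-df[t], t))
    common = [t for t in by_df if t not in chosen][:n_common]
    return selected + common


if __name__ == "__main__":
    toy_docs = [["a", "b", "a"], ["a", "c"], ["a", "d", "b"]]
    print(select_terms(toy_docs, k=2, n_common=1))
```

On the toy collection, the very frequent term "a" lies far from the transition point and is skipped by the DTP ranking alone, which mirrors the omission of common terms that the abstract describes; setting `n_common=1` brings it back.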

This work was supported by VIEP-BUAP, grant III9-04/ING/G.

Author information

Authors and Affiliations

Facultad de Ciencias de la Computación, B. Universidad Autónoma de Puebla,
Edgar Moyotl-Hernández & Héctor Jiménez-Salazar

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Moyotl-Hernández, E., Jiménez-Salazar, H. (2005). Enhancement of DTP Feature Selection Method for Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_80

DOI: https://doi.org/10.1007/978-3-540-30586-6_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
