Enhancement of DTP Feature Selection Method for Text Categorization

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Abstract

This paper studies the structure of the vectors obtained by applying term selection methods to high-dimensional text collections. We found that the distance to transition point (DTP) method omits commonly occurring terms, which are poor discriminators between documents but convey important information about a collection. Experimental results on the Reuters-21578 collection with the k-NN classifier show that feature selection by DTP combined with common terms slightly outperforms simple document frequency.
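The abstract does not spell out the selection procedure, so the following is only a minimal sketch of how DTP selection combined with common terms might look. It assumes the transition point formula usually quoted in the transition point literature, TP = (sqrt(1 + 8·I1) − 1)/2, where I1 is the number of terms occurring exactly once, and it assumes that "common terms" means the terms with the highest document frequency. The function names and the parameters `k_dtp` and `k_common` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def transition_point(term_freq):
    """Transition point computed from the number of terms that occur once.

    Assumed formula (not stated in the abstract): TP = (sqrt(1 + 8*I1) - 1) / 2.
    """
    i1 = sum(1 for f in term_freq.values() if f == 1)
    return (math.sqrt(1 + 8 * i1) - 1) / 2


def select_terms(docs, k_dtp, k_common):
    """Select a vocabulary from tokenized documents (a list of token lists).

    Takes the k_dtp terms whose collection frequency is closest to the
    transition point, plus the k_common terms with the highest document
    frequency. Both cut-offs are hypothetical parameters.
    """
    term_freq = Counter(t for doc in docs for t in doc)       # collection frequency
    doc_freq = Counter(t for doc in docs for t in set(doc))   # document frequency
    tp = transition_point(term_freq)

    # DTP ranking: smallest |TP - frequency| first, ties broken alphabetically.
    by_dtp = sorted(term_freq, key=lambda t: (abs(tp - term_freq[t]), t))
    # Common-term ranking: highest document frequency first.
    by_df = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))

    return set(by_dtp[:k_dtp]) | set(by_df[:k_common])


if __name__ == "__main__":
    toy_docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
    print(sorted(select_terms(toy_docs, k_dtp=1, k_common=1)))
```

The selected vocabulary could then be used to build the document vectors passed to a k-NN classifier; how the two term sets were actually sized and merged in the experiments is not recoverable from the abstract.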

This work was supported by VIEP-BUAP, grant III9-04/ING/G.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moyotl-Hernández, E., Jiménez-Salazar, H. (2005). Enhancement of DTP Feature Selection Method for Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_80

  • DOI: https://doi.org/10.1007/978-3-540-30586-6_80

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science, Computer Science (R0)
