Optimization of dependency and pruning usage in text classification

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

In this study, a comprehensive analysis of the lexical dependency and pruning concepts for the text classification problem is presented. Dependencies are included in the feature vector as an extension to the standard bag-of-words approach. The pruning process filters features with low frequencies so that fewer but more informative features remain in the solution vector. The pruning levels for words, dependencies, and dependency combinations for different datasets are analyzed in detail. The main motivation in this work is to make use of dependencies and pruning efficiently in text classification and to achieve more successful results using much smaller feature vector sizes. Three different datasets were used in the experiments and statistically significant improvements for most of the proposed approaches were obtained.
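
The two ideas described above, extending bag-of-words features with dependencies and pruning low-frequency features, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature naming (`w:` and `d:` prefixes), the `(head, dependent)` pair format, the function names, and the threshold values are all assumptions made for illustration, and the dependency pairs are assumed to come from an external parser. Separate thresholds for words and dependencies reflect the abstract's point that pruning levels are analyzed per feature type.

```python
from collections import Counter


def extract_features(tokens, dependencies):
    """Return word features plus dependency features for one document.

    `dependencies` is a list of (head, dependent) word pairs. The pair
    format is a hypothetical stand-in for real parser output.
    """
    word_feats = [f"w:{t}" for t in tokens]
    dep_feats = [f"d:{head}>{dep}" for head, dep in dependencies]
    return word_feats + dep_feats


def prune_vocabulary(docs_features, min_word_count=2, min_dep_count=2):
    """Keep only features whose corpus frequency reaches the threshold
    for their type (word or dependency)."""
    counts = Counter(f for feats in docs_features for f in feats)
    kept = set()
    for feat, n in counts.items():
        threshold = min_word_count if feat.startswith("w:") else min_dep_count
        if n >= threshold:
            kept.add(feat)
    return kept


def vectorize(feats, vocabulary):
    """Map one document's features to a count vector over the pruned
    vocabulary, using sorted feature names as the column order."""
    index = {f: i for i, f in enumerate(sorted(vocabulary))}
    vec = [0] * len(index)
    for f in feats:
        if f in index:
            vec[index[f]] += 1
    return vec


# Toy corpus: (tokens, dependency pairs) per document.
docs = [
    (["stocks", "fell", "sharply"], [("fell", "stocks"), ("fell", "sharply")]),
    (["stocks", "rose"], [("rose", "stocks")]),
    (["prices", "fell", "sharply"], [("fell", "prices"), ("fell", "sharply")]),
]

features = [extract_features(tokens, deps) for tokens, deps in docs]
vocab = prune_vocabulary(features, min_word_count=2, min_dep_count=2)
vectors = [vectorize(feats, vocab) for feats in features]
```

In this toy corpus, features that occur only once are dropped, so the vectors are shorter than the unpruned ones while keeping the features shared across documents.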


Acknowledgments

This work was supported by the Boğaziçi University Research Fund under the Grant Number 05A103D and the Turkish State Planning Organization (DPT) under the TAM Project, number 2007K120610.

Author information

Corresponding author

Correspondence to Levent Özgür.

About this article

Cite this article

Özgür, L., Güngör, T. Optimization of dependency and pruning usage in text classification. Pattern Anal Applic 15, 45–58 (2012). https://doi.org/10.1007/s10044-010-0195-5

  • DOI: https://doi.org/10.1007/s10044-010-0195-5
