Abstract
This study presents a comprehensive analysis of lexical dependency and pruning concepts for the text classification problem. Dependencies are added to the feature vector as an extension to the standard bag-of-words approach. The pruning process filters out low-frequency features so that fewer but more informative features remain in the solution vector. The pruning levels for words, dependencies, and dependency combinations are analyzed in detail for different datasets. The main motivation of this work is to use dependencies and pruning efficiently in text classification and to achieve more successful results with much smaller feature vectors. Three different datasets were used in the experiments, and statistically significant improvements were obtained for most of the proposed approaches.
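The two ideas in the abstract, extending bag-of-words with dependency features and pruning low-frequency features, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names `build_features` and `prune`, the `dep:head->dependent` naming scheme, the toy documents, and the choice to prune on total corpus frequency with a single threshold `min_count`. Dependency pairs are supplied by hand here; in practice they would come from a parser.

```python
from collections import Counter


def build_features(tokens, dependencies):
    """Bag-of-words counts extended with one feature per dependency pair.

    The (head, dependent) pairs are hypothetical hand-written input.
    """
    features = Counter(tokens)
    for head, dependent in dependencies:
        features[f"dep:{head}->{dependent}"] += 1
    return features


def prune(doc_features, min_count):
    """Drop every feature whose total corpus frequency is below min_count.

    Assumption: pruning is applied to summed counts over all documents.
    """
    totals = Counter()
    for features in doc_features:
        totals.update(features)
    kept = {name for name, count in totals.items() if count >= min_count}
    return [
        Counter({name: count for name, count in features.items() if name in kept})
        for features in doc_features
    ]


# Toy corpus: (tokens, dependency pairs) per document.
docs = [
    (["oil", "price", "rise"], [("rise", "price"), ("price", "oil")]),
    (["oil", "price", "fall"], [("fall", "price"), ("price", "oil")]),
]
doc_features = [build_features(tokens, deps) for tokens, deps in docs]
pruned = prune(doc_features, min_count=2)
```

With `min_count=2`, only the features that occur in both toy documents survive, so each pruned vector is shorter than its unpruned counterpart. Using separate thresholds for words and for dependencies would be a natural extension of this sketch.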



Acknowledgments
This work was supported by the Boğaziçi University Research Fund under grant number 05A103D and by the Turkish State Planning Organization (DPT) under the TAM Project, number 2007K120610.
Cite this article
Özgür, L., Güngör, T. Optimization of dependency and pruning usage in text classification. Pattern Anal Applic 15, 45–58 (2012). https://doi.org/10.1007/s10044-010-0195-5
DOI: https://doi.org/10.1007/s10044-010-0195-5