Optimization of dependency and pruning usage in text classification

Özgür, Levent; Güngör, Tunga

doi:10.1007/s10044-010-0195-5

Theoretical Advances
Published: 23 December 2010

Volume 15, pages 45–58, (2012)

Pattern Analysis and Applications

Abstract

In this study, a comprehensive analysis of the lexical dependency and pruning concepts for the text classification problem is presented. Dependencies are included in the feature vector as an extension to the standard bag-of-words approach. The pruning process filters features with low frequencies so that fewer but more informative features remain in the solution vector. The pruning levels for words, dependencies, and dependency combinations for different datasets are analyzed in detail. The main motivation in this work is to make use of dependencies and pruning efficiently in text classification and to achieve more successful results using much smaller feature vector sizes. Three different datasets were used in the experiments and statistically significant improvements for most of the proposed approaches were obtained.
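The idea of extending a bag-of-words vector with dependency features and then pruning low-frequency features can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input layout (each document as a list of word tokens plus a list of (head, dependent) pairs from some parser), the use of total corpus counts as the frequency measure, and the threshold values are all assumptions made for illustration. Separate thresholds are used for words and dependencies because the abstract describes the pruning levels as being analyzed separately for each feature type.

```python
from collections import Counter


def build_pruned_vocabulary(docs, word_min_count=2, dep_min_count=2):
    """Return the list of features that survive frequency pruning.

    docs is a list of (words, deps) pairs: words is a list of tokens and
    deps is a list of (head, dependent) pairs. Parsing is assumed to have
    been done elsewhere; this layout is hypothetical.
    """
    word_counts = Counter()
    dep_counts = Counter()
    for words, deps in docs:
        word_counts.update(words)
        dep_counts.update(deps)

    # Keep only features whose corpus count reaches the pruning level
    # for their feature type (an assumed definition of "frequency").
    kept_words = [("word", w) for w, c in word_counts.items() if c >= word_min_count]
    kept_deps = [("dep", d) for d, c in dep_counts.items() if c >= dep_min_count]
    return sorted(kept_words) + sorted(kept_deps)


def vectorize(doc, vocabulary):
    """Map one document to a count vector over the pruned vocabulary."""
    words, deps = doc
    counts = Counter(("word", w) for w in words)
    counts.update(("dep", d) for d in deps)
    # Counter returns 0 for features absent from the document.
    return [counts[feature] for feature in vocabulary]


if __name__ == "__main__":
    toy_docs = [
        (["stocks", "rose", "sharply"], [("rose", "stocks"), ("rose", "sharply")]),
        (["stocks", "fell"], [("fell", "stocks")]),
        (["prices", "rose"], [("rose", "prices")]),
    ]
    vocab = build_pruned_vocabulary(toy_docs, word_min_count=2, dep_min_count=1)
    for doc in toy_docs:
        print(vectorize(doc, vocab))
```

Raising either threshold shrinks the vocabulary, which is the trade-off between feature vector size and retained information that the study sets out to optimize.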

Acknowledgments

This work was supported by the Boğaziçi University Research Fund under the Grant Number 05A103D and the Turkish State Planning Organization (DPT) under the TAM Project, number 2007K120610.

Author information

Authors and Affiliations

Department of Computer Engineering, Boğaziçi University, Bebek, 34342, Istanbul, Turkey
Levent Özgür & Tunga Güngör

Corresponding author

Correspondence to Levent Özgür.

About this article

Cite this article

Özgür, L., Güngör, T. Optimization of dependency and pruning usage in text classification. Pattern Anal Applic 15, 45–58 (2012). https://doi.org/10.1007/s10044-010-0195-5

Received: 17 November 2009
Accepted: 29 November 2010
Published: 23 December 2010
Issue Date: February 2012
DOI: https://doi.org/10.1007/s10044-010-0195-5
