Two-Stage Feature Selection for Text Classification

  • Conference paper

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 363)

Abstract

In this paper, we focus on feature coverage policies used for feature selection in the text classification domain. Two alternative policies are discussed and compared: corpus-based and class-based selection of features. We analyze pruning and keyword selection in detail by varying the parameters of the policies, and we identify the optimal usage patterns. In addition, by combining the optimal forms of these methods, we propose a novel two-stage feature selection approach. Experiments on three independent datasets show that the proposed method yields a statistically significant increase in classifier success rates over the traditional methods.
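
The abstract does not give the scoring metric, the thresholds, or the order of the stages, so the following is only a minimal sketch of how the two coverage policies and a two-stage pipeline might look. It assumes that pruning comes first and keyword selection second, and it uses per-class document frequency as a stand-in score. All function names, the `min_df` and `k` parameters, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def prune(docs, min_df=2):
    """Assumed stage 1: drop terms found in fewer than min_df documents."""
    df = Counter(term for tokens in docs for term in set(tokens))
    return [[t for t in tokens if df[t] >= min_df] for tokens in docs]

def class_term_scores(docs, labels):
    """Illustrative score: document frequency of each term within each class."""
    scores = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            scores[label][term] += 1
    return scores

def top_k(counter, k):
    """Top-k terms by score, ties broken alphabetically for determinism."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}

def corpus_based(scores, k):
    """One shared keyword set: the k best terms over the whole corpus."""
    total = Counter()
    for per_class in scores.values():
        total.update(per_class)
    return top_k(total, k)

def class_based(scores, k):
    """One keyword set per class: the k best terms within each class."""
    return {label: top_k(per_class, k) for label, per_class in scores.items()}

def two_stage(docs, labels, min_df=2, k=2, policy=class_based):
    """Assumed pipeline: prune rare terms, then apply a selection policy."""
    pruned = prune(docs, min_df)
    return policy(class_term_scores(pruned, labels), k)

# Hypothetical toy corpus: three "sport" and two "politics" documents.
docs = [
    ["goal", "match", "team", "the"],
    ["team", "goal", "the"],
    ["match", "team", "the", "rare"],
    ["vote", "election", "the"],
    ["election", "vote", "party", "the"],
]
labels = ["sport", "sport", "sport", "politics", "politics"]

shared = two_stage(docs, labels, policy=corpus_based)
per_class = two_stage(docs, labels, policy=class_based)
print(shared)
print(per_class)
```

On this toy corpus the corpus-based policy spends its whole budget on terms dominated by the larger class, while the class-based policy reserves slots for the smaller class. That is the kind of coverage difference the two policies are meant to capture, though the paper's actual metrics and parameter settings may differ.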

Acknowledgments

This work was supported by the Boğaziçi University Research Fund under grant number 05A103D and by the Turkish State Planning Organization (DPT) under the TAM Project, number 200K120610.

Author information

Corresponding author

Correspondence to Levent Özgür.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Özgür, L., Güngör, T. (2016). Two-Stage Feature Selection for Text Classification. In: Abdelrahman, O., Gelenbe, E., Gorbil, G., Lent, R. (eds) Information Sciences and Systems 2015. Lecture Notes in Electrical Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-319-22635-4_30

  • DOI: https://doi.org/10.1007/978-3-319-22635-4_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22634-7

  • Online ISBN: 978-3-319-22635-4

  • eBook Packages: Engineering, Engineering (R0)
