
Using the absolute difference of term occurrence probabilities in binary text categorization


Abstract

In this study, the differences among widely used weighting schemes are examined by ordering terms according to their discriminative abilities, using a recently developed framework that expresses term weights in terms of the ratio and the absolute difference of term occurrence probabilities. Having observed that the ordering of terms depends on the weighting scheme under consideration, it is argued that this can be explained by the way different schemes use term occurrence differences in generating term weights. It is then proposed that relevance frequency, which has been shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two different datasets show that improved F1 scores can be achieved.
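For concreteness, the following is a minimal sketch of the two quantities named in the abstract, assuming a simple bag-of-words setup with binary document labels. The function names and toy data are illustrative only, and the relevance frequency formula follows the commonly used definition rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive and negative training documents containing the term.

```python
import math

def term_counts(docs, labels, term):
    """Count how many positive and negative documents contain the term.

    docs   : list of token sets (one per document)
    labels : list of binary labels (1 = positive category, 0 = negative)
    """
    a = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)  # positive docs containing term
    c = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)  # negative docs containing term
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return a, c, n_pos, n_neg

def relevance_frequency(a, c):
    """Relevance frequency, rf = log2(2 + a / max(1, c))."""
    return math.log2(2 + a / max(1, c))

def abs_occurrence_difference(a, c, n_pos, n_neg):
    """Absolute difference of term occurrence probabilities,
    |P(t | positive) - P(t | negative)|."""
    return abs(a / n_pos - c / n_neg)

# Toy example: four documents, two per category.
docs = [{"stock", "market"}, {"market", "trade"}, {"goal", "match"}, {"match", "team"}]
labels = [1, 1, 0, 0]
a, c, n_pos, n_neg = term_counts(docs, labels, "market")
print(relevance_frequency(a, c), abs_occurrence_difference(a, c, n_pos, n_neg))
```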



Author information


Corresponding author

Correspondence to Hakan Altınçay.


About this article

Cite this article

Altınçay, H., Erenel, Z. Using the absolute difference of term occurrence probabilities in binary text categorization. Appl Intell 36, 148–160 (2012). https://doi.org/10.1007/s10489-010-0250-3
