Abstract
In this study, the differences among widely used term weighting schemes are examined by ordering terms according to their discriminative abilities, using a recently developed framework that expresses term weights in terms of the ratio and the absolute difference of term occurrence probabilities. Having observed that the ordering of terms depends on the weighting scheme under consideration, it is shown that this dependence can be explained by the way different schemes use term occurrence differences to generate term weights. It is then proposed that relevance frequency, which has been shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two different datasets show that improved F1 scores can be achieved.
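To make the notion of ordering terms by discriminative ability concrete, the following is a minimal sketch of the relevance frequency (rf) scheme mentioned above, as defined by Lan et al. (2009): rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive- and negative-class documents containing the term. The term names and counts below are hypothetical illustrative values, not data from the paper.

```python
import math

def relevance_frequency(a, c):
    """Relevance frequency (rf) of a term, following Lan et al. (2009):
    rf = log2(2 + a / max(1, c)), where
    a = number of positive-class documents containing the term,
    c = number of negative-class documents containing the term."""
    return math.log2(2 + a / max(1, c))

# Order terms by their rf scores (most discriminative first).
# Counts are hypothetical: (positive-doc count, negative-doc count).
counts = {"term_x": (90, 10), "term_y": (50, 50), "term_z": (5, 95)}
ranking = sorted(counts, key=lambda t: relevance_frequency(*counts[t]),
                 reverse=True)
print(ranking)  # term_x ranks first: it occurs mostly in positive documents
```

Note that rf depends only on the ratio a/c, not on the absolute difference a − c; the paper's proposal is motivated by how other schemes additionally exploit that absolute difference.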
Altınçay, H., Erenel, Z. Using the absolute difference of term occurrence probabilities in binary text categorization. Appl Intell 36, 148–160 (2012). https://doi.org/10.1007/s10489-010-0250-3