Abstract
In this study, the differences among widely used term weighting schemes are examined by ordering terms according to their discriminative abilities, using a recently developed framework that expresses term weights in terms of the ratio and the absolute difference of term occurrence probabilities. Having observed that the ordering of terms depends on the weighting scheme under consideration, it is shown that this dependence can be explained by the way different schemes use term occurrence differences to generate term weights. It is then proposed that relevance frequency, which has been shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two different datasets show that improved F1 scores can be achieved.
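To make the notion of ordering terms by discriminative ability concrete, the following is a minimal sketch of the relevance frequency (rf) scheme mentioned above, as defined by Lan et al. (2009): rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive- and negative-class documents containing the term. The term names and counts below are hypothetical illustrative values, not data from the paper.

```python
import math

def relevance_frequency(a, c):
    """Relevance frequency (rf) of a term, following Lan et al. (2009):
    rf = log2(2 + a / max(1, c)), where
    a = number of positive-class documents containing the term,
    c = number of negative-class documents containing the term."""
    return math.log2(2 + a / max(1, c))

# Order terms by their rf scores (most discriminative first).
# Counts are hypothetical: (positive-doc count, negative-doc count).
counts = {"term_x": (90, 10), "term_y": (50, 50), "term_z": (5, 95)}
ranking = sorted(counts, key=lambda t: relevance_frequency(*counts[t]),
                 reverse=True)
print(ranking)  # term_x ranks first: it occurs mostly in positive documents
```

Note that rf depends only on the ratio a/c, not on the absolute difference a − c; the paper's proposal is motivated by how other schemes additionally exploit that absolute difference.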
Altınçay, H., Erenel, Z. Using the absolute difference of term occurrence probabilities in binary text categorization. Appl Intell 36, 148–160 (2012). https://doi.org/10.1007/s10489-010-0250-3