Combining supervised term-weighting metrics for SVM text classification with extended term representation

Haddoud, Mounia; Mokhtari, Aïcha; Lecroq, Thierry; Abdeddaïm, Saïd

doi:10.1007/s10115-016-0924-1

Regular Paper
Published: 19 February 2016

Knowledge and Information Systems, volume 49, pages 909–931 (2016)

Mounia Haddoud¹˒² (ORCID: orcid.org/0000-0003-3556-7755), Aïcha Mokhtari¹, Thierry Lecroq² & Saïd Abdeddaïm²

1204 Accesses
58 Citations

Abstract

The accuracy of a text classification method based on an SVM learner depends on the weighting metric used to assign a weight to each term. Weighting metrics are classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. A supervised metric should be highly informative about the relation of a document term to a category, and discriminative in separating the positive documents from the negative documents for this category. In this paper, we propose 80 metrics never before used for the term-weighting problem and compare them to 16 functions from the literature. Many of these metrics were initially proposed for other data mining problems: feature selection, classification rules and term collocations. While many previous works have shown the merits of using a particular metric, our experience suggests that the results obtained by such metrics can be highly dependent on the label distribution in the corpus and on the performance measure used (microaveraged or macroaveraged $F_1$-score). The solution that we propose is to combine the metrics in order to improve classification. More precisely, we show that an SVM classifier which combines the outputs of SVM classifiers that use different metrics performs well in all situations. The second main contribution of this paper is an extended term representation for the vector space model that significantly improves the predictions of the text classifier.
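To make the supervised/unsupervised distinction concrete, the following minimal sketch contrasts an unsupervised weight (inverse document frequency) with a supervised one (relevance frequency, rf, a weight known from the term-weighting literature). It is an illustration only: the abstract does not say which metrics the paper evaluates, and the toy corpus, the function names and the choice of rf are all assumptions made here.

```python
import math
from collections import Counter


def idf(docs, term):
    """Unsupervised weight: uses document counts only, never the labels."""
    df = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / df) if df else 0.0


def rf(docs, labels, term):
    """Supervised weight (relevance frequency): log2(2 + a / max(1, c)),
    where a and c are the numbers of positive and negative training
    documents that contain the term."""
    a = sum(1 for d, y in zip(docs, labels) if y and term in d)
    c = sum(1 for d, y in zip(docs, labels) if not y and term in d)
    return math.log2(2 + a / max(1, c))


def weight_document(doc, global_weight):
    """Multiply each raw term frequency by a per-term global weight."""
    tf = Counter(doc)
    return {t: tf[t] * global_weight(t) for t in tf}


# Hypothetical toy corpus: tokenized documents, True marks the positive category.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "harvest", "price"],
    ["oil", "price", "fall"],
    ["oil", "export", "rise"],
]
labels = [True, True, False, False]

# "wheat" and "oil" each occur in two documents, so idf cannot tell them apart.
print(idf(docs, "wheat"), idf(docs, "oil"))
# rf uses the labels: "wheat" occurs only in positive documents, "oil" only in negative ones.
print(rf(docs, labels, "wheat"), rf(docs, labels, "oil"))
# Weighted vector for the first document under the supervised scheme.
print(weight_document(docs[0], lambda t: rf(docs, labels, t)))
```

In this toy corpus both terms receive the same idf, while rf gives the term concentrated in the positive category the larger weight. That is the kind of category-specific information a supervised metric is meant to capture.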

Author information

Authors and Affiliations

1. RIIMA, USTHB, BP 32, El-Alia, Bab-Ezzouar, 16111, Algiers, Algeria: Mounia Haddoud & Aïcha Mokhtari
2. LITIS, Université de Rouen, 76821, Mont-Saint-Aignan Cedex, France: Mounia Haddoud, Thierry Lecroq & Saïd Abdeddaïm

Corresponding author

Correspondence to Mounia Haddoud.

About this article

Cite this article

Haddoud, M., Mokhtari, A., Lecroq, T. et al. Combining supervised term-weighting metrics for SVM text classification with extended term representation. Knowl Inf Syst 49, 909–931 (2016). https://doi.org/10.1007/s10115-016-0924-1

Received: 18 March 2015
Revised: 25 November 2015
Accepted: 03 February 2016
Published: 19 February 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10115-016-0924-1
