Abstract
Term weighting is a well-known preprocessing step in text classification that assigns a weight to each term in every document to improve classification performance. Most methods proposed in the literature are traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents. However, they are unsuitable for social network data, which consist of short texts of limited length characterized by sparsity and noise. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts through term strength and term distribution, and assesses its effect in a high-dimensional vector space against the term weighting schemes that serve as baselines in traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset of approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes collected using the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but predominantly outperform the unsupervised approaches. The proposed approach, SW, achieves better accuracy than the other methods. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
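To make the limitation of frequency-based weighting concrete, the following is a minimal sketch of an unsupervised TF-IDF baseline applied to tweet-length texts. It is not the proposed SW scheme, whose formula is not given in this abstract. The function name `tf_idf`, the toy tweets, and the unsmoothed `log(N / df)` variant of inverse document frequency are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Illustrative unsupervised baseline: weight = tf * log(N / df).
    This is an assumed textbook variant, not the SW scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy tweets, already tokenized and lowercased.
tweets = [
    ["new", "phone", "battery", "great"],
    ["phone", "screen", "cracked"],
    ["battery", "life", "great"],
    ["love", "new", "screen"],
]

weighted = tf_idf(tweets)

# Largest within-document term frequency across the toy corpus.
max_tf = max(max(Counter(doc).values()) for doc in tweets)
```

In this toy corpus every term occurs once per tweet, so the term-frequency factor is constant and the weights are driven entirely by document frequency. This is the kind of sparsity that motivates weighting schemes which also use class information.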
Ethics declarations
Conflict of interest
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
About this article
Cite this article
Alsmadi, I., Hoon, G.K. Term weighting scheme for short-text classification: Twitter corpuses. Neural Comput & Applic 31, 3819–3831 (2019). https://doi.org/10.1007/s00521-017-3298-8
DOI: https://doi.org/10.1007/s00521-017-3298-8