Term weighting scheme for short-text classification: Twitter corpuses

Alsmadi, Issa; Hoon, Gan Keng

doi:10.1007/s00521-017-3298-8

Original Article
Published: 06 January 2018

Volume 31, pages 3819–3831, (2019)
Neural Computing and Applications

Abstract

Term weighting is a well-known preprocessing step in text classification that assigns a weight to each term in every document in order to improve classification performance. Most methods proposed in the literature follow traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents, but they are unsuitable for social network data, where texts are short and characterized by sparsity and noise. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts by using term strength and term distribution. Its effect in a high-dimensional vector space is assessed against the term weighting schemes that serve as baselines in traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression classifiers. The first, the Sanders dataset, is a benchmark that includes approximately 5000 tweets in four categories. The second is a self-collected dataset of roughly 1500 tweets in six classes, gathered through the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the performance of supervised approaches varies but is predominantly better than that of unsupervised approaches, and that SW achieves higher accuracy than the other schemes. SW can cope with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, reaching F-measures of 80.83 and 90.64 on the Sanders dataset and the self-collected dataset, respectively.
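The contrast the abstract draws between unsupervised and supervised term weighting can be illustrated with a small sketch. The abstract does not give the SW formula, so SW is not implemented here. As a stand-in, the sketch compares plain inverse document frequency (unsupervised) with relevance frequency, rf = log2(2 + a / max(1, c)), a supervised weight known from the text categorization literature. The toy tweets, labels, and function names are all hypothetical.

```python
import math
from collections import Counter

# Toy labeled tweets (hypothetical; not taken from either dataset in the paper).
docs = [
    ("new phone battery is great", "tech"),
    ("phone screen cracked again", "tech"),
    ("great match last night", "sport"),
    ("the team won the match", "sport"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer; real tweets would need more cleaning."""
    return text.lower().split()


def idf(term, docs):
    """Unsupervised weight: log(N / df). Class labels are ignored."""
    df = sum(1 for text, _ in docs if term in tokenize(text))
    return math.log(len(docs) / df) if df else 0.0


def rf(term, docs, positive):
    """Supervised weight (relevance frequency, a stand-in for SW).

    a = documents of the positive class that contain the term
    c = documents of all other classes that contain the term
    """
    a = sum(1 for text, label in docs
            if label == positive and term in tokenize(text))
    c = sum(1 for text, label in docs
            if label != positive and term in tokenize(text))
    return math.log2(2 + a / max(1, c))


def weight_vector(text, docs, positive):
    """Map each term of `text` to a (tf*idf, tf*rf) pair."""
    tf = Counter(tokenize(text))
    return {t: (f * idf(t, docs), f * rf(t, docs, positive))
            for t, f in tf.items()}


if __name__ == "__main__":
    for term in ("phone", "great", "match"):
        print(term, round(idf(term, docs), 3), round(rf(term, docs, "tech"), 3))
```

In this toy corpus "phone" and "great" each occur in two of the four tweets, so idf cannot tell them apart, whereas the supervised weight ranks "phone" higher for the "tech" class because it never appears outside that class. This is only meant to show why label-aware weights can help on short, sparse texts; it says nothing about how SW itself combines term strength and term distribution.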

Author information

Authors and Affiliations

School of Computer Sciences, Universiti Sains Malaysia, USM, 11800, Gelugor, Pulau Pinang, Malaysia
Issa Alsmadi & Gan Keng Hoon

Corresponding author

Correspondence to Issa Alsmadi.

Ethics declarations

Conflict of interest

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

About this article

Cite this article

Alsmadi, I., Hoon, G.K. Term weighting scheme for short-text classification: Twitter corpuses. Neural Comput & Applic 31, 3819–3831 (2019). https://doi.org/10.1007/s00521-017-3298-8

Received: 19 February 2017
Accepted: 21 December 2017
Published: 06 January 2018
Issue Date: August 2019
DOI: https://doi.org/10.1007/s00521-017-3298-8
