Towards filtering undesired short text messages using an online learning approach with semantic indexing
Introduction
Combating spam is an important problem in the online world. Since the word “spam” was first used to describe an unsolicited bulk message, this plague has “infected” almost all popular types of text-based electronic communication. Although email spam is the most widely recognized form, spam is also spreading to short-text applications, such as blogs, instant messaging, mobile phone messages (SMS), and social media.
The volume of text information is increasing rapidly and a significant amount of it is spam, which makes manual selection impractical and gives automatic expert and intelligent systems for spam detection an important role in filtering undesired content (Alsaleh, Alarifi, Al-Quayed, & Al-Salman, 2015). This problem is an example of adversarial classification, in which the spammers constantly attempt to evade filtering, while the predictive models try to adapt to continuously evolving spamming techniques (Bratko, Filipič, Cormack, Lynam, & Zupan, 2006; Dalvi, Domingos, Mausam, Sanghai, & Verma, 2004).
According to Akismet, a spam filtering service for blog comments, its systems have kept spam off the web at an average rate of about 7.5 million spam comments per hour. On average, legitimate blog comments account for less than 5% of the total messages published. A report by Nexgate, a computer security company, showed that on social media sites such as Facebook and YouTube, 1 in 200 messages contains spam, including lures to adult content and malware. In fact, experts estimate that as many as 40% of social network accounts are used to disseminate spam.
Some types of spam can harm users. On blogs, for instance, text comments represent between 15% and 30% of the total blog content, and are therefore an inseparable part of each blog and a motivation for the authors to keep publishing (Alberto, Lochter, & Almeida, 2015a; Mishne & Glance, 2006). If this interaction is flooded with undesired comments, the quality of the information drops and search engines can be confused, which directly affects reader traffic (Mishne & Glance, 2006).
While some users are aware of spam, most of them lack the knowledge to deal with it and are often lured into problems such as hacking and phishing (Ridzuan, Potdar, & Hui, 2012). Traditional methods for preventing it, such as user registration, CAPTCHA, and IP blacklisting, might limit the ability of automatic spam bots, but they also tend to hinder legitimate users’ experience (Alsaleh et al., 2015). Moreover, spam messages are sent not only by bots, but also by people who pose as legitimate users and attempt to post messages with links and advertisements (Alberto et al., 2015a). To make matters worse, these messages are usually very short and rife with slang, acronyms, symbols, and misspelled words, which hinder both the computational representation of their content and the learning process needed to filter them automatically.
Many traditional text categorization techniques cannot be employed to deal with real spam problems in short text messages because they require all the examples to be stored in memory or presented simultaneously, in a process known as batch or offline learning. The predictive model created by offline classification methods is static, which harms spam detection performance, since spammers tend to adapt and change the message style to slip through filtering techniques (Bratko et al., 2006). Moreover, since the messages are usually very short and written with arbitrary grammar, they are prone to redundancy, polysemy, and synonymy, which make the computational representation of the samples more difficult and thus affect the learning process.
Given this scenario, in this study we evaluated MDLText, a new text classification approach based on the minimum description length (MDL) principle (Rissanen, 1978), for filtering spam in short and noisy text messages. This method can be easily deployed in an expert system for spam detection and offers many desirable characteristics, such as (1) incremental learning, which is necessary for online and dynamic scenarios, and (2) an inherent ability to prevent overfitting, because it selects the model that fits the data well while naturally favoring less complex models.
We conducted a comprehensive performance evaluation using the proposed text classifier in online spam detection, and compared our approach with benchmark online learning methods. We also investigated the impact of applying text normalization and semantic indexing techniques to avoid common text problems and improve sample computational representation. In addition, based on our findings, we proposed a new ensemble approach that combines the predictions obtained by the classifiers using the original text messages and their variations generated after applying text normalization and semantic indexing.
In summary, the proposed MDLText categorization method, which is robust and online, is assisted by text processing techniques that remove noise and enrich the text samples by using background expertise. The approaches proposed in this paper have expert-level competence and provide powerful and flexible means for obtaining solutions to the spam detection problem on short text messages.
The remainder of this study is organized as follows: in Section 2, we briefly describe the related work available in the literature. The basic concepts of the MDL principle are given in Section 3. In Section 4, we present the text classification approach. In Section 5, we discuss the main concepts about text normalization and semantic indexing techniques. In Section 6 we present the ensemble approach. Section 7 describes our experimental setup. Section 8 is devoted to our experimental results. Finally, Section 9 concludes the study and offers guidelines for future work.
Related work
Some years ago, the main target of spammers was email. However, with its decreasing popularity, and mainly due to the popularization of smartphones, spam has invaded electronic platforms across all media, and new types of spam have been emerging. Many of them are spreading to applications of short text messages, such as short message service (SMS), online instant messages (IM), comments on blogs, and social media. In this section, we first discuss the main environments
The MDL principle
The MDL principle was introduced by Rissanen (1978, 1983) for the problem of model selection. It is based on the idea that the model that best fits the data also provides the most compact description of the data. The more regularity is detected, the more the model has learned about the data (Grünwald, 2005). In terms of coding, this means the best model is the one that provides the shortest description length for the given data.
Mathematically, given a set of potential models the
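In its common two-part form, the principle can be written as the sketch below. The notation is an assumption made for illustration: \mathcal{M} denotes the set of candidate models, L(M) the description length of a model M, and L(X \mid M) the description length of the data X when encoded with the help of M. The paper's own equation may use different symbols.

```latex
M_{\mathrm{mdl}} = \arg\min_{M \in \mathcal{M}} \bigl[ L(M) + L(X \mid M) \bigr]
```

When every candidate model is assumed to cost the same to describe, the L(M) term is constant and only L(X \mid M) affects the choice.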
Mathematical basis of MDLText
Given an unlabeled text document d, the MDLText (Silva et al., 2017) uses the main equation of the MDL principle (Eq. 1) to predict the class of the document. The set of potential classes represents the set of potential models M, while d represents the data X. Therefore, d receives the label j, which corresponds to class cj with the minimum overall description length related to d:
We have ignored the description length of the potential classes (models) because
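The decision rule described above, assigning the class whose model encodes the document in the fewest bits, can be illustrated with a deliberately simplified Python sketch. This is not the MDLText algorithm: the class name `ToyDescriptionLengthClassifier`, the additive smoothing, and the per-token code length of -log2 of a smoothed relative frequency are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class ToyDescriptionLengthClassifier:
    """Toy incremental classifier: picks the class that encodes a document in the fewest bits.

    Simplified illustration only; it is not the MDLText algorithm itself.
    """

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                # hypothetical additive-smoothing constant
        self.token_counts = defaultdict(Counter)  # class label -> token frequencies
        self.total_tokens = Counter()             # class label -> number of tokens seen
        self.vocabulary = set()

    def update(self, tokens, label):
        """Incremental training step: add one labeled document to the counts."""
        self.token_counts[label].update(tokens)
        self.total_tokens[label] += len(tokens)
        self.vocabulary.update(tokens)

    def description_length(self, tokens, label):
        """Bits needed to encode the tokens with the class's smoothed token frequencies."""
        vocab_size = len(self.vocabulary) + 1     # one extra slot for unseen tokens
        denom = self.total_tokens[label] + self.smoothing * vocab_size
        bits = 0.0
        for tok in tokens:
            prob = (self.token_counts[label][tok] + self.smoothing) / denom
            bits += -math.log2(prob)
        return bits

    def predict(self, tokens):
        """Return the class label with the minimum description length for the document."""
        return min(self.token_counts, key=lambda c: self.description_length(tokens, c))


clf = ToyDescriptionLengthClassifier()
clf.update("win a free prize now".split(), "spam")
clf.update("free entry claim your prize".split(), "spam")
clf.update("are we still meeting for lunch".split(), "ham")
clf.update("see you at home tonight".split(), "ham")
print(clf.predict("claim your free prize".split()))
```

Because `update` only increments counts, the model can be refined one message at a time, which is the property that makes this family of methods suitable for online learning.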
Text normalization and semantic indexing
Messages propagated through recent electronic means of communication, over the Internet or smartphones, are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations. With such characteristics, established text categorization approaches have their performance seriously degraded when applied to filter spam in these messages. However, in a recent study, Almeida et al. (2016) demonstrated that traditional spam filters can have their performance greatly improved by the employment
Ensemble of predictions by combining different expansions
Since we can create ten new processed text documents from each original message, we can combine them in an ensemble of classifiers instead of using them individually. Therefore, in this study, we evaluate a new ensemble approach that combines the individual predictions obtained using the original messages with the ones obtained using the variations generated by the TextExpansion tool (Figure 5).
As shown in Figure 5, there is one predictive model generated using the original training samples and ten other
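One simple way to combine per-model predictions is a majority vote, sketched below in Python. The combination rule is an assumption, since the text here does not state how the predictions are merged. The functions `majority_vote`, `ensemble_predict`, and `keyword_model` are hypothetical helpers, and the three stub models stand in for the eleven models mentioned above.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, views):
    """Combine one prediction per (model, text view) pair by majority vote.

    models: callables mapping a text to a label (hypothetical interface)
    views:  the original message followed by its processed variants, one per model
    """
    votes = [model(view) for model, view in zip(models, views)]
    return majority_vote(votes)


def keyword_model(keyword):
    """Stub classifier for the demo: flags a text as spam if it contains the keyword."""
    return lambda text: "spam" if keyword in text else "ham"


models = [keyword_model("free"), keyword_model("prize"), keyword_model("lunch")]
views = ["free prize 4 u", "free prize for you", "complimentary award for you"]
result = ensemble_predict(models, views)
print(result)
```

With an odd number of binary classifiers, such as one model for the original text plus ten for its variations, a majority vote can never end in a tie.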
Experimental settings
To simulate a real spam filtering scenario, we consider that only a small number of text messages are available to train the classifier (20% of the messages in each class). Next, one message at a time is presented to the classifier, which makes its prediction. Then, the classifier receives the user feedback and calculates the loss suffered. If the loss is greater than 0, the model is updated with the true label. The overall process is described in Algorithm 1.
In Algorithm 1, we consider
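The protocol described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the 0/1 loss, the function `run_online_protocol`, and the stand-in learner `TokenVoteClassifier` are hypothetical, and the initial batch plays the role of the 20% of messages used for initial training.

```python
from collections import Counter, defaultdict


class TokenVoteClassifier:
    """Hypothetical stand-in learner: each known token votes for the labels it was seen with."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # token -> label frequencies

    def update(self, tokens, label):
        for tok in tokens:
            self.counts[tok][label] += 1

    def predict(self, tokens):
        votes = Counter()
        for tok in tokens:
            if tok in self.counts:
                votes.update(self.counts[tok])
        # Default to "ham" when no token is known (an arbitrary choice for the demo).
        return votes.most_common(1)[0][0] if votes else "ham"


def run_online_protocol(classifier, initial_batch, stream):
    """Train on a small initial batch, then predict each message and update only on errors."""
    for tokens, label in initial_batch:
        classifier.update(tokens, label)

    predictions, mistakes = [], 0
    for tokens, true_label in stream:
        predicted = classifier.predict(tokens)
        predictions.append(predicted)
        loss = 0 if predicted == true_label else 1   # assumed 0/1 loss from user feedback
        if loss > 0:
            classifier.update(tokens, true_label)    # learn from the true label
            mistakes += 1
    return predictions, mistakes


initial = [(["free", "prize"], "spam"), (["lunch", "today"], "ham")]
stream = [
    (["free", "entry"], "spam"),
    (["win", "cash"], "spam"),
    (["cash", "prize"], "spam"),
    (["lunch", "tomorrow"], "ham"),
]
predictions, mistakes = run_online_protocol(TokenVoteClassifier(), initial, stream)
print(predictions, mistakes)
```

In the demo, the second message is misclassified because none of its tokens are known yet; the update triggered by that error lets the model classify the third message correctly.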
Results
Table 3 shows the average SC, BH, and MCC obtained in 50 runs of the experimental scheme described in Algorithm 1. For each evaluated method and dataset, we present the results obtained with: (1) the original text samples (column “Orig.”), (2) the expanded text samples in which the best MCC score was obtained (column “Exp.”), and (3) the ensemble approach (column “Ens.”). The results are sorted by MCC.
Bold values indicate the best score for each one of the columns for each dataset. The
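For reference, the reported measures can be computed from a confusion matrix as in the Python sketch below. The MCC formula is the standard Matthews correlation coefficient. Reading SC as the spam caught rate (true positive rate) and BH as the blocked ham rate (false positive rate) is an assumption, since the abbreviations are not defined in this excerpt; the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def spam_caught(tp, fn):
    """Assumed meaning of SC: fraction of spam messages that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def blocked_ham(fp, tn):
    """Assumed meaning of BH: fraction of legitimate messages that were blocked."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(spam_caught(tp, fn), blocked_ham(fp, tn), mcc(tp, fp, tn, fn))
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it more informative than accuracy when one class is much more frequent than the other.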
Conclusions
Spam has once again become a truly challenging problem. Besides being a classical type of adversarial classification problem, it increasingly demands online and dynamic prediction models. Due to the increasing popularity of smartphones, this plague is migrating fast to new means of electronic communication characterized by short text messages. In these environments, the text documents are usually very short and rife with slang, abbreviations, symbols, emoticons, and misspelled words
Acknowledgments
The authors are grateful for financial support from the Brazilian agencies FAPESP, Capes, and CNPq (grant 141089/2013-0).
References (57)
- et al. Semi-supervised learning using frequent itemset and ensemble learning for SMS classification. Expert Systems with Applications (2015)
- et al. Facing the spammers: A very effective approach to avoid junk e-mails. Expert Systems with Applications (2012)
- et al. Combating comment spam with machine learning approaches. Proceedings of the 14th international conference on machine learning and applications (ICMLA’15) (2015)
- et al. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research (2014)
- Fisher information and stochastic complexity. IEEE Transactions on Information Theory (1996)
- et al. SVM-based spam filter with active and online learning. Proceedings of the 15th text retrieval conference (TREC’06) (2006)
- et al. Post or block? Advances in automatically filtering undesired comments. Journal of Intelligent & Robotic Systems (2015)
- et al. TubeSpam: Comment spam filtering on YouTube. Proceedings of the 14th international conference on machine learning and applications (ICMLA’15) (2015)
- et al. An autonomous online malicious spam email detection system using extended RBF network. Proceedings of the 2015 international joint conference on neural networks (IJCNN’15) (2015)
- et al. Contributions to the study of SMS spam filtering: new collection and results. Proceedings of the 11th ACM symposium on document engineering (DOCENG’11) (2011)
- Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering. Knowledge-Based Systems
- Exponential differential document count – a feature selection factor for improving Bayesian filters accuracy. Proceedings of the 2006 MIT spam conference (SP’06), Cambridge, MA, USA
- A self-supervised approach to comment spam detection based on content analysis. International Journal of Information Security and Privacy
- A trust and reputation based anti-spim method. The 27th IEEE conference on computer communications (INFOCOM’08)
- Spam filtering using statistical data compression models. Journal of Machine Learning Research
- A second-order perceptron algorithm. SIAM Journal on Computing
- Contextual feature based one-class classifier approach for detecting video response spam on YouTube. Proceedings of the 11th annual international conference on privacy, security and trust (PST’13)
- A data mining based spam detection system for YouTube. Proceedings of the 8th international conference on digital information management (ICDIM’13)
- Spam filtering for short messages. Proceedings of the 16th ACM international conference on information and knowledge management (CIKM’07)
- Exact convex confidence-weighted learning. Proceedings of the 21st international conference on neural information processing systems (NIPS’08)
- Confidence-weighted linear classification for text categorization. Journal of Machine Learning Research
- Adaptive regularization of weight vectors. Machine Learning
- Adversarial classification. Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04)
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
- Pattern Classification
- Large margin classification using the perceptron algorithm. Machine Learning
- A new approximate maximal margin classification algorithm. Journal of Machine Learning Research
- Automated spam detection in short text messages