Elsevier

Expert Systems with Applications

Volume 83, 15 October 2017, Pages 314-325
Towards filtering undesired short text messages using an online learning approach with semantic indexing

https://doi.org/10.1016/j.eswa.2017.04.055

Highlights

  • A new classifier is presented to detect undesired short text comments.

  • The proposed approach is light, fast, multinomial and offers incremental learning.

  • The impact of applying text normalization and semantic indexing is studied.

  • The results indicate that the proposed techniques outperformed most of the compared approaches.

  • Text normalization and semantic indexing enhanced the classifiers' performance.

Abstract

The popularity and reach of short text messages in electronic communication have led spammers to use them to propagate undesired content. Such content often consists of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning, so most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, to the context of filtering undesired short text messages. The proposed approach supports incremental learning and, therefore, its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. Besides being dynamic, these messages are short and usually poorly written, rife with slang, symbols, and abbreviations that hinder text representation, learning, and filtering. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We showed that these techniques can improve the text content quality and, consequently, enhance the performance of expert systems for spam detection. Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantage of being independent of the classification method, and the results indicate that it is efficient at filtering undesired short text messages.

Introduction

Combating spam is an important problem in the online world. Since the first use of the word “spam” to describe an unsolicited bulk message, this plague has “infected” almost all popular types of text-based electronic communication. Although email spam is the most widely recognized form, spam is also spreading to short text message applications, such as blogs, instant messaging, mobile phone messaging (SMS), and social media.

The volume of text information is increasing rapidly and a significant amount of it is spam. This makes manual selection impractical and gives automatic expert and intelligent systems for spam detection an important role in filtering undesired content (Alsaleh, Alarifi, Al-Quayed, & Al-Salman, 2015). This problem is an example of adversarial classification, in which the spammers constantly attempt to evade filtering, while the predictive models try to adapt to continuously evolving spamming techniques (Bratko, Filipič, Cormack, Lynam, & Zupan, 2006; Dalvi, Domingos, Mausam, Sanghai, & Verma, 2004).

According to Akismet, a spam filtering service for blog comments, their systems have kept spam off the web at an average rate of about 7.5 million spam comments per hour. On average, legitimate blog comments account for less than 5% of the total messages published. A report by Nexgate, a computer security company, showed that on social media sites, such as Facebook and YouTube, 1 in 200 messages contains spam, including lures to adult content and malware. In fact, experts estimate that as many as 40% of social network accounts are used to disseminate spam.

Some types of spam can cause damage to users. On blogs, for instance, text comments represent between 15% and 30% of the total blog content, making them an inseparable part of each blog and a motivation for the authors to keep publishing (Alberto, Lochter, & Almeida, 2015a; Mishne & Glance, 2006). If this interaction is flooded with undesired comments, the quality of the information drops and search engines can be confused, which directly affects reader traffic (Mishne & Glance, 2006).

While some users are aware of spam, most of them lack the knowledge to deal with it and are often lured into problems such as hacking and phishing (Ridzuan, Potdar, & Hui, 2012). Traditional methods for preventing it, such as user registration, CAPTCHA, and IP blacklisting, might limit the ability of automatic spam bots, but they also tend to hinder legitimate users’ experience (Alsaleh et al., 2015). Moreover, spam messages are sent not only by bots, but also by people who pose as legitimate users and attempt to post messages with links and advertisements (Alberto et al., 2015a). To make matters worse, these messages are usually very short and rife with slang, acronyms, symbols, and misspelled words, which hinder both the computational representation of their content and the learning process needed to filter them automatically.

Many traditional text categorization techniques cannot be employed to deal with real spam problems in short text messages because they require all the examples to be stored in memory or to be presented simultaneously, a process known as batch or offline learning. The predictive model created by offline classification methods is static, which harms spam detection performance, since spammers tend to adapt and change the message style to slip through filtering techniques (Bratko et al., 2006). Moreover, since the messages are usually very short and written with arbitrary grammar, they can suffer from redundancy, polysemy, and synonymy, which make the computational representation of the samples more difficult and thus affect the learning process.

Given this scenario, in this study we evaluated the MDLText, a new text classification approach based on the minimum description length (MDL) principle (Rissanen, 1978), to filter spam on short and noisy text messages. This method can be easily deployed in an expert system for spam detection and offers many desirable characteristics, such as (1) incremental learning necessary for online and dynamic scenarios and (2) inherent ability to prevent overfitting because it selects the model that fits the data well, while it naturally favors less complex models.

We conducted a comprehensive performance evaluation using the proposed text classifier in online spam detection, and compared our approach with benchmark online learning methods. We also investigated the impact of applying text normalization and semantic indexing techniques to avoid common text problems and improve sample computational representation. In addition, based on our findings, we proposed a new ensemble approach that combines the predictions obtained by the classifiers using the original text messages and their variations generated after applying text normalization and semantic indexing.

In summary, the proposed MDLText categorization method is robust and online, and it is assisted by text processing techniques that remove noise and enrich the text samples by using background expertise. The approaches proposed in this paper have an expert-level competence and provide powerful and flexible means for obtaining solutions to the spam detection problem on short text messages.

The remainder of this study is organized as follows: in Section 2, we briefly describe the related work available in the literature. The basic concepts of the MDL principle are given in Section 3. In Section 4, we present the text classification approach. In Section 5, we discuss the main concepts about text normalization and semantic indexing techniques. In Section 6 we present the ensemble approach. Section 7 describes our experimental setup. Section 8 is devoted to our experimental results. Finally, Section 9 concludes the study and offers guidelines for future work.

Section snippets

Related work

Some years ago, the main target of spammers was email. However, with its decreasing popularity and mainly due to the popularization of smartphones, spam has invaded all electronic platforms across all media, and new types of spam keep emerging. Many of them are spreading to short text message applications, such as short message service (SMS), online instant messages (IM), comments on blogs, and social media. In this section, we first discuss the main environments

The MDL principle

The MDL principle was introduced by Rissanen (1978, 1983) for the problem of model selection. It is based on the idea that the model that best fits the data can also provide a more compact description of the data. The more regularity detected, the more the model has learned about the data (Grünwald, 2005). In terms of coding, this means the best model is the one that provides the shortest description length for the given data.
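To make the coding view concrete, here is a minimal Python sketch of model selection by shortest description length. It is a toy illustration and not the formulation used in the paper: the two Bernoulli models, their names, and the model costs in bits are all assumptions chosen for the example, and the total length is taken as the cost of describing the model plus the cost of encoding the data with it.

```python
import math

def data_code_length(bits, p_one):
    """Bits needed to encode a 0/1 sequence under a Bernoulli(p_one) model."""
    return sum(-math.log2(p_one if b == 1 else 1.0 - p_one) for b in bits)

def select_model(bits, candidates):
    """Return the candidate with the shortest total description length.

    `candidates` maps a model name to (model_cost_in_bits, p_one). Both the
    names and the costs are made up for this illustration.
    """
    def total_length(name):
        model_cost, p_one = candidates[name]
        return model_cost + data_code_length(bits, p_one)
    return min(candidates, key=total_length)

# Mostly ones: a regularity that a good model should capture.
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
candidates = {
    "fair_coin": (1.0, 0.5),    # cheap to describe, ignores the regularity
    "biased_coin": (2.0, 0.8),  # costlier to describe, encodes the data compactly
}
best = select_model(data, candidates)
```

In this toy setting the biased model wins because the bits it saves when encoding the data outweigh its higher description cost, which is the trade-off the principle formalizes.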

Mathematically, given a set of potential models $M_1, M_2, \ldots, M_{|M|}$, the

Mathematical basis of MDLText

Given an unlabeled text document $d$, the MDLText (Silva et al., 2017) uses the main equation of the MDL principle (Eq. 1) to predict the class of the document. The set of potential classes $c_1, c_2, \ldots, c_{|C|}$ represents the set of potential models $M$, while $d$ represents the data $X$. Therefore, $d$ receives the label $j$, which corresponds to the class $c_j$ with the minimum overall description length related to $d$: $c(d) = \arg\min_{c_j} L(d \mid c_j)$.
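The decision rule can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the MDLText method itself: here the description length of a document given a class is assumed to be the sum of the code lengths of its tokens under smoothed per-class token frequencies, and the class name `ToyMDLClassifier`, the smoothing constant, and the sample messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class ToyMDLClassifier:
    """Simplified description-length classifier (not the authors' MDLText)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.token_counts = defaultdict(Counter)  # class -> token frequencies
        self.vocabulary = set()

    def update(self, tokens, label):
        """Incremental learning: add one labeled document to the counts."""
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def description_length(self, tokens, label):
        """Bits to encode the tokens using the statistics of one class."""
        counts = self.token_counts[label]
        # One extra vocabulary slot is reserved for unseen tokens.
        total = sum(counts.values()) + self.smoothing * (len(self.vocabulary) + 1)
        return sum(-math.log2((counts[t] + self.smoothing) / total) for t in tokens)

    def predict(self, tokens):
        """Pick the class whose statistics give the shortest description."""
        return min(self.token_counts,
                   key=lambda c: self.description_length(tokens, c))

clf = ToyMDLClassifier()
clf.update(["win", "free", "prize", "now"], "spam")
clf.update(["see", "you", "at", "lunch"], "ham")
label = clf.predict(["free", "prize"])
```

Because `update` only increments counters, the model can absorb one message at a time without retraining, which is the property that incremental learning requires.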

We have ignored the description length of the potential classes (models) because

Text normalization and semantic indexing

Messages propagated through recent means of electronic communication, over the Internet or smartphones, are usually very short and rife with idioms, slang, symbols, emoticons, and abbreviations. With such characteristics, established text categorization approaches have their performance seriously degraded when applied to filter spam in these messages. However, in a recent study, Almeida et al. (2016) demonstrated that traditional spam filters can have their performance greatly improved by the employment

Ensemble of predictions by combining different expansions

Considering we can create ten new processed text documents from each single original message, we can combine them in an ensemble of classifiers instead of using them individually. Therefore, in this study, we evaluate a new ensemble approach that combines the individual predictions obtained using the original messages with the ones generated by the TextExpansion tool (Figure 5).
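As a rough illustration of combining individual predictions, the sketch below applies one model per text variant and takes a majority vote. The combination rule is an assumption made for this example, since the excerpt does not state how the predictions are merged; the function names, the stub keyword model, and the sample variants are likewise hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Most frequent label; on a tie, the label that appeared first wins."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, variants):
    """Apply the i-th model to the i-th text variant and vote on the labels.

    `models` are callables mapping a text to a label; `variants` holds the
    original message followed by its processed versions (all hypothetical).
    """
    if len(models) != len(variants):
        raise ValueError("need exactly one text variant per model")
    return majority_vote([model(text) for model, text in zip(models, variants)])

# Stub model for illustration only: flags any text containing the token "free".
def keyword_model(text):
    return "spam" if "free" in text.lower().split() else "ham"

variants = ["get ur FREE ringtone", "get your free ringtone", "obtain gratis melody"]
models = [keyword_model] * len(variants)
label = ensemble_predict(models, variants)
```

Since the voting step only sees labels, any classifier can be plugged in as a model, which matches the claim that the ensemble is independent of the classification method.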

As shown in Figure 5, there is one predictive model generated using the original training samples and ten other

Experimental settings

To simulate a real spam filtering scenario, we consider that only a small number of text messages are available to train the classifier (20% of the messages in each class). Next, one message at a time is presented to the classifier, which makes its prediction. Then, the classifier receives the user feedback and calculates the loss suffered. If the loss is greater than 0, the predictive model is updated with the true label. The overall process is described in Algorithm 1.
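The protocol described above can be sketched as the following Python loop. It is an approximation under stated assumptions rather than a transcription of Algorithm 1: the loss is assumed to be a 0/1 loss, the initial set is taken as the first 20% of each class in arrival order (at least one message per class), and `TokenVoteLearner` is a hypothetical stand-in for any online classifier exposing `update` and `predict`.

```python
from collections import Counter, defaultdict

class TokenVoteLearner:
    """Hypothetical stand-in for any online classifier with update/predict."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, tokens, label):
        self.counts[label].update(tokens)

    def predict(self, tokens):
        # Score each class by how often it has seen the message's tokens.
        scores = {c: sum(cnt[t] for t in tokens) for c, cnt in self.counts.items()}
        return max(scores, key=scores.get)

def online_evaluation(learner, samples, initial_fraction=0.2):
    """Run the predict/feedback/update loop over (tokens, label) pairs."""
    by_class = defaultdict(list)
    for index, (_, label) in enumerate(samples):
        by_class[label].append(index)

    # Initial training set: the first 20% of the messages of each class
    # (at least one per class, an assumption for tiny datasets).
    initial = set()
    for indices in by_class.values():
        keep = max(1, int(len(indices) * initial_fraction))
        initial.update(indices[:keep])
    for index in sorted(initial):
        learner.update(*samples[index])

    mistakes = 0
    evaluated = 0
    for index, (tokens, true_label) in enumerate(samples):
        if index in initial:
            continue
        predicted = learner.predict(tokens)         # classifier predicts
        loss = 0 if predicted == true_label else 1  # feedback as 0/1 loss (assumed)
        evaluated += 1
        mistakes += loss
        if loss > 0:                                # update only after a mistake
            learner.update(tokens, true_label)
    return mistakes, evaluated

samples = [
    (["free", "prize"], "spam"),
    (["lunch", "today"], "ham"),
    (["free", "cash"], "spam"),
    (["see", "you", "today"], "ham"),
    (["cash", "prize", "now"], "spam"),
]
mistakes, evaluated = online_evaluation(TokenVoteLearner(), samples)
```

Updating only when the loss is positive keeps the model unchanged on correct predictions, so the cost of learning is paid only for the messages the filter gets wrong.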

In Algorithm 1, we consider

Results

Table 3 shows the average SC, BH, and MCC obtained in 50 runs of the experimental scheme described in Algorithm 1. For each evaluated method and dataset, we present the results obtained with: (1) the original text samples (column “Orig.”), (2) the expanded text samples in which the best MCC score was obtained (column “Exp.”), and (3) the ensemble approach (column “Ens.”). The results are sorted by MCC.
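For reference, the three measures can be computed from a confusion matrix as in the sketch below. It assumes that SC and BH stand for spam caught (the fraction of spam correctly blocked) and blocked ham (the fraction of legitimate messages incorrectly blocked), which this excerpt does not spell out; MCC is the standard Matthews correlation coefficient. The function name and the convention of returning 0 for an empty denominator are choices made for the example.

```python
import math

def spam_metrics(tp, fp, tn, fn):
    """Return (SC, BH, MCC) with spam as the positive class.

    tp: spam blocked, fn: spam missed, fp: ham blocked, tn: ham delivered.
    """
    sc = tp / (tp + fn) if (tp + fn) else 0.0  # spam caught (assumed meaning of SC)
    bh = fp / (fp + tn) if (fp + tn) else 0.0  # blocked ham (assumed meaning of BH)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sc, bh, mcc

sc, bh, mcc = spam_metrics(tp=40, fp=5, tn=45, fn=10)
```

MCC ranges from -1 to 1 and accounts for all four cells of the confusion matrix, which makes it a sensible sorting key when the classes are imbalanced.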

Bold values indicate the best score for each one of the columns for each dataset. The

Conclusions

Spam has once again become a truly challenging problem. Besides being a classical type of adversarial classification problem, it increasingly demands online and dynamic prediction models. Due to the increasing popularity of smartphones, this plague is migrating fast to new means of electronic communication characterized by short text messages. In these environments, the text documents are usually very short and rife with slang, abbreviations, symbols, emoticons, and misspelled words

Acknowledgments

The authors are grateful for financial support from the Brazilian agencies FAPESP, Capes, and CNPq (grant 141089/2013-0).

References (57)

  • T.A. Almeida et al. (2016). Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering. Knowledge-Based Systems.
  • F. Assis et al. (2006). Exponential differential document count – a feature selection factor for improving Bayesian filters accuracy. Proceedings of the 2006 MIT spam conference (SP’06), Cambridge, MA, USA.
  • A. Bhattarai et al. (2011). A self-supervised approach to comment spam detection based on content analysis. International Journal of Information Security and Privacy.
  • J. Bi et al. (2008). A trust and reputation based anti-spim method. The 27th IEEE conference on computer communications (INFOCOM’08).
  • A. Bratko et al. (2006). Spam filtering using statistical data compression models. Journal of Machine Learning Research.
  • N. Cesa-Bianchi et al. (2005). A second-order perceptron algorithm. SIAM Journal on Computing.
  • V. Chaudhary et al. (2013). Contextual feature based one-class classifier approach for detecting video response spam on YouTube. Proceedings of the 11th annual international conference on privacy, security and trust (PST’13).
  • R. Chowdury et al. (2013). A data mining based spam detection system for YouTube. Proceedings of the 8th international conference on digital information management (ICDIM’13).
  • G.V. Cormack et al. (2007). Spam filtering for short messages. Proceedings of the 16th ACM international conference on information and knowledge management (CIKM’07).
  • K. Crammer et al. (2008). Exact convex confidence-weighted learning. Proceedings of the 21st international conference on neural information processing systems (NIPS’08).
  • K. Crammer et al. (2012). Confidence-weighted linear classification for text categorization. Journal of Machine Learning Research.
  • K. Crammer et al. (2013). Adaptive regularization of weight vectors. Machine Learning.
  • N. Dalvi et al. (2004). Adversarial classification. Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04).
  • J. Demšar (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
  • R.O. Duda et al. (2000). Pattern Classification.
  • Y. Freund et al. (1999). Large margin classification using the perceptron algorithm. Machine Learning.
  • C. Gentile (2002). A new approximate maximal margin classification algorithm. Journal of Machine Learning Research.
  • G. Goswami et al. Automated spam detection in short text messages
