Elsevier

Pattern Recognition Letters

Volume 33, Issue 3, 1 February 2012, Pages 364-369

Content-based mobile spam classification using stylistically motivated features

https://doi.org/10.1016/j.patrec.2011.10.017

Abstract

The feature of brevity in mobile phone messages makes it difficult to distinguish lexical patterns to identify spam. This paper proposes a novel approach to spam classification of extremely short messages using not only lexical features that reflect the content of a message but new stylistic features that indicate the manner in which the message is written. Experiments on two mobile phone message collections in two different languages show that the approach outperforms previous content-based approaches significantly, regardless of language.

Highlights

► We propose a mobile spam classification approach using a variety of new, stylistically motivated features. ► Stylistic features improve the performance of mobile spam classifiers. ► Stylistic features are effective on both English and Korean SMS datasets.

Introduction

Mobile spam, which most often refers to the abuse of the Short Message Service (SMS) to indiscriminately send out unsolicited (and unwanted) bulk messages, has progressively become a major issue since the early 2000s with the increasing popularity of mobile phones. Spam messages may be particularly irritating for many recipients because of the inconvenience that they cause, and also because of the fees that may apply for every message received, depending on the market; for example, fees apply for recipients in the United States. In order to reduce mobile spam, governments and many mobile service providers have taken various countermeasures (e.g. imposing substantial fines on spammers, blocking specific phone numbers, creating an alias address, etc.). Nevertheless, the mobile spam rate is still on the rise.

Recently, a content-based approach for spam classification, which had previously brought enormous success in detecting e-mail spam, has started to gain attention among mobile spam researchers. Gómez Hidalgo et al. (2006) explored the use of statistical learning-based classifiers that are trained with lexical features, such as character and word n-grams, for mobile spam classification; the approach is similar to that used for e-mail spam classification (Sahami et al., 1998, Cormack and Lynam, 2007). Unlike e-mails, however, mobile phone messages are often much shorter, because of the limited size of mobile devices. This brevity leaves each message with far less information for content-based spam classifiers, which makes the task very challenging. Therefore, subsequent studies on mobile spam classification focused on expanding the feature set for learning the classifiers with features additionally engineered from the message content, in order to take into account contextual information with relative word positions (Cormack et al., 2007a, Cormack et al., 2007b).

Collectively, the features used in earlier content-based approaches for mobile spam classification were topical terms and phrases that statistically indicated the spamness of a message, such as “% off sale,” “poker,” and “no pay”. However, there is no guarantee that legitimate messages would not contain such expressions. For example, an average person may send messages such as:

Danny, ABC mall is having a 56% off sale tomorrow!!

Jane, we are planning to play poker with no pay tomorrow night, got it?

Current content-based mobile spam filters may label such legitimate messages as spam messages. Such a misclassification can cause serious problems for instant mobile communication. We have thus decided to not only depend on the message content itself, but also to incorporate new features that indicate how the messages are written from a linguistic point of view. We propose the use of new features that reflect this “style”, or the manner in which the content is expressed, namely stylistic features.

In this paper, we introduce a number of stylistic features (Sohn et al., 2009) for learning mobile spam classifiers; most of them originate from authorship classification, which involves identifying the author of a given text. The features include readily extractable and countable information from message texts, such as word and sentence lengths (Mendenhall, 1887), function word counts (Mosteller and Wallace, 1984), part-of-speech tags (Argamon-Engelson et al., 1998), and syntactic information (Stamatatos et al., 2000). We then use a machine learning-based classifier to learn such features. We conduct an empirical evaluation that uses two real-world SMS test collections.
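As a rough illustration of how such countable style cues could be extracted from a message, the following Python sketch computes word and sentence lengths and a function word count. The function name `stylistic_features`, the tiny `FUNCTION_WORDS` list, and the regex-based tokenization are assumptions made for this example only; they are not the feature set or tooling used in this paper. Part-of-speech and syntactic features are omitted because they would require a tagger or parser.

```python
import re

# Hypothetical, deliberately tiny function-word list (assumption for illustration).
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "to", "of", "with", "we", "it", "no"}


def stylistic_features(message):
    """Return a dict of simple, countable style features for one message."""
    # Words: runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", message)
    # Sentences: split on runs of '.', '!' or '?' and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    n_words = len(words)
    n_sentences = len(sentences)
    return {
        "num_words": n_words,
        "num_sentences": n_sentences,
        # Average word length in characters (0.0 for an empty message).
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Average sentence length in words (0.0 if no sentence was found).
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        # Number of tokens that appear in the function-word list.
        "function_word_count": sum(1 for w in words if w.lower() in FUNCTION_WORDS),
    }


features = stylistic_features(
    "Jane, we are planning to play poker with no pay tomorrow night, got it?"
)
```

In a full system, a feature dictionary like this would be combined with the lexical features and passed to the classifier; how the features are binned or weighted is left open here.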

The remainder of this paper is organized as follows. Our mobile spam classification algorithm is briefly described in Section 2. In Section 3, we introduce the proposed stylistic features. In Sections 4 and 5, we report the setting of our experiments and discuss the results. Finally, we conclude the paper in Section 6.

Section snippets

Mobile spam classifier

In this paper, we take a supervised learning approach to mobile spam classification. In particular, we use the maximum entropy framework (Berger et al., 1996) to learn a classifier to test our approach to mobile spam classification. The maximum entropy framework is a strong learning model that has been commonly used in various text classification tasks including e-mail and mobile spam classification. The main advantage of the maximum entropy framework is that it is robust and statistically

Stylistic features for mobile spam classification

We propose the use of stylistic features to improve content-based mobile spam classifiers under the following assumptions:

  • There are two types of mobile phone message senders, namely spammers and non-spammers.

  • Spammers have distinctive linguistic styles and writing behaviors (as opposed to non-spammers) and use them consistently.

  • The SMS message, as an end product, carries the author’s “fingerprints”.

These assumptions are acceptable, because the purpose of most spammers is to advertise products or

Data

We use two sets of SMS messages in two different languages, English and Korean, for the evaluation of the proposed method.

The English dataset consists of 1,125 (67%) legitimate messages and 552 (33%) spam messages. This set was derived from freely available resources on the Web, as in the work of Gómez Hidalgo et al. (2006). The legitimate ones are messages randomly chosen from the SMS corpus built by the National University of Singapore, which contains roughly 10,000 legitimate

Overall performance

The ROC curves corresponding to all feature settings applied to the English and the Korean test collections are shown in Fig. 1 and Fig. 2, respectively. The 1-AUC (%) summary result of the comparison experiment for each feature setting is shown in Table 2.
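For readers unfamiliar with the 1-AUC (%) measure, it is the area above the ROC curve expressed as a percentage, so lower values are better. The sketch below computes it from classifier scores using the rank-based (Mann-Whitney) formulation of AUC. The function name `one_minus_auc_percent` and the toy scores and labels are invented for illustration; they are not results or code from this paper.

```python
def one_minus_auc_percent(scores, labels):
    """Area above the ROC curve, in percent (lower is better).

    scores: higher means "more spam-like".
    labels: 1 = spam, 0 = legitimate.
    """
    spam = [s for s, y in zip(scores, labels) if y == 1]
    ham = [s for s, y in zip(scores, labels) if y == 0]
    if not spam or not ham:
        raise ValueError("need at least one spam and one legitimate message")
    # AUC equals the probability that a randomly chosen spam message scores
    # higher than a randomly chosen legitimate one; ties count as half.
    wins = 0.0
    for s in spam:
        for h in ham:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    auc = wins / (len(spam) * len(ham))
    return (1.0 - auc) * 100.0


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
result = one_minus_auc_percent(toy_scores, toy_labels)
```

On the toy data, 8 of the 9 spam/legitimate pairs are ranked correctly, so the function returns about 11.1. The pairwise loop is quadratic in the number of messages, which is acceptable for a sketch but would normally be replaced by a sort-based computation on larger collections.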

On both the English and Korean test collections, the Style setting shows results comparable to the Baseline. This result is very surprising, because the Style setting does not use any of the lexical information that has played key roles in most

Conclusion

This paper focuses on the task of mobile spam classification, which involves distinguishing spam messages from legitimate messages. The main contribution of this paper is twofold:

  • We propose an approach to mobile spam classification using a variety of new, stylistically motivated features that do not incur high computational cost. It aims at improving the performance of content-based spam classifiers for briefly written SMS messages that carry relatively little lexical information.

  • We empirically

Acknowledgements

This work was supported by the 2nd Brain Korea 21 Project. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor.

References (20)

  • Argamon, S., Levitan, S., 2005. Measuring the usefulness of function words for authorship attribution. In: Proc. 2005...
  • Argamon-Engelson, S., Koppel, M., Avneri, G., 1998. Style-based text categorization: What newspaper am I reading. In:...
  • Berger, A.L., et al., 1996. A maximum entropy approach to natural language processing. Computational Linguistics.
  • Cormack, G., Lynam, T., 2005. Trec 2005 spam track overview. In: Proc. 14th Text Retrieval...
  • Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P., 2007a. Feature engineering for mobile (sms) spam filtering. In: Proc....
  • Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P., 2007b. Spam filtering for short messages. In: Proc. 16th ACM Conf....
  • Cormack, G.V., et al., 2007. Online supervised spam filter evaluation. ACM Trans. Inform. Syst.
  • Gómez Hidalgo, J.M., Bringas, G.C., Sánz, E.P., García, F.C., 2006. Content based sms spam filtering. In: Proc. 2006...
  • How, Y., Kan, M.-Y., 2005. Optimizing predictive text entry for short message service on mobile phones. In: Proc. 11th...
  • Joachims, T., 1999. Making large-scale support vector machine learning practical. Advances in kernel methods: support vector learning.
There are more references available in the full text version of this article.

1 Present address: 5th Fl., NHN Green Factory, 178-1, Jeongja-dong, Bundang-gu, Seongnam-si, Gyeonggi-do 463-867, South Korea.
