Content-based mobile spam classification using stylistically motivated features

doi:10.1016/j.patrec.2011.10.017

Pattern Recognition Letters

Volume 33, Issue 3, 1 February 2012, Pages 364-369

https://doi.org/10.1016/j.patrec.2011.10.017

Abstract

The brevity of mobile phone messages makes it difficult to distinguish lexical patterns that identify spam. This paper proposes a novel approach to spam classification of extremely short messages using not only lexical features that reflect the content of a message but also new stylistic features that indicate the manner in which the message is written. Experiments on two mobile phone message collections in two different languages show that the approach significantly outperforms previous content-based approaches, regardless of language.

Highlights

► We propose a mobile spam classification approach using a variety of new, stylistically motivated features. ► Stylistic features improve the performance of mobile spam classifiers. ► Stylistic features are effective on both English and Korean SMS datasets.

Introduction

Mobile spam, which most often refers to the abuse of the Short Message Service (SMS) to indiscriminately send out unsolicited (and unwanted) bulk messages, has progressively become a major issue since the early 2000s with the increasing popularity of mobile phones. Spam messages may be particularly irritating for many recipients because of the inconvenience that they cause, and also because, depending on the market, fees may apply for every message received; for example, fees apply to recipients in the United States. In order to reduce mobile spam, governments and many mobile service providers have taken various countermeasures (e.g., imposing substantial fines on spammers, blocking specific phone numbers, and creating alias addresses). Nevertheless, the mobile spam rate is still on the rise.

Recently, the content-based approach to spam classification, which had previously brought enormous success in detecting e-mail spam, has started to gain attention among mobile spam researchers. Gómez Hidalgo et al. (2006) explored the use of statistical learning-based classifiers trained with lexical features, such as character and word n-grams, for mobile spam classification, in a manner similar to e-mail spam classification (Sahami et al., 1998; Cormack and Lynam, 2007). Unlike e-mails, however, mobile phone messages are often much shorter, because of the limited size of mobile devices. This brevity leaves each message with far less information for content-based spam classifiers, which makes the task very challenging. Therefore, subsequent studies on mobile spam classification focused on expanding the feature set used to train the classifiers with additional features engineered from the message content, in order to take into account contextual information, including relative word positions (Cormack et al., 2007a; Cormack et al., 2007b).
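To make the lexical features mentioned above more concrete, the following is a minimal sketch of character and word n-gram counting. The function names, the lowercase whitespace tokenizer, and the values of n are illustrative assumptions, not the configuration used by Gómez Hidalgo et al. (2006).

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams using a lowercase whitespace tokenizer (an assumption)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the lowercased text, spaces included."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


# Hypothetical message, used only to show the shape of the output.
message = "Free entry win cash now"
print(word_ngrams(message, 2))
print(char_ngrams(message, 3).most_common(3))
```

A five-word message yields only four word bigrams, which illustrates how little lexical evidence a short message offers a classifier.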

Collectively, the features used in earlier content-based approaches for mobile spam classification were topical terms and phrases that statistically indicated the spamness of a message, such as “% off sale,” “poker,” and “no pay”. However, there is no guarantee that legitimate messages would not contain such expressions. For example, an average person may send messages such as:

“Danny, ABC mall is having a 56% off sale tomorrow!!”

“Jane, we are planning to play poker with no pay tomorrow night, got it?”

Current content-based mobile spam filters may label such legitimate messages as spam messages. Such a misclassification can cause serious problems for instant mobile communication. We have thus decided to not only depend on the message content itself, but also to incorporate new features that indicate how the messages are written from a linguistic point of view. We propose the use of new features that reflect this “style”, or the manner in which the content is expressed, namely stylistic features.

In this paper, we introduce a number of stylistic features (Sohn et al., 2009)² that originated mostly from authorship classification, which involves identifying the author of a given text, for learning mobile spam classifiers. The features include readily extractable and countable information from message texts, such as word and sentence lengths (Mendenhall, 1887), function word counts (Mosteller and Wallace, 1984), part-of-speech tags (Argamon-Engelson et al., 1998), and syntactic information (Stamatatos et al., 2000). We then use a machine learning-based classifier to learn such features. We conduct an empirical evaluation that uses two real-world SMS test collections.
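As a rough illustration of how such countable features might be extracted, the sketch below computes word and sentence lengths and function word counts for a single message. The function-word list, the regular expressions, and the feature names are assumptions made for this example; part-of-speech and syntactic features are omitted because they would require a tagger or parser.

```python
import re

# Tiny illustrative function-word list (an assumption; not the list used in the paper).
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to",
                  "in", "on", "for", "with", "is", "are"}


def stylistic_features(text):
    """Return simple countable style features for one message."""
    # Crude sentence split on runs of sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes; digits and symbols are skipped.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    n_sents = len(sentences)
    features = {
        "num_words": n_words,
        "num_sentences": n_sents,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
    }
    # One count feature per function word.
    for fw in sorted(FUNCTION_WORDS):
        features["fw_" + fw] = sum(1 for w in words if w.lower() == fw)
    return features


feats = stylistic_features("Danny, ABC mall is having a 56% off sale tomorrow!!")
print(feats["num_words"], feats["num_sentences"], feats["fw_a"], feats["fw_is"])
```

None of these features depends on which topical terms appear in the message, which is the point of describing them as stylistic rather than lexical.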

The remainder of this paper is organized as follows. Our mobile spam classification algorithm is briefly described in Section 2. In Section 3, we introduce the proposed stylistic features. In Sections 4 and 5, we report the setting of our experiments and discuss the results. Finally, we conclude the paper in Section 6.

Section snippets

Mobile spam classifier

In this paper, we take a supervised learning approach to mobile spam classification. In particular, we use the maximum entropy framework (Berger et al., 1996) to learn a classifier to test our approach to mobile spam classification. The maximum entropy framework is a strong learning model that has been commonly used in various text classification tasks including e-mail and mobile spam classification. The main advantage of the maximum entropy framework is that it is robust and statistically
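For a two-class problem such as spam versus legitimate, a conditional maximum entropy model has the same form as logistic regression. The sketch below trains such a model with plain stochastic gradient ascent; the training procedure, the hyperparameters, the toy data, and the function names are all assumptions for illustration, not the authors' implementation.

```python
import math


def predict_proba(weights, feats):
    """P(spam | message) for a feature dict under the given weights."""
    score = weights.get("__bias__", 0.0)
    score += sum(weights.get(name, 0.0) * value for name, value in feats.items())
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit a binary maximum entropy model by stochastic gradient ascent.

    examples: list of (feature_dict, label) pairs; label 1 = spam, 0 = legitimate.
    Returns a dict of feature weights, including a "__bias__" entry.
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            # Gradient of the log-likelihood with respect to the score.
            err = label - predict_proba(weights, feats)
            for name, value in list(feats.items()) + [("__bias__", 1.0)]:
                w = weights.get(name, 0.0)
                # Gradient step with a small L2 penalty on the active weights.
                weights[name] = w + lr * (err * value - l2 * w)
    return weights


# Hypothetical toy data: binary indicator features per message.
train = [
    ({"free": 1.0, "win": 1.0}, 1),
    ({"free": 1.0, "cash": 1.0}, 1),
    ({"lunch": 1.0, "tomorrow": 1.0}, 0),
    ({"meeting": 1.0, "tomorrow": 1.0}, 0),
]
weights = train_maxent(train)
print(round(predict_proba(weights, {"free": 1.0}), 3))
print(round(predict_proba(weights, {"tomorrow": 1.0}), 3))
```

Because the model only sees feature dictionaries, lexical and stylistic features can be merged into one dictionary per message without changing the learner.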

Stylistic features for mobile spam classification

We propose the use of stylistic features to improve content-based mobile spam classifiers under the following assumptions:

• There are two types of mobile phone message senders, namely spammers and non-spammers.
• Spammers have distinctive linguistic styles and writing behaviors (as opposed to non-spammers) and use them consistently.
• The SMS message, as an end product, carries the author’s “fingerprints”.

These assumptions are acceptable, because the purpose of most spammers is to advertise products or

Data

We use two sets of SMS messages in two different languages, English and Korean, for the evaluation of the proposed method.

The English dataset consists of 1,125 (67%) legitimate messages and 552 (33%) spam messages. This set was derived from freely available resources on the Web, similar to the work of Gómez Hidalgo et al. (2006). The legitimate ones are messages randomly chosen from the SMS corpus built by the National University of Singapore, which contains roughly 10,000 legitimate

Overall performance

The ROC curves corresponding to all feature settings applied to the English and the Korean test collections are shown in Fig. 1 and Fig. 2, respectively. The 1-AUC (%) summary results of the comparison experiment for each feature setting are shown in Table 2.
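For readers unfamiliar with the 1-AUC (%) measure, the following sketch computes it from classifier scores using the pairwise (Mann-Whitney) formulation of the area under the ROC curve. The function name and the toy labels and scores are assumptions for illustration; lower values indicate better ranking of spam above legitimate messages.

```python
def one_minus_auc_percent(labels, scores):
    """Return 1-AUC as a percentage; labels are 1 (spam) or 0 (legitimate)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one legitimate message")
    # AUC is the probability that a random spam message scores higher than a
    # random legitimate message, with ties counted as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 100.0 * (1.0 - auc)


# Hypothetical scores: one spam message (0.4) ranks below one legitimate message (0.5).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(one_minus_auc_percent(labels, scores))
```

The pairwise loop is quadratic in the number of messages, which is acceptable for collections of a few thousand messages; a sort-based rank computation would scale better.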

On both the English and Korean test collections, the Style setting shows results comparable to the Baseline. This result is very surprising, because the Style setting does not use any of the lexical information that has played key roles in most

Conclusion

This paper focuses on the task of mobile spam classification, which involves distinguishing spam messages from legitimate messages. The main contributions of this paper are twofold:

• We propose an approach to mobile spam classification using a variety of new, stylistically motivated features that do not require high computation cost. It aims at improving the performance of content-based spam classifiers for briefly written SMS messages that carry relatively little lexical information.
• We empirically

Acknowledgements

This work was supported by the 2nd Brain Korea 21 Project. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor.

References (20)

Argamon, S., Levitan, S., 2005. Measuring the usefulness of function words for authorship attribution. In: Proc. 2005...
Argamon-Engelson, S., Koppel, M., Avneri, G., 1998. Style-based text categorization: What newspaper am I reading? In:...
Berger, A.L., et al., 1996. A maximum entropy approach to natural language processing. Computational Linguistics.
Cormack, G., Lynam, T., 2005. TREC 2005 spam track overview. In: Proc. 14th Text Retrieval...
Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P., 2007a. Feature engineering for mobile (SMS) spam filtering. In: Proc....
Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P., 2007b. Spam filtering for short messages. In: Proc. 16th ACM Conf....
Cormack, G.V., et al., 2007. Online supervised spam filter evaluation. ACM Trans. Inform. Syst.
Gómez Hidalgo, J.M., Bringas, G.C., Sánz, E.P., García, F.C., 2006. Content based SMS spam filtering. In: Proc. 2006...
How, Y., Kan, M.-Y., 2005. Optimizing predictive text entry for short message service on mobile phones. In: Proc. 11th...
Joachims, T., 1999. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Learning.

There are more references available in the full text version of this article.


¹: Present address: 5th Fl., NHN Green Factory, 178-1, Jeongja-dong, Bundang-gu, Seongnam-si, Gyeonggi-do 463-867, South Korea.
