On the properties of spam-advertised URL addresses
Introduction
Spam e-mail refers to unsolicited e-mail messages that are sent by automated methods to millions of recipients (The Spamhaus Project, 2006). Spam messages can be annoying, offensive, or fraudulent, and they impose significant costs on their recipients in terms of wasted processing time, bandwidth, storage space, and lost productivity (Goodman et al., 2005). Typically, e-mail spammers seek financial profit through the promotion of products and services. Sending large amounts of e-mail is neither difficult nor expensive. However, spammers need to do more than that: because e-mail users protect themselves by installing filters that identify spam e-mails either by their source or by their content, spammers must invent new ways to hide their identity and avoid filter detection.
Currently, the majority of spam messages are encoded in the hypertext markup language (HTML). The use of HTML can improve the presentation of e-mail content on HTML-aware e-mail clients, making it more appealing to its recipients through the use of different fonts, colors, pictures, etc. HTML encoding also helps spam evade filters through a number of exploits collectively described as HTML-based obfuscation: using HTML, spammers can render their e-mail content undetectable by inserting invisible text with zero font size, by splitting up the e-mail content inside HTML tables, and so on. Hence, spammers manage to alter the lexical patterns that text-based spam filters detect, while keeping intact the information they wish to present to e-mail recipients. Last, but not least, spammers adopt HTML encoding to entice e-mail recipients into visiting web sites that are advertised through spam. To this end, HTML-encoded spam e-mails carry URL addresses hidden behind “call-to-action” text or image anchors. E-mail recipients who read their e-mail on web-enabled e-mail clients are lured into clicking on these URLs, which lead to spam-advertised web sites and services.
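To illustrate how URLs hidden behind call-to-action anchors can be recovered from an HTML-encoded message, the following minimal Python sketch pairs each link target with its visible anchor text. It is not the tool used in this study: the names AnchorExtractor and extract_anchors are hypothetical, the "[image]" marker is an arbitrary convention, and only the standard-library html.parser module is assumed.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, visible anchor text) pairs from an HTML e-mail body.

    Hypothetical helper for illustration; nested anchors are not handled.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # An <a> without an href is ignored (its href stays None).
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "img" and self._href is not None:
            # Image anchors carry no text, so record a marker instead.
            self._text.append("[image]")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(t for t in self._text if t)
            self.links.append((self._href, text))
            self._href = None


def extract_anchors(html_body):
    """Return the list of (href, anchor text) pairs found in html_body."""
    parser = AnchorExtractor()
    parser.feed(html_body)
    parser.close()
    return parser.links
```

Comparing the anchor text with the link target is one simple way to see how a message presents a URL to its reader; a fuller analysis would also need to handle malformed markup and the obfuscation tricks described above.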
Arguably, e-mail spam now serves as the cheapest and easiest mechanism for disseminating spam-advertised URLs and their respective web sites to millions of Internet users. Consequently, the advertisement of these web sites can be considered the root cause of the e-mail spam problem. Nevertheless, little attention has been given to the properties and characteristics of the URLs contained in spam messages, although such an investigation could lead to a better understanding of the spam problem. For example, it would be interesting to estimate the lifetime of spam-advertised sites and the recurrence of URL advertisements, and to identify any distinct characteristics of URLs circulated through spam e-mails: are they short and mnemonic or long and cryptic? Do they refer to static or dynamic content? Do they point directly to the advertised content or hide behind redirection? Furthermore, it would be interesting to investigate statistics that relate spam messages to the URLs they contain: for instance, the average number of URLs found inside spam messages, the co-existence of trustworthy or random URLs inside the spam messages, and so on.
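Some of these questions can be approximated from the URL string alone. The sketch below is a hypothetical illustration, not the method of this study: the function name url_features and the chosen features are assumptions, and treating a query string as a sign of dynamic content is only a rough heuristic. Detecting redirection would require fetching the URL and is left out.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Return simple lexical features of a URL (illustrative heuristics only)."""
    parts = urlsplit(url)
    # Non-empty path segments, e.g. "/a/b/page.php" -> ["a", "b", "page.php"]
    segments = [s for s in parts.path.split("/") if s]
    return {
        "length": len(url),                 # short and mnemonic vs long and cryptic
        "host": parts.hostname or "",       # lower-cased host name, "" if absent
        "has_query": bool(parts.query),     # rough hint of dynamic content
        "path_depth": len(segments),
    }
```

For example, url_features("http://Example.com/a/b/page.php?id=7") reports the host "example.com", a path depth of 3, and has_query set to True.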
This information could help us understand the mechanisms that spammers use to advertise web sites and the tricks they employ to avoid spam-filter detection or prosecution by legal authorities. Moreover, it could expose distinctive features of spam messages. Such an understanding could prove useful to policy makers who seek effective strategies to regulate e-mail spam (Moustakas et al., 2005), to spam-filter developers who are looking to extend the coverage of spam filters (Albrecht et al., 2005), and to researchers who look for improved ways to fight spam (Androutsopoulos et al., 2005; Li and Hsieh, 2006; Nelson et al., 2006).
In this paper, we present a characterization study that focuses on the characteristics of HTML-encoded spam e-mails and the URLs disseminated through such messages. To this end, we analyze four sets of spam messages. To the best of our knowledge, this is the only study so far that focuses on the properties of URL addresses advertised through spam. The remainder of this paper is organized as follows: Section 2 presents a brief overview of the problem of e-mail spam and discusses related work. In Section 3, we present a system that we built to analyze HTML-encoded spam e-mails and extract properties of the URLs carried inside those messages; we also describe the spam archives used in our study. In Section 4, we present our statistical analysis. A summary of our findings is given in Section 5. Finally, Section 6 presents our main conclusions and suggestions for future work.
E-mail spamming
Despite strong efforts to regulate and eventually eliminate spam, during the last few years the volume of spam messages has been increasing continuously. In June 2003, BrightMail reported that 49% of all e-mail was spam. In May 2004, this figure had increased to 64%, whereas, according to Postini, 75–80% of all e-mail was spam in 2004. Recent estimates suggest that currently 81% of e-mail traffic is spam. Due to its volume, spam not only is annoying to individual e-mail users, but also incurs a
Spam data sets
To derive the characteristics of spam messages and of spam-advertised URLs, we need access to a large corpus of spam e-mails. To this end, we collected e-mails from two main sources: (i) personal spam folders, contributed by users of the University of Cyprus e-mail relay server. This server is protected by the SpamAssassin filter (The Apache SpamAssassin Project), which filters all incoming messages, computes a “SPAM score,” and labels e-mail messages as spam when their score is higher
Data set statistics
We used SPAT to isolate the spam messages that carry URLs. We found that over 73% of the e-mails contained in the four data sets carry URLs, and that these e-mails contain a total of 1,048,040 URLs, 340,308 of which are distinct and belong to 166,324 distinct domains. A breakdown of these figures for each data set is given in Table 2. In the following sections, we focus on the subset of spam e-mails that carry URLs in their body.
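Counts of this kind (total URLs, distinct URLs, distinct domains) can be computed with a few set operations. The sketch below is a hypothetical illustration and not the SPAT implementation: the function name url_statistics is an assumption, and the lower-cased host name is used as a stand-in for "domain", which may differ from the definition used in this study.

```python
from urllib.parse import urlsplit


def url_statistics(urls):
    """Summarize a list of URL strings extracted from a spam corpus."""
    distinct_urls = set(urls)
    # Assumption: approximate the "domain" by the lower-cased host name.
    hosts = {urlsplit(u).hostname for u in distinct_urls}
    hosts.discard(None)  # strings without a parsable host are skipped
    return {
        "total": len(urls),
        "distinct_urls": len(distinct_urls),
        "distinct_hosts": len(hosts),
    }
```

Taken at face value, the figures reported above imply that each distinct URL appears about 3.1 times on average (1,048,040 / 340,308) and that each domain accounts for about two distinct URLs (340,308 / 166,324).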
In Table 3, we present statistics about the
Summary of findings
In this work we examined four logs of spam e-mails, focusing on the characteristics of the URL addresses found inside most of these messages. From this analysis, we observe that:
The large majority of spam messages are encoded in HTML and/or carry URL addresses in their bodies. In the data sets examined, 73–90% of spam messages carry at least one URL. It seems that the dissemination of URL addresses to e-mail recipients is the main driver behind e-mail spam. This trend is expected to continue,
Conclusions
The distribution of URL addresses to e-mail recipients is becoming the root cause behind the existence and the expansion of spam e-mail. URL addresses are found in all kinds of spam, from promotional to fraudulent (phishing and Web spamming). Therefore, spamming is essentially an incessant campaign that advertises URL addresses at a massive scale and at minimum cost for the advertisers (spammers) and those advertised. There are, however, notable differences between spamming and conventional
References (43)
- et al. Tightening the net: a review of current and next generation spam filtering tools. Computers & Security (2006)
- et al. An anti-spam scheme using pre-challenges. Computer Communications (2006)
- Abadi M, Birrell A, Burrows M, Dabek F, Wobber T. Bankable postage for network services. In: Advances in computing...
- Albrecht K, Burri N, Wattenhofer R. Spamato: an extendable spam filter system. In: Proceedings of the second conference...
- Androutsopoulos I, Magirou EF, Vassilakis DK. A game theoretic model of spam e-mailing. In: Proceedings of the second...
- The CAN-SPAM Act: an inadequate attempt to deal with a growing problem. Columbia Journal of Law and Social Problems (2006)
- Available online at:...
- Berners-Lee T, Fielding RT, Masinter L. Uniform resource identifier (URI): generic syntax. Internet draft, September...
- Brightmail inc....
- de Freitas S, Levene M. Spam on the Internet: is it here to stay or can it be eradicated? Technical Report, Joint...