On the properties of spam-advertised URL addresses

https://doi.org/10.1016/j.jnca.2007.01.003

Abstract

The main purpose of most spam e-mail messages distributed on the Internet today is to entice recipients into visiting World Wide Web pages that are advertised through spam. In essence, e-mail spamming is a campaign that advertises URL addresses at a massive scale and at minimum cost for the advertisers and those advertised. Nevertheless, the characteristics of URL addresses and of web sites advertised through spam have not been studied extensively. In this paper, we investigate the properties of URL dissemination through spam e-mail and the characteristics of the URL addresses disseminated in this way. We conclude that spammers advertise URL addresses non-repetitively and that spam-advertised URLs are short-lived, elusive, and therefore hard to detect and filter. We also observe that reputable URL addresses are sometimes used as decoys against e-mail users and spam filters. These observations can inform the configuration of spam filters and drive the development of new techniques to fight spam.

Introduction

Spam e-mail refers to unsolicited e-mail messages that are sent with automated methods to millions of recipients (The Spamhaus Project, 2006). Spam messages can be annoying, offensive, or fraudulent, and they incur significant costs for their recipients in terms of wasted processing time, bandwidth, storage space and lost productivity (Goodman et al., 2005). Typically, e-mail spammers seek financial profit through the promotion of products and services. Sending large amounts of e-mail is neither difficult nor expensive. However, spammers need to do more than that: as e-mail users try to protect themselves by installing filters that identify spam e-mails either by their source or by their content, spammers need to invent new ways to hide their identity and avoid filter detection.

Currently, the majority of spam messages are encoded in the hypertext markup language (HTML). The use of HTML can improve the presentation of e-mail content on HTML-aware e-mail clients, making it more appealing to its recipients thanks to the use of different fonts, colors, pictures, etc. HTML encoding also helps spam evade filters through a number of exploits described as HTML-based obfuscation: using HTML, spammers can render their e-mail content undetectable by inserting invisible text with zero font size, by splitting up the e-mail content inside HTML tables, and so on. Hence, spammers manage to alter the lexical patterns that are detectable by text-based spam filters while keeping intact the information they wish to present to e-mail recipients. Last, but not least, spammers adopt HTML encoding to entice e-mail recipients into visiting web sites that are advertised through spam. To this end, HTML-encoded spam e-mails carry URL addresses hidden behind “call-to-action” text or image anchors. E-mail recipients are lured into clicking on these URLs when reading their e-mail on web-enabled e-mail clients, in order to reach spam-advertised web sites and services.
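
The way a URL hides behind a text or image anchor can be illustrated with a minimal sketch. This is not the tool used in the paper; it relies only on Python's standard html.parser module, and the names AnchorCollector and extract_anchors, as well as the "[image]" placeholder, are hypothetical choices made for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs from <a> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently being read, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs with lowercased names
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "img" and self._href is not None:
            # An image used as the clickable anchor; the placeholder is our own
            self._text.append("[image]")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html_body):
    """Return the (href, anchor text) pairs found in an HTML message body."""
    parser = AnchorCollector()
    parser.feed(html_body)
    parser.close()
    return parser.links
```

Comparing each extracted URL with the text the recipient actually sees makes the "call-to-action" pattern visible: the anchor text rarely reveals the destination.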

Arguably, e-mail spam serves nowadays as the cheapest and easiest mechanism for disseminating spam-advertised URLs and their respective web sites to millions of Internet users. Consequently, the advertisement of these web sites can be considered as the root cause behind the problem of e-mail spam. Nevertheless, little attention has been given to the properties and the characteristics of URLs contained inside spam messages, although such an investigation could lead to a better understanding of the spam problem. For example, it would be interesting to estimate the lifetime of spam-advertised sites and the recurrence of URL advertisements; also, to identify any distinct characteristics of URLs circulated through spam e-mails: are they short and mnemonic or long and cryptic? Do they refer to static or dynamic content? Do they point directly to the advertised content or hide behind redirection? Furthermore, it would be interesting to investigate the statistics that spam messages have vis-à-vis the URLs contained therein: for instance, the average number of URLs found inside spam messages, the co-existence of trustworthy or random URLs inside the spam messages, and so on.

This information could help us understand the mechanisms that spammers use to advertise web sites and the tricks they employ to avoid spam-filter detection or prosecution by legal authorities. Moreover, it could expose distinctive features of spam messages. Such an understanding could prove useful to policy makers who seek effective strategies to regulate e-mail spam (Moustakas et al., 2005), to spam-filter developers who are looking into extending the coverage of spam filters (Albrecht et al., 2005), and to researchers who look for improved ways to fight spam (Androutsopoulos et al., 2005, Li and Hsieh, 2006, Nelson et al., 2006).

In this paper, we present a characterization study that focuses on the characteristics of HTML-encoded spam e-mails and the URLs disseminated through such messages. To this end, we analyze four sets of spam messages. To the best of our knowledge, this is the only study so far that focuses on the properties of URL addresses advertised through spam. The remainder of this paper is organized as follows: Section 2 presents a brief overview of the problem of e-mail spam and discusses related work. In Section 3, we present a system that we built to analyze HTML-encoded spam e-mails and extract properties of the URLs carried inside those messages; we also describe the spam archives used in our study. In Section 4, we present our statistical analysis. A summary of our findings is given in Section 5. Finally, Section 6 presents our main conclusions and suggestions for future work.

E-mail spamming

Despite strong efforts to regulate and eventually eliminate spam, the volume of spam messages has been increasing continuously during the last few years. In June 2003, BrightMail reported that 49% of all e-mail was spam. In May 2004, this figure had increased to 64%, whereas, according to Postini, 75–80% of all e-mail was spam in 2004. Recent estimates suggest that currently 81% of e-mail traffic is spam. Due to its volume, spam is not only annoying to individual e-mail users, but also incurs a

Spam data sets

To derive the characteristics of spam messages and of spam-advertised URLs, we need to have access to a large corpus of spam e-mails. To this end, we collected e-mails from two main sources: (i) personal spam folders, contributed by users of the University of Cyprus e-mail relay server. This server is protected by the SpamAssassin filter (The Apache SpamAssassin Project), which filters all incoming messages, computes a “SPAM score,” and labels e-mail messages as spam when their score is higher

Data set statistics

We used SPAT to isolate the spam messages that carry URLs. We found that over 73% of the e-mails contained in the four data sets carry URLs, and that these e-mails contain a total of 1,048,040 URLs, 340,308 of which are distinct and belong to 166,324 distinct domains. A breakdown of these figures for each data set is given in Table 2. In the following sections, we focus on the subset of spam e-mails that carry URLs in their body.
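
The three counts reported above (total URLs, distinct URLs, distinct domains) could be computed along the following lines. This is a minimal sketch, not SPAT itself: the function name url_statistics is hypothetical, and "domain" is approximated here by the host name returned by urllib.parse.urlsplit, which may differ from the definition used in the study.

```python
from urllib.parse import urlsplit


def url_statistics(urls):
    """Return (total URLs, distinct URLs, distinct hosts) for an iterable of URL strings."""
    urls = list(urls)
    distinct_urls = set(urls)          # exact string match; no normalization is applied
    hosts = set()
    for url in distinct_urls:
        host = urlsplit(url).hostname  # lowercased host, or None if the URL has none
        if host:
            hosts.add(host)
    return len(urls), len(distinct_urls), len(hosts)
```

Treating URLs as distinct only when their strings differ is an assumption of this sketch; a stricter analysis might normalize case, default ports, or trailing slashes before counting.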

In Table 3, we present statistics about the

Summary of findings

In this work we examined four logs of spam e-mails, focusing on the characteristics of the URL addresses found inside most of these messages. From this analysis, we observe that:

  • The large majority of spam messages are encoded in HTML and/or carry URL addresses in their bodies. In the data sets examined, 73–90% of spam messages carry at least one URL. It seems that the dissemination of URL addresses to e-mail recipients is the main driver behind e-mail spam. This trend is expected to continue,

Conclusions

The distribution of URL addresses to e-mail recipients is becoming the root cause behind the existence and the expansion of spam e-mail. URL addresses are found in all kinds of spam, from promotional to fraudulent (phishing and Web spamming). Therefore, spamming is essentially an incessant campaign that advertises URL addresses at a massive scale and at minimum cost for the advertisers (spammers) and those advertised. There are, however, notable differences between spamming and conventional

References (43)

  • Carpinter J, et al. Tightening the net: a review of current and next generation spam filtering tools. Computers & Security; 2006.
  • Roman R, et al. An anti-spam scheme using pre-challenges. Computer Communications; 2006.
  • Abadi M, Birrell A, Burrows M, Dabek F, Wobber T. Bankable postage for network services. In: Advances in computing...
  • Albrecht K, Burri N, Wattenhofer R. Spamato: an extendable spam filter system. In: Proceedings of the second conference...
  • Androutsopoulos I, Magirou EF, Vassilakis DK. A game theoretic model of spam e-mailing. In: Proceedings of the second...
  • Arora V. The CAN-SPAM Act: an inadequate attempt to deal with a growing problem. Columbia Journal of Law and Social Problems; 2006.
  • Available online at:...
  • Berners-Lee T, Fielding RT, Masinter L. Uniform resource identifier (URI): generic syntax. Internet draft, September...
  • Brightmail inc....
  • de Freitas S, Levene M. Spam on the Internet: is it here to stay or can it be eradicated? Technical Report, Joint...
  • Drake CE, Oliver JJ, Koontz EJ. Anatomy of a phishing email. In: Proceedings of the first conference on email and...
  • Gomes LH, Cazita C, Almeida JM, Almeida V, Meira Jr W. Characterizing a spam traffic. In: Proceedings of the fourth ACM...
  • Gomes LH, Almeida RB, Bettencourt LMA, Almeida V, Almeida JM. Comparative graph theoretical characterization of...
  • Goodman J, Heckerman D, Rounthwaite R. Stopping spam. Scientific American, April...
  • Graham-Cumming J. The Spammers’ compendium. Spam conference; 2003. Available online at:...
  • Hird S. Technical solutions for controlling spam. In: Proceedings of the annual technical conference of the Australian...
  • Hulten G, Penta A, Seshadrinathan G, Mishra M. Trends in spam products and methods. In: Proceedings of the first...
  • Ioannidis J. Fighting spam by encapsulating policy in email addresses. In: Proceedings of the network and distributed...
  • Iwanaga M, Tabata T, Sakurai K. Evaluation of anti-spam method combining Bayesian filtering and strong challenge and...
  • Klensin J. Simple mail transfer protocol; 2001. IETF, RFC 2821....
  • Koutsioupis C. Anti-spam at the University of Cyprus. Available online at:...
Cited by (15)

    • Extended DMTP: A new protocol for improved graylist categorization

      2014, Computers and Security
      Citation Excerpt :

      Therefore, the research of anti-spam has become one of the most important areas, which affects the development of internet. A large number of anti-spam measures have been proposed, including numerous e-mail spam filters (Guillermo, 2006; Thiago and Walmir, 2009; Hu et al., 2010; Eleni et al., 2008; James and Ray, 2006; Isaac et al., 2009; Almeidal and Yamakami, 2010), sender authentication schemes (Sajad and Ali, 2007; Peng et al., 2009) and sender-discouragement mechanisms (http://www.hashcash.org/). Some of these measurements have been applied to current e-mail delivery system.

    • Social network analysis of web links to eliminate false positives in collaborative anti-spam systems

      2011, Journal of Network and Computer Applications
      Citation Excerpt :

      The proportion of email that is spam has significantly increased in recent years (Boykin and Roychowdhury, 2005; Goodman et al., 2007). Anti-spam systems filter incoming email either at the level of the email server or at the level of the email client program (Georgiou et al., 2008). These filtering techniques are generally judged according to the dual metrics of false positives and false negatives.

    • A scalable intelligent non-content-based spam-filtering framework

      2010, Expert Systems with Applications
      Citation Excerpt :

      Although some of the studies on content-based methods have been fruitful, judging spam by analyzing text may cause some challenging problems. For instance, Georgiou, Dikaiakos, and Stassopoulou (2008) suggested that the widely used HTML technology creates troubles for content-based spam filters. To try to avoid these problems, components other than the contents of email messages have been carefully studied, and non-content-based approaches have been proposed.
