Malicious web content detection by machine learning

https://doi.org/10.1016/j.eswa.2009.05.023

Abstract

The recent development of dynamic HTML gives attackers a new and powerful technique for compromising computer systems. Malicious dynamic HTML code is usually embedded in a normal webpage and infects the victim's system when the user browses the page. Furthermore, such DHTML code can easily disguise itself through obfuscation or transformation, which makes detection even harder. Anti-virus software packages commonly use signature-based approaches, which may fail to identify camouflaged malicious HTML code efficiently. This paper therefore proposes a malicious webpage detection method based on machine learning. Our study systematically analyzes the characteristics of malicious web pages and presents important features for machine learning. Experimental results demonstrate that our method is resilient to code obfuscation and can correctly determine whether a webpage is malicious.

Introduction

As Internet services increasingly prevail, more and more applications are deployed on web sites that can be accessed directly via web browsers over the network. To achieve interaction between the user and the service, the content provided by a web application usually embeds executable code, written in browser-supported scripting languages such as JavaScript or VBScript. These scripting techniques make services easier to use, but they also give attackers a new tool. In overall network operations, the user's personal computer is often the weakest security point because personal systems are frequently neither patched nor updated. By exploiting security vulnerabilities in web browsers, an attacker need only inject malicious code into a webpage, and a victim who visits the page will be compromised. Although malicious DHTML code has been a concern for years, the problem remains unsolved and the threat keeps growing. The rising number of security incidents and the resulting economic losses warn us that malicious DHTML code has become a great challenge not only for individuals but also for enterprises and governments in the Internet era (McGraw & Morrisett, 2000).

To this day, common users still depend on anti-virus software to detect DHTML-code attacks. Most anti-virus packages focus on detecting binary executables, and their detection mechanism is based on pattern or signature matching (Christodorescu & Jha, 2004). Their efficacy therefore relies mostly on the update frequency of the signature database. However, transforming a DHTML code is much easier than transforming a binary executable, so signatures are always updated more slowly than malicious DHTML codes are transformed. This makes signature-based techniques ineffective against variants of malicious codes. According to the experiment in Christodorescu and Jha (2004), which measured the tolerance of anti-virus software against commonly used obfuscations, the average false-negative rate of anti-virus packages ranges from 40% to 80%, and one product even had a 0% detection rate. Therefore, we need a detection method other than signature matching to detect malicious DHTML codes.
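The weakness of signature matching against script transformation can be seen in a minimal sketch (not drawn from the paper; the signature string and the split-string trick are illustrative assumptions): an exact byte-level signature catches the original script but misses a trivially obfuscated variant with identical behavior.

```python
# Illustrative only: a toy byte-level signature matcher, showing why
# exact matching fails against a trivially transformed DHTML variant.
SIGNATURE = 'document.write("<iframe src=\'http://evil.example/\'>")'

original = 'document.write("<iframe src=\'http://evil.example/\'>")'

# Same runtime behavior, but the string is split by concatenation,
# so the exact signature never appears in the page source.
obfuscated = 'document.write("<ifr" + "ame src=\'http://evil" + ".example/\'>")'

def signature_match(page: str) -> bool:
    """Flag the page as malicious iff the exact signature appears."""
    return SIGNATURE in page

detected_original = signature_match(original)      # True
detected_obfuscated = signature_match(obfuscated)  # False: variant missed
```

A defender must add a new signature for every such variant, while the attacker can generate variants mechanically; this asymmetry is what motivates a learning-based detector.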

There has been much related research on malware in recent years (Bergeron et al., 2001; Christodorescu & Jha, 2004; Christodorescu et al., 2005; Kinder et al., 2005; Kolter & Maloof, 2004), most of it focused on binary executables. Although the techniques used in those studies can be applied to the detection of DHTML codes, they need essential modifications to account for the characteristics of DHTML code. The features and obfuscations of an executable program differ greatly from those of a DHTML code. The distinct features of DHTML code are: (1) the code is pure text, (2) web pages may have multiple layers of links to remote pages, and (3) obfuscation is easy via garbage insertion, code reordering, data encapsulation, and other techniques. In general, writing a variant of a DHTML code is much easier and faster than writing a binary executable. For that reason, we need a different mechanism for detecting pure-text malicious DHTML codes.
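Garbage insertion, the first of the obfuscation techniques above, can be sketched in a few lines (the payload strings and junk statements are hypothetical; this is only meant to show how cheaply pure-text variants are produced):

```python
import random

# A toy sketch of garbage insertion: no-op statements are interleaved
# with the payload. Behavior is unchanged, but the byte sequence of the
# script (and hence any whole-script signature) differs on every run.
payload = [
    'var u = "http://evil.example/";',
    'var f = document.createElement("iframe");',
]

def garbage_insertion(lines):
    junk = ['var _pad%d = %d;' % (i, i) for i in range(3)]  # no-op filler
    out = []
    for line in lines:
        out.append(random.choice(junk))  # junk before each payload line
        out.append(line)
    return out

variant = garbage_insertion(payload)
# Every variant still contains all payload lines, in order.
```

Code reordering and data encapsulation work the same way: because the code is pure text, each transformation is a cheap string rewrite rather than a recompilation.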

This paper proposes a malicious-DHTML detection method based on machine learning. We explore and analyze malicious dynamic web pages to identify important features for machine learning. Using those chosen features, the classifier can efficiently judge a webpage while remaining resilient against the obfuscation of malicious web pages.

Section snippets

Related work

Malicious web content poses a significant threat to personal computer systems and has become an important rising issue. Moshchuk, Bragin, Gribble, and Levy (2006) examined the spyware problem from the Internet perspective. They used a crawler to perform a large-scale longitudinal study of the Web. Their results show that the density of spyware on the Web is substantial: on average, 1 out of 62 domains contained at least one scripted drive-by-download attack. Provos, McNamee, Mavrommatis, Wang, and

Overview

We use a machine learning approach to detect malicious web pages. A classifier is used to distinguish malicious pages from benign ones. We collected web pages from the Internet as training data for the classifier. The data are processed through a feature extraction engine to obtain the features for the classifier. The framework is shown in Fig. 1. We describe each part in detail in the following sections.
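The feature extraction engine can be sketched as follows. The specific features here (script count, script length, suspicious-keyword hits, longest digit run) are illustrative assumptions standing in for the paper's actual feature set, which is described later:

```python
import re

# Illustrative feature extraction for a webpage; the chosen features
# are assumptions, not the paper's actual feature set.
SUSPICIOUS_KEYWORDS = ['eval', 'unescape', 'document.write', 'fromCharCode']

def extract_features(page_html: str) -> dict:
    # Pull the bodies of all <script> blocks out of the page text.
    scripts = re.findall(r'<script[^>]*>(.*?)</script>',
                         page_html, re.DOTALL | re.IGNORECASE)
    body = ''.join(scripts)
    return {
        'script_count': len(scripts),
        'script_length': len(body),
        'keyword_hits': sum(body.count(k) for k in SUSPICIOUS_KEYWORDS),
        # long numeric runs often indicate encoded shellcode or payloads
        'max_digit_run': max((len(m) for m in re.findall(r'\d+', body)),
                             default=0),
    }

page = '<html><script>eval(unescape("%41%42"))</script></html>'
features = extract_features(page)
```

Each page is thereby mapped to a fixed-length numeric vector, which is the form the downstream classifier consumes.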

Dataset

We collected benign and malicious DHTML web pages from web sites based on the malicious-URL list published by StopBadWare. In total, we had 965 benign samples and 176 malicious samples. To test the ability to detect variants of attacks, we manually categorized the malicious samples into nine pre-defined types (listed in Appendix A) according to the techniques used by the attackers. Each category contains malicious HTML codes and their variants.

Comparison of features

We describe the experiment

Conclusions

In this paper, we propose a malicious-webpage detection method based on machine learning. We analyze the characteristics of malicious web pages and present relevant features for machine learning. The chosen features not only represent a malicious webpage effectively but also remain resilient against the obfuscation of malicious DHTML codes. We compared four classification algorithms, and the boosted decision tree performed best among them. Experimental results demonstrate
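The boosting idea behind the best-performing classifier can be illustrated with a from-scratch toy: AdaBoost over decision stumps (this is a generic sketch in the spirit of a boosted decision tree, not the authors' implementation; the feature vectors and labels are made up):

```python
import math

# Toy dataset: (feature_vector, label); +1 = malicious, -1 = benign.
# Feature 0 might be keyword hits, feature 1 script length (assumptions).
data = [
    ([3, 120], +1), ([5, 400], +1), ([0, 10], -1),
    ([1, 30], -1), ([4, 250], +1), ([0, 5], -1),
]

def stump_predict(stump, x):
    f, thresh, sign = stump
    return sign if x[f] > thresh else -sign

def train_adaboost(data, rounds=5):
    n = len(data)
    w = [1.0 / n] * n          # uniform sample weights to start
    ensemble = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for f in range(2):
            for thresh in sorted({x[f] for x, _ in data}):
                for sign in (+1, -1):
                    err = sum(wi for wi, (x, y) in zip(w, data)
                              if stump_predict((f, thresh, sign), x) != y)
                    if best is None or err < best[0]:
                        best = (err, (f, thresh, sign))
        err, stump = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        # Reweight: boost the samples this stump misclassified.
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def classify(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return +1 if score > 0 else -1

model = train_adaboost(data)
```

Each weak stump splits on one feature; boosting reweights the hard samples so later stumps focus on them, and the weighted vote of all stumps yields the final malicious/benign decision.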

References (14)

  • Bergeron, J., Debbabi, M., Desharnais, J., Erhioui, M. M., Lavoie, Y., & Tawbi, N. (2001). Static detection of malicious...
  • Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of...
  • Christodorescu, M., & Jha, S. (2004). Testing malware detectors. In Proceedings of the ACM SIGSOFT international...
  • Christodorescu, M., Jha, S., Seshia, S. A., Song, D., & Bryant, R. E. (2005). Semantics-aware malware detection. In...
  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the thirteenth...
  • Kinder, J., Katzenbeisser, S., Schallhart, C., & Veith, H. (2005). Detecting malicious code by model checking. In...
  • Kolter, J. Z., & Maloof, M. A. (2004). Learning to detect malicious executables in the wild. In Proceedings of the...
There are more references available in the full text version of this article.
