Malicious web content detection by machine learning

https://doi.org/10.1016/j.eswa.2009.05.023

Abstract

The recent development of dynamic HTML gives attackers a new and powerful technique for compromising computer systems. Malicious dynamic HTML code is usually embedded in a normal webpage and infects the victim's system when the user browses the page. Furthermore, such DHTML code can easily disguise itself through obfuscation or transformation, which makes detection even harder. Anti-virus software packages commonly use signature-based approaches, which may fail to identify camouflaged malicious HTML code efficiently. This paper therefore proposes a malicious webpage detection method based on machine learning. Our study systematically analyzes the characteristics of malicious web pages and presents important features for machine learning. Experimental results demonstrate that our method is resilient to code obfuscation and can correctly determine whether a webpage is malicious.

Introduction

As Internet services increasingly prevail, more and more applications are deployed on web sites that can be accessed directly via web browsers over the network. To achieve interaction between the user and the service, the content provided by a web application usually embeds executable code, written in browser-supported scripting languages such as JavaScript or VBScript. These scripting techniques make services easier to use, but they also give attackers a new tool. In overall network operations, the user's personal computer is often the weakest security point because personal systems are frequently neither patched nor updated. By exploiting security vulnerabilities in web browsers, an attacker need only inject malicious code into a webpage, and a victim who visits the page will be compromised. Although malicious DHTML code has been a concern for years, the problem remains unsolved and the threat keeps growing. The rising number of security incidents and the resulting economic losses warn us that malicious DHTML code has become a great challenge not only for individuals but also for enterprises and governments in the Internet era (McGraw & Morrisett, 2000).

To this day, common users still depend on anti-virus software to detect DHTML-code attacks. Most anti-virus packages focus on detecting binary executables, and their detection mechanism is based on pattern or signature matching (Christodorescu & Jha, 2004). Their efficacy therefore relies mostly on the update frequency of the signature database. However, transforming a DHTML code is much easier than transforming a binary executable, so signatures are always updated more slowly than malicious DHTML codes are transformed. This makes signature-based techniques ineffective against variants of malicious codes. According to the experiment in Christodorescu and Jha (2004), which measured the tolerance of anti-virus software against commonly used obfuscations, the average false-negative rate of anti-virus packages ranges from 40% to 80%, and one product even had a 0% detection rate. Therefore, we need a detection method other than signature matching to detect malicious DHTML codes.
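The weakness of signature matching against script transformation can be seen in a minimal sketch (not drawn from the paper; the signature string and the split-string trick are illustrative assumptions): an exact byte-level signature catches the original script but misses a trivially obfuscated variant with identical behavior.

```python
# Illustrative only: a toy byte-level signature matcher, showing why
# exact matching fails against a trivially transformed DHTML variant.
SIGNATURE = 'document.write("<iframe src=\'http://evil.example/\'>")'

original = 'document.write("<iframe src=\'http://evil.example/\'>")'

# Same runtime behavior, but the string is split by concatenation,
# so the exact signature never appears in the page source.
obfuscated = 'document.write("<ifr" + "ame src=\'http://evil" + ".example/\'>")'

def signature_match(page: str) -> bool:
    """Flag the page as malicious iff the exact signature appears."""
    return SIGNATURE in page

detected_original = signature_match(original)      # True
detected_obfuscated = signature_match(obfuscated)  # False: variant missed
```

A defender must add a new signature for every such variant, while the attacker can generate variants mechanically; this asymmetry is what motivates a learning-based detector.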

There has been much related research on malware in recent years (Bergeron et al., 2001; Christodorescu & Jha, 2004; Christodorescu et al., 2005; Kinder et al., 2005; Kolter & Maloof, 2004), most of it focused on binary executables. Although the techniques used in those studies can be applied to the detection of DHTML codes, they need essential modifications to account for the characteristics of DHTML code. The features and obfuscations of an executable program differ greatly from those of a DHTML code. The distinct features of DHTML code are: (1) the code is pure text, (2) web pages may have multiple layers of links to remote pages, and (3) obfuscation is easy via garbage insertion, code reordering, data encapsulation, and other techniques. In general, writing a variant of a DHTML code is much easier and faster than writing a binary executable. For that reason, we need a different mechanism for detecting pure-text malicious DHTML codes.
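Garbage insertion, the first of the obfuscation techniques above, can be sketched in a few lines (the payload strings and junk statements are hypothetical; this is only meant to show how cheaply pure-text variants are produced):

```python
import random

# A toy sketch of garbage insertion: no-op statements are interleaved
# with the payload. Behavior is unchanged, but the byte sequence of the
# script (and hence any whole-script signature) differs on every run.
payload = [
    'var u = "http://evil.example/";',
    'var f = document.createElement("iframe");',
]

def garbage_insertion(lines):
    junk = ['var _pad%d = %d;' % (i, i) for i in range(3)]  # no-op filler
    out = []
    for line in lines:
        out.append(random.choice(junk))  # junk before each payload line
        out.append(line)
    return out

variant = garbage_insertion(payload)
# Every variant still contains all payload lines, in order.
```

Code reordering and data encapsulation work the same way: because the code is pure text, each transformation is a cheap string rewrite rather than a recompilation.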

This paper proposes a malicious-DHTML detection method based on machine learning. We explore and analyze malicious dynamic web pages to identify important features for machine learning. Using those chosen features, the classifier can efficiently judge a webpage while remaining resilient against the obfuscation of malicious web pages.

Section snippets

Related work

Malicious web content poses a significant threat to personal computer systems and has become an important rising issue. Moshchuk, Bragin, Gribble, and Levy (2006) examined the spyware problem from the Internet perspective. They used a crawler to perform a large-scale longitudinal study of the Web. Their results show that the density of spyware on the Web is substantial: on average, 1 out of 62 domains contained at least one scripted drive-by-download attack. Provos, McNamee, Mavrommatis, Wang, and

Overview

We use a machine learning approach to detect malicious web pages. A classifier is used to distinguish malicious pages from benign ones. We collected web pages from the Internet as training data for the classifier. The data are processed through a feature extraction engine to obtain the features for the classifier. The framework is shown in Fig. 1. We describe each part in detail in the following sections.
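The feature extraction engine can be sketched as follows. The specific features here (script count, script length, suspicious-keyword hits, longest digit run) are illustrative assumptions standing in for the paper's actual feature set, which is described later:

```python
import re

# Illustrative feature extraction for a webpage; the chosen features
# are assumptions, not the paper's actual feature set.
SUSPICIOUS_KEYWORDS = ['eval', 'unescape', 'document.write', 'fromCharCode']

def extract_features(page_html: str) -> dict:
    # Pull the bodies of all <script> blocks out of the page text.
    scripts = re.findall(r'<script[^>]*>(.*?)</script>',
                         page_html, re.DOTALL | re.IGNORECASE)
    body = ''.join(scripts)
    return {
        'script_count': len(scripts),
        'script_length': len(body),
        'keyword_hits': sum(body.count(k) for k in SUSPICIOUS_KEYWORDS),
        # long numeric runs often indicate encoded shellcode or payloads
        'max_digit_run': max((len(m) for m in re.findall(r'\d+', body)),
                             default=0),
    }

page = '<html><script>eval(unescape("%41%42"))</script></html>'
features = extract_features(page)
```

Each page is thereby mapped to a fixed-length numeric vector, which is the form the downstream classifier consumes.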

Dataset

We collected benign and malicious DHTML web pages from web sites based on the malicious-URL list published by StopBadWare. In total, we had 965 benign samples and 176 malicious samples. To test the ability to detect variants of attacks, we manually categorized the malicious samples into nine pre-defined types (listed in Appendix A) according to the techniques used by the attackers. Each category contains malicious HTML codes and their variants.

Comparison of features

We describe the experiment

Conclusions

In this paper, we propose a malicious-webpage detection method based on machine learning. We analyze the characteristics of malicious web pages and present relevant features for machine learning. The chosen features not only represent a malicious webpage effectively but also remain resilient against the obfuscation of malicious DHTML codes. We compared four classification algorithms, and the boosted decision tree performed best among them. Experimental results demonstrate
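The boosting idea behind the best-performing classifier can be illustrated with a from-scratch toy: AdaBoost over decision stumps (this is a generic sketch in the spirit of a boosted decision tree, not the authors' implementation; the feature vectors and labels are made up):

```python
import math

# Toy dataset: (feature_vector, label); +1 = malicious, -1 = benign.
# Feature 0 might be keyword hits, feature 1 script length (assumptions).
data = [
    ([3, 120], +1), ([5, 400], +1), ([0, 10], -1),
    ([1, 30], -1), ([4, 250], +1), ([0, 5], -1),
]

def stump_predict(stump, x):
    f, thresh, sign = stump
    return sign if x[f] > thresh else -sign

def train_adaboost(data, rounds=5):
    n = len(data)
    w = [1.0 / n] * n          # uniform sample weights to start
    ensemble = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for f in range(2):
            for thresh in sorted({x[f] for x, _ in data}):
                for sign in (+1, -1):
                    err = sum(wi for wi, (x, y) in zip(w, data)
                              if stump_predict((f, thresh, sign), x) != y)
                    if best is None or err < best[0]:
                        best = (err, (f, thresh, sign))
        err, stump = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        # Reweight: boost the samples this stump misclassified.
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def classify(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return +1 if score > 0 else -1

model = train_adaboost(data)
```

Each weak stump splits on one feature; boosting reweights the hard samples so later stumps focus on them, and the weighted vote of all stumps yields the final malicious/benign decision.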

References (14)

  • Bergeron, J., Debbabi, M., Desharnais, J., Erhioui, M. M., Lavoie, Y., & Tawbi, N. (2001). Static detection of malicious...
  • Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of...
  • Christodorescu, M., & Jha, S. (2004). Testing malware detectors. In Proceedings of the ACM SIGSOFT international...
  • Christodorescu, M., Jha, S., Seshia, S. A., Song, D., & Bryant, R. E. (2005). Semantics-aware malware detection. In...
  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the thirteenth...
  • Kinder, J., Katzenbeisser, S., Schallhart, C., & Veith, H. (2005). Detecting malicious code by model checking. In...
  • Kolter, J. Z., & Maloof, M. A. (2004). Learning to detect malicious executables in the wild. In Proceedings of the...
There are more references available in the full text version of this article.
