WebMon: ML- and YARA-based malicious webpage detection

doi:10.1016/j.comnet.2018.03.006

Computer Networks

Volume 137, 4 June 2018, Pages 119-131

https://doi.org/10.1016/j.comnet.2018.03.006

Abstract

Attackers exploit the openness of the Internet to disseminate malware. Their attempts to infect target systems via the Web have increased over time and are unlikely to abate. In response to this threat, we present WebMon, an automated, low-interaction malicious webpage detector that identifies invasive roots in Web resources loaded by WebKit2-based browsers using machine learning and YARA signatures. WebMon detects hidden exploit code by tracing linked URLs to confirm whether the relevant websites are malicious. WebMon detects a variety of attacks by running 250 containers simultaneously. In this configuration, the proposed model yields a detection rate of 98% and is 7.6 times faster (with a container) than previously proposed models. Most importantly, WebMon’s focus on extracting malicious paths in a domain is a novel approach that has not been explored in previous studies.

Introduction

Since the initial development of Web browsers, there has been a growing number of attempts to infect online systems by transmitting malware through browsers. In this Web architecture, one malicious webpage can contaminate several thousand user PCs in minutes. The concealment of malware on webpages is therefore one of the most dangerous types of cyberattack and poses a significant threat to the integrity of critical systems. More recent types of malware, such as ransomware, in conjunction with exploit toolkits, have evolved to become more complex, automated, and impossible to decrypt. Thus, detecting websites that propagate malware and developing techniques to neutralize them is crucial.

Malicious URLs that activate drive-by downloads are a popular form of exploitation and malware delivery. Hence, fast detection of malicious URLs is useful, because the URLs can then be distributed to the blacklists maintained by various security systems. (At present, these databases might contain a great deal of outdated information.) Thus, rapidly finding malicious URLs among countless webpages, which are live only for very limited amounts of time, is an important task for security research.

Previous studies [1], [2], [3] have exposed limitations in the extraction of malicious URL paths within webpages, even when diverse malicious traces exist. Previous systems have had difficulty detecting exploit kits (EKs) [1], and have suffered from low performance [2] and unstable architectures [3]. High performance is especially important for detecting many malicious webpages because server capacity cannot be increased without limit. The prevalent dynamic systems impose a significant processing burden, whereas static systems have low detection rates for browser-plugin-based attacks. Under these circumstances, our approach provides an effective architectural design that overcomes these limitations.

In this paper, we propose a WebKit2- [4], machine learning (ML)-, and YARA-based [5] low-interaction webpage analyzer. Combining these components improves the functionality of state-of-the-art systems and classifies malicious redirects. The main contributions of this study are as follows:

• We propose a practical model using a WebKit2-, ML- and YARA-based framework for large-scale malicious webpage detection. This framework utilizes Docker containers and a multiserver architecture to provide a secure, scalable, and stable structure.
• We introduce WebMon, which is 7.6 times faster than conventional malicious webpage detection tools and has a detection rate of 98%; it can determine the maliciousness of 11.13 domains per second in single-server mode.
• We present a call tree algorithm that constructs a malicious URL redirection tree that enables us to clearly determine malicious paths.
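
The idea behind the redirection tree in the last contribution can be illustrated with a minimal Python sketch. It assumes that redirections have been recorded as (parent URL, child URL) pairs and that some URLs have already been flagged as malicious. The function names `build_call_tree` and `malicious_paths` and all example URLs are hypothetical; this is not WebMon's actual algorithm, only one way to recover root-to-leaf malicious paths from such a tree.

```python
from collections import defaultdict


def build_call_tree(edges):
    """Map each parent URL to the list of URLs it loaded or redirected to."""
    tree = defaultdict(list)
    for parent, child in edges:
        tree[parent].append(child)
    return dict(tree)


def malicious_paths(tree, root, flagged):
    """Return every path that starts at root and ends at a flagged URL."""
    paths = []
    stack = [(root, [root])]
    while stack:
        url, path = stack.pop()
        if url in flagged:
            paths.append(path)
        for child in tree.get(url, []):
            if child not in path:  # guard against redirect loops
                stack.append((child, path + [child]))
    return sorted(paths)


# Hypothetical redirection chain: landing page -> ad script -> gate -> exploit.
edges = [
    ("http://landing.example/", "http://ads.example/banner.js"),
    ("http://landing.example/", "http://cdn.example/lib.js"),
    ("http://ads.example/banner.js", "http://gate.example/r.php"),
    ("http://gate.example/r.php", "http://ek.example/exploit.swf"),
]
tree = build_call_tree(edges)
paths = malicious_paths(
    tree, "http://landing.example/", {"http://ek.example/exploit.swf"}
)
```

In this example `paths` holds a single chain running from the landing page through the ad script and the gate to the flagged URL, while the benign `cdn.example` branch is left out.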

The remainder of this paper is organized as follows: Section 2 presents related work and describes our research. Section 3 describes a variety of features that can be used for malware detection. Section 4 provides an overview of the proposed model. In Section 5, we explain the technical details of our implementation; then, we discuss the dataset used, the experimental setup, and our experimental results in Section 6. The limitations of our model and our conclusions are outlined in Section 7.

Section snippets

Related work

Malicious webpage detection approaches can be categorized as feature-, pattern-, and behavior-based.

Detection methods

In this section, we explain the features used to determine whether webpages are malicious. In particular, we focus on the features of EKs, because most malicious content centers on EK servers that exploit user hosts. We also consider two metrics in feature selection: performance and detection rate. In this work, we introduce 11 feature classes for identifying EKs. We use a limited number of features to increase system performance.

Architecture

In this section, we describe WebMon’s framework. WebMon consists of a queue server, Docker with multiple containers, and a database, as shown in Fig. 2. This system provides crawling-based malicious webpage detection.
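
The division of labor among these three components can be sketched in Python under simplifying assumptions: an in-process `queue.Queue` stands in for the queue server, threads stand in for Docker containers, and an in-memory SQLite database stands in for the database. The `analyze` function is a placeholder, and all names and URLs are hypothetical rather than taken from WebMon.

```python
import queue
import sqlite3
import threading


def analyze(url):
    """Placeholder verdict; a real worker would render and inspect the page."""
    return "malicious" if "exploit" in url else "benign"


def worker(jobs, results):
    """Stand-in for one container: pull URLs until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            break
        results.put((url, analyze(url)))


def run(urls, n_workers=4):
    """Distribute URLs to workers, then store the verdicts in a database."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    # All workers have finished, so the results queue is complete here.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE verdicts (url TEXT PRIMARY KEY, verdict TEXT)")
    while not results.empty():
        db.execute("INSERT INTO verdicts VALUES (?, ?)", results.get())
    db.commit()
    return db


db = run(["http://a.example/", "http://b.example/exploit.html", "http://c.example/"])
rows = db.execute("SELECT url, verdict FROM verdicts ORDER BY url").fetchall()
```

Because the workers share nothing except the two queues, the number of workers can be changed without touching the analysis or storage code, which is the property a container-based design relies on.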

Implementation

WebKit2 uses a split-process model in which the user interface (UI) process and the Web process are separated by an API boundary. These processes communicate via messages. In this section, we address the implementation of the call tree and of YARA- and ML-based malware detection. The call tree operates in the UI process, and YARA matching and classification operate in the Web process.

Evaluation

In this section, we select the ML classifier best suited to our feature selection and compare WebMon with state-of-the-art systems.

Discussion and conclusions

Although the proposed model resolves some of the issues in previously proposed tools, it still faces several problems. Our primary concern is that Flash Player or Java applet attacks are loaded directly, without exploit traces, using document.documentElement.nodeName, <object>, <embed>, or <div> elements, which creates (seemingly benign) attack URLs or runs obfuscated exploit scripts hidden in ActionScript. In such cases, evidence to characterize the malicious symptoms of webpages is insufficient (except

Sungjin Kim is currently a PhD student at KAIST. He received his BS and MS degrees in computer science from Ohio State University and Sogang University, respectively. His current research interests include various security issues relevant to malware, SNA, deep web, deep learning, big data, risk analysis, and network-based security systems.

References (29)

Y.T. Hou
Malicious web content detection by machine learning
Expert. Syst. Appl.
(2010)

M. Cova et al.
Detection and analysis of drive-by-download attacks and malicious JavaScript code
Proceedings of the 19th International Conference on World Wide Web, ACM
(2010)

The Cuckoo Sandbox,...

Thug,...

WebKit2,...

YARA,...

D. Canali et al.
Prophiler: A fast filter for a large number of detections of malicious webpage categories and subject descriptors
Proceedings of the International World Wide Web Conference (WWW)
(2011)

J. Ma
Beyond blacklists: learning to detect malicious web sites from suspicious URLs
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM
(2009)

J. Ma et al.
Identifying suspicious URLs: an application of large-scale online learning
Proceedings of the 26th Annual International Conference on Machine Learning, ACM
(2009)

M.A. Rajab
CAMP: Content-agnostic malware protection
NDSS
(2013)

C. Curtsinger
ZOZZLE: Fast and precise in-browser JavaScript malware detection
USENIX Security Symposium
(2011)

Anubis,...

S. Kim
Logos: Internet-Explorer-based malicious webpage detection
ETRI J.
(2017)

Docker,...

Jinkook Kim received his B.S. and M.S. degrees from Arizona University and Cornell. His research interests include big data analysis and data mining for handling unorganized data. In particular, he aims to study the abnormal behaviors of people and adversaries on networks and in online games.

Seokwoo Nam is currently working as a senior researcher at SGA solutions. He received a B.S. degree in computer science from Sejong Cyber University. His research interests are focused on large-scale malware analysis and threat analysis.

Dohoon Kim is currently an assistant professor at Kyonggi University. Before KU, he received his BS in mathematics and a double degree in computer science & engineering at Korea University, Seoul, in 2005. He received a PhD degree from the College of Information and Communication at Korea University in 2012. He is an information security reader and senior researcher in the IT Management & Support Office at the Agency for Defense Development. His current research interests are network security, risk management, cognitive radio networks, software engineering, situational awareness, future Internet, and forecast engineering.
