Elsevier

Computer Networks

Volume 137, 4 June 2018, Pages 119-131
WebMon: ML- and YARA-based malicious webpage detection

https://doi.org/10.1016/j.comnet.2018.03.006

Abstract

Attackers use the openness of the Internet to facilitate the dissemination of malware. Their attempts to infect target systems via the Web have increased over time and are unlikely to abate. In response to this threat, we present an automated, low-interaction malicious webpage detector, WebMon, that identifies invasive roots in Web resources loaded from WebKit2-based browsers using machine learning and YARA signatures. WebMon effectively detects hidden exploit code by tracing linked URLs to confirm whether the relevant websites are malicious. WebMon detects a variety of attacks by running 250 containers simultaneously. In this configuration, the proposed model yields a detection rate of 98% and is 7.6 times faster (with a container) than previously proposed models. Most importantly, WebMon’s focus on extracting malicious paths in a domain is a novel approach that has not been explored in previous studies.

Introduction

Since the initial development of Web browsers, there has been a growing number of attempts to infect online systems by transmitting malware through browsers. In this Web architecture, one malicious webpage can contaminate several thousand user PCs in minutes. Therefore, the concealment of malware on webpages is one of the most dangerous types of cyberattack and poses a significant threat to the integrity of critical systems. More recent types of malware, such as ransomware, in conjunction with exploit toolkits, have evolved to become more complex, automated, and impossible to decrypt. Thus, detecting websites that propagate malware and developing techniques to neutralize them are crucial.

Malicious URLs that activate drive-by downloads are a popular form of exploitation and malware delivery. Hence, fast detection of malicious URLs is useful, because the URLs can then be distributed to blacklists maintained by various security systems. (At present, these databases might contain a great deal of outdated information.) Thus, rapidly finding malicious URLs, which are live only for a very limited time, among countless webpages is an essential task for security research.

Previous studies [1], [2], [3] have exposed limitations in the extraction of malicious URL paths within webpages, even if diverse malicious traces exist. Previous systems have shown the difficulty of detecting exploit kits (EKs) [1], and suffered from low performance [2] and unstable architectures [3]. High performance is especially important for detecting many malicious webpages because server capacity cannot be increased without limit. The prevalent dynamic systems impose a significant processing burden, whereas static systems have low detection rates for browser-plugin-based attacks. Under these circumstances, our approach provides an effective architectural design that overcomes previous limitations.

In this paper, we propose a WebKit2- [4], machine learning (ML)-, and YARA-based [5] low-interaction webpage analyzer. Combining these components improves the functionality of state-of-the-art systems and classifies malicious redirects. The main contributions of this study are as follows:

  • We propose a practical model using a WebKit2-, ML- and YARA-based framework for large-scale malicious webpage detection. This framework utilizes Docker containers and a multiserver architecture to provide a secure, scalable, and stable structure.

  • We introduce WebMon, which is 7.6 times faster than conventional malicious webpage detection tools and has a detection rate of 98%; it can determine the maliciousness of 11.13 domains per second in single-server mode.

  • We present a call tree algorithm that constructs a malicious URL redirection tree that enables us to clearly determine malicious paths.
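The redirection-tree idea in the last contribution can be illustrated with a minimal Python sketch. It is not the authors' call tree algorithm, whose details belong to the implementation section; it only assumes that a crawler has already recorded (parent URL, child URL) load events and that some detector has flagged a set of URLs as malicious. The function names `build_tree` and `malicious_paths` and all example URLs are hypothetical.

```python
from collections import defaultdict


def build_tree(edges):
    """Map each parent URL to the list of URLs it loaded or redirected to."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children


def malicious_paths(children, root, flagged):
    """Return every root-to-node path that ends at a flagged URL."""
    paths = []
    stack = [(root, [root])]
    while stack:
        url, path = stack.pop()
        if url in flagged:
            paths.append(path)
        for child in children.get(url, []):
            if child not in path:  # skip redirect loops
                stack.append((child, path + [child]))
    return paths


# Hypothetical load events observed while rendering one landing page.
edges = [
    ("http://landing.example/", "http://cdn.example/lib.js"),
    ("http://landing.example/", "http://ads.example/frame.html"),
    ("http://ads.example/frame.html", "http://gate.example/redirect"),
    ("http://gate.example/redirect", "http://ek.example/exploit"),
]
tree = build_tree(edges)
paths = malicious_paths(tree, "http://landing.example/", {"http://ek.example/exploit"})
```

Here `paths` holds the single chain from the landing page through the ad frame and the redirect gate to the flagged exploit URL, which is the kind of malicious path the contribution describes.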

The remainder of this paper is organized as follows: Section 2 presents related work and describes our research. Section 3 describes a variety of features that can be used for malware detection. Section 4 provides an overview of the proposed model. In Section 5, we explain the technical details of our implementation; then, we discuss the dataset used, the experimental setup, and our experimental results in Section 6. The limitations of our model and our conclusions are outlined in Section 7.

Section snippets

Related work

Malicious webpage detection approaches can be categorized as feature-, pattern-, and behavior-based.

Detection methods

In this section, we explain features for determining whether webpages are malicious. In particular, we focus on the features of EKs, because most malicious content centers on EK servers that exploit user hosts. We also consider two metrics in feature selection: performance and detection rate. In this work, we introduce 11 feature classes for identifying EKs. We utilize a limited number of features to increase system performance.

Architecture

In this section, we describe WebMon’s framework. WebMon consists of a queue server, Docker with multiple containers, and a database, as shown in Fig. 2. This system provides crawling-based malicious webpage detection.
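The queue, worker, and database flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WebMon's implementation: threads stand in for Docker containers, an in-memory SQLite table stands in for the database, and `analyze` is a hypothetical placeholder for the actual page analysis.

```python
import queue
import sqlite3
import threading


def analyze(url):
    """Hypothetical placeholder verdict; a real worker would render and inspect the page."""
    return "malicious" if "exploit" in url else "benign"


def worker(tasks, results):
    """Stand-in for one container: pull URLs until the task queue is drained."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        results.put((url, analyze(url)))


def run_pipeline(urls, n_workers=4):
    """Queue the URLs, fan them out to workers, and store the verdicts."""
    tasks, results = queue.Queue(), queue.Queue()
    for url in urls:
        tasks.put(url)
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Write from the main thread only; sqlite3 connections are
    # not shared across threads by default.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE verdicts (url TEXT PRIMARY KEY, label TEXT)")
    while not results.empty():
        db.execute("INSERT INTO verdicts VALUES (?, ?)", results.get())
    db.commit()
    return db


db = run_pipeline([
    "http://a.example/",
    "http://b.example/exploit.html",
    "http://c.example/",
])
rows = db.execute("SELECT url, label FROM verdicts ORDER BY url").fetchall()
```

Keeping the workers stateless and routing all results through one store is what lets such a design scale by adding workers; the number of workers here is arbitrary and unrelated to the container counts reported in the paper.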

Implementation

WebKit2 uses a split-process model, divided at the API boundary into a user interface (UI) process and a Web process. These processes communicate by exchanging messages. In this section, we address the implementation of the call tree and YARA- and ML-based malware detection. The call tree operates in the UI process, and YARA matching and classification operate in the Web process.
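The division of labor between the two processes can be pictured with a small Python sketch. It is only an analogy under loud assumptions: a `deque` stands in for WebKit2's inter-process messaging, two toy regular expressions stand in for YARA rules (they are not YARA syntax and not the paper's signatures), the ML classifier is omitted, and the message fields `parent`, `url`, and `hits` are hypothetical.

```python
import re
from collections import deque

# Toy regex signatures standing in for YARA rules (not real YARA syntax).
SIGNATURES = {
    "eval_unescape": re.compile(r"eval\s*\(\s*unescape\s*\("),
    "hidden_iframe": re.compile(r"<iframe[^>]*(width|height)\s*=\s*[\"']?0"),
}


def web_process(resources, channel):
    """Web-process side: scan each loaded resource and post one message per resource."""
    for parent, url, body in resources:
        hits = sorted(name for name, rx in SIGNATURES.items() if rx.search(body))
        channel.append({"parent": parent, "url": url, "hits": hits})


def ui_process(channel):
    """UI-process side: drain messages and record the call tree with per-node hits."""
    tree, hits_by_url = {}, {}
    while channel:
        msg = channel.popleft()
        tree.setdefault(msg["parent"], []).append(msg["url"])
        hits_by_url[msg["url"]] = msg["hits"]
    return tree, hits_by_url


# Hypothetical resources: (parent URL, resource URL, resource body).
resources = [
    (None, "http://site.example/",
     '<iframe src="http://x.example/a" width=0 height=0></iframe>'),
    ("http://site.example/", "http://x.example/a", "eval(unescape('%61%6c'))"),
]
channel = deque()
web_process(resources, channel)
tree, hits_by_url = ui_process(channel)
```

The point of the sketch is the direction of data flow: matching happens where the content is loaded, and only compact results cross the boundary to the side that maintains the tree.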

Evaluation

In this section, we choose an ML classifier optimized for our feature selection and compare WebMon with state-of-the-art systems.

Discussion and conclusions

Though the proposed model has resolved some of the issues in previously proposed tools, it still faces several problems. Our primary concern is that Flash Player or Java applet attacks are directly loaded without exploit traces using document.documentElement.nodeName, <object>, <embed>, or <div> elements, which create (seemingly benign) attack URLs or run obfuscated exploit scripts hidden in ActionScript. Then, evidence to characterize the malicious symptoms of webpages is insufficient (except

References (29)

  • Y.T. Hou

    Malicious web content detection by machine learning

    Expert. Syst. Appl.

    (2010)
  • M. Cova et al.

    Detection and analysis of drive-by-download attacks and malicious JavaScript code

    Proceedings of the 19th International Conference on World Wide Web, ACM

    (2010)
  • The Cuckoo Sandbox,...
  • Thug,...
  • WebKit2,...
  • YARA,...
  • D. Canali et al.

    Prophiler: A fast filter for the large-scale detection of malicious web pages

    Proceedings of the International World Wide Web Conference (WWW)

    (2011)
  • J. Ma

    Beyond blacklists: Learning to detect malicious web sites from suspicious URLs

    Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM

    (2009)
  • J. Ma et al.

    Identifying suspicious URLs: An application of large-scale online learning

    Proceedings of the 26th Annual International Conference on Machine Learning, ACM

    (2009)
  • M.A. Rajab

    CAMP: Content-agnostic malware protection

    NDSS

    (2013)
  • C. Curtsinger

    ZOZZLE: Fast and precise in-browser JavaScript malware detection

    USENIX Security Symposium

    (2011)
  • Anubis,...
  • S. Kim

    Logos: Internet-Explorer-based malicious webpage detection

    ETRI J.

    (2017)
  • Docker,...
    Sungjin Kim is currently a PhD student at KAIST. He received the BS and MS in computer science from Ohio State University and Sogang University respectively. His current research interests include various security issues relevant to malware, SNA, deep web, deep learning, big data, risk analysis, and network-based security systems.

    Jinkook Kim received his B.S. and M.S. degrees from Arizona University and Cornell. His research interests include big data analysis and data mining for handling unorganized data. In particular, his research aims to study the abnormal behaviors of people and adversaries on networks or in online games.

    Seokwoo Nam is currently working as a senior researcher at SGA solutions. He received a B.S. degree in computer science from Sejong Cyber University. His research interests are focused on large-scale malware analysis and threat analysis.

    Dohoon Kim is currently an assistant professor at Kyonggi University. Before KU, he received his BS in mathematics and a double degree in computer science & engineering at Korea University, Seoul, in 2005. He received a PhD degree from the College of Information and Communication at Korea University in 2012. He is an information security reader and senior researcher in the IT Management & Support Office at the Agency for Defense Development. His current research interests are network security, risk management, cognitive radio networks, software engineering, situational awareness, future Internet, and forecast engineering.
