Elsevier

Information Sciences

Volume 603, July 2022, Pages 42-59
A lightweight data representation for phishing URLs detection in IoT environments

https://doi.org/10.1016/j.ins.2022.04.059

Highlights

  • A new feature set composed of 46 features for phishing URLs detection was proposed.

  • An updated feature selection algorithm to obtain the most representative features.

  • An optimum subset composed of 9 features for phishing URLs detection was proposed.

  • Random Forest obtains the best Accuracy results using the feature subset proposed.

  • The Accuracy results obtained outperform those reported in the literature.

Abstract

Phishing is a cyber-attack that exploits victims’ technical ignorance or naivety and commonly involves a Uniform Resource Locator (URL). Hence, it is advantageous to detect a phishing attack by analyzing URLs before accessing them. With the rise of the Internet of Things (IoT), phishing attacks are moving to this field because of the number of IoT devices and the amount of personal information they handle. Although several approaches have been proposed for phishing attack detection, URL-based Machine Learning approaches obtain better performance results, but all of them depend on the feature set used. Paradoxically, only a few works on selecting the best-suited feature set for improving the phishing detection process have been published. The present research explores how to obtain a feature set that substantially enhances the phishing detection rate in IoT environments. To this end, a feature selection algorithm was adopted and extended to obtain the most representative feature set. When Random Forest is used with the proposed data representation, the phishing URL detection rate is 99.57%.

Introduction

With the rise of the Internet in the early ’90s, it became clear that a new technological revolution was happening: the Information Revolution. As part of this revolution, many areas of society evolved from traditional to online forms. Along with the Information Revolution, and taking advantage of the ubiquity of online life, many offenses also moved to the online world (now known as cyber-offenses). One of the most profitable cyber-offenses is phishing. In 2010, the global online banking fraud was $1.692 million, and of these, $320 million correspond to phishing attacks [4], making it one of the most effective and profitable scams over the Internet [12]. According to the Anti-Phishing Working Group (APWG) [5], the total number of phishing sites detected in the third quarter of 2020 was 571,746. This was up 75% from the 146,994 seen in Q2 of the same year. One disturbing trend reported in [5] is the growing use of HTTPS in phishing attacks. HTTPS is commonly associated with trusted and secure sites, and those two assumptions make phishing websites served over HTTPS more difficult for users to detect and, therefore, more dangerous.

The victim is persuaded to hand over their data during a phishing attack, usually through a Uniform Resource Locator (URL). Generally, the URL involved in a phishing attack is disguised by using long sequences of alphanumeric characters or by introducing characters similar to those of the original URL (e.g., www.yah00.com instead of www.yahoo.com), among other techniques. If the malicious URL is delivered to devices with small screens (e.g., cellphones, tablets, and smartwatches), the attack is even more effective because the screen area must be optimized: the address bar is commonly shortened or even hidden. The devices mentioned above are part of (among others) the so-called Internet of Things (IoT) [6]. Many IoT devices are used to share documents, purchase goods online, chat with friends, and record personal information such as heart rate and sleep quality. IoT devices store more personal information than any other device at any time. Considering those facts, it is expected that the targets of cyber-attacks will move to IoT devices [6] and their users. Also, the cyber-attack expected to grow more quickly than any other is phishing [12], [6], which is very attractive to cyber-offenders because of the physical features and security issues of IoT devices [6].

Considering the phishing taxonomy proposed in [34], the present research is focused on exploring new data representations for improving the performance of Machine Learning algorithms for phishing detection. Although several works have been reported for phishing URLs detection, some focused on determining which classifier performs better considering pre-defined features obtained using third-party services. Those works also use complex data structures and data representations combined with computationally intensive processes, making them unsuitable for adoption in IoT devices [6]. Besides, some works obtain the features by visiting the suspicious web page, which implies becoming a victim of the attack. IoT devices are characterized by limited computing capabilities and low power consumption. In such cases, algorithms that run on IoT devices must be lightweight, complex data structures must be avoided, and the data sources (and the features) employed must be as simple as possible [6]. Considering the above requirements, this paper describes a lightweight data representation for phishing URLs detection in IoT environments that maximizes the detection rate. Selecting the best feature set is crucial for proposing a phishing detection approach applicable in practice. Also, only a few works reported in the literature have focused on selecting the most compelling feature set for detecting phishing URL attacks [40]. The current paper contributes to this end by proposing a new feature set and an optimized feature selection algorithm that improves the classifiers’ detection rate. The URL representation presented is evaluated using several algorithms reported in the literature, demonstrating its validity. Furthermore, the proposed data representation is language-independent, allows for real-time and zero-day attack detection, is independent of third-party services, uses feature-rich classifiers, and does not need to inspect the website pointed to by the suspicious URL.
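To make the idea of a representation built from the URL string alone more concrete, the following is a minimal sketch in Python. The feature names and the function `lexical_features` are hypothetical illustrations and are not the 46 features or the 9-feature subset proposed in the paper; the sketch only shows that such features can be computed without third-party services and without visiting the page.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone.

    The feature names are assumptions made for this sketch, not the
    feature set proposed in the paper.
    """
    # urlsplit only recognizes the host when a scheme is present, so a
    # placeholder scheme is added for parsing when the URL has none.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "uses_https": parts.scheme == "https",
        "num_dots_in_host": host.count("."),
        "digit_ratio": digits / len(url) if url else 0.0,
    }


features = lexical_features("https://www.yah00.com/login")
```

Each value is obtained with simple string operations, which is in line with the requirement that data sources and features stay as simple as possible on constrained devices.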

In summary, the key contributions of this paper are:

  1. A lightweight data representation for phishing URLs detection suitable for IoT environments.

  2. An extended feature selection algorithm that gathers and ranks the results of the Information Gain, Chi-Squared, and ReliefF algorithms. The Joint Score is a metric introduced in this feature selection algorithm for selecting the most valuable features.

  3. A starting point for researchers and practitioners to develop cyber-security solutions for IoT devices.
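The second contribution combines the outputs of several feature-ranking methods. The exact definition of the Joint Score is not given in this excerpt, so the sketch below does not reproduce it. It only illustrates one generic way to aggregate rankings, assuming (hypothetically) that each method yields one score per feature, that scores are min-max normalized per method, and that the normalized scores are averaged. All feature names and score values are made up for the example.

```python
def min_max(scores: dict) -> dict:
    """Rescale one method's scores to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {name: (v - lo) / span if span else 0.0 for name, v in scores.items()}


def combined_ranking(score_tables: list) -> list:
    """Average the normalized scores of each feature and sort descending.

    This averaged score is an assumption for illustration only; it is
    NOT the Joint Score defined in the paper.
    """
    normalized = [min_max(table) for table in score_tables]
    features = normalized[0].keys()
    combined = {
        f: sum(table[f] for table in normalized) / len(normalized)
        for f in features
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from three ranking methods (made-up values).
info_gain = {"url_length": 0.40, "digit_ratio": 0.10, "uses_https": 0.25}
chi_squared = {"url_length": 120.0, "digit_ratio": 20.0, "uses_https": 70.0}
relieff = {"url_length": 0.08, "digit_ratio": 0.02, "uses_https": 0.05}

ranking = combined_ranking([info_gain, chi_squared, relieff])
top_features = [name for name, _ in ranking[:2]]
```

Normalizing before averaging keeps a method with a large numeric range, such as Chi-Squared, from dominating the combined ordering.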

The Precision, Recall, F-Measure, and Accuracy measures were used to evaluate the classification quality obtained by the proposed lightweight URLs representation. These metrics have been used extensively in the reviewed literature.
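As a reminder of how these four measures relate to each other, the sketch below computes them from the counts of a binary confusion matrix, treating phishing as the positive class. It assumes the F-Measure is the balanced F1 score (the harmonic mean of Precision and Recall), which this excerpt does not state explicitly; the counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, Recall, F-Measure (assumed F1), and Accuracy from counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts, not taken from the paper's experiments.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

With these counts, Precision is 90/100, Recall is 90/105, and Accuracy is 175/200, which shows that the measures can diverge on the same predictions.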

The remainder of this paper is organized as follows: in Section 2, the theoretical background needed for introducing this work is described. In Section 3, representative works on phishing URLs detection are reviewed. The main idea of the present research is presented in Section 4, while experiments demonstrating the validity of the proposed lightweight URLs representation are presented in Section 5. Conclusions are drawn in Section 6.

Section snippets

Background

This section describes the main methods used in phishing attacks and provides some concepts and definitions needed for a better understanding of this work.

Related work

Considering the phishing attacks taxonomy proposed in [34], this paper is focused on proposing a lightweight URLs representation for detecting phishing URLs in IoT environments. During a phishing attack, it is very common to deliver a malicious URL to the victim through some medium (usually an email link, a WhatsApp or Facebook message, or another instant messaging system, among others) and to ask the victim to access it. If the victim accesses the phishing URL, the malicious actions are

Lightweight data representation for URL phishing detection

Some of the reviewed papers inspect the content of the web pages by downloading them and using NLP techniques to analyze their source code and visual features. Given the aims of this research, accessing the suspicious web page is avoided, both because of the limitations of IoT devices and for security reasons: visiting a suspicious web page already means falling for the scam. Phishing scams have a limited lifetime (around 24 h or less). However, lexical features remain

Experiments

This section evaluates the viability of the proposed strategy for feature reduction and the Lightweight dataset for phishing URLs detection in IoT environments.

Conclusions

This paper proposes a Lightweight dataset for phishing detection in IoT environments. First, several features related to the length of some parts of the URL, HTTP/S-related aspects of the URL, NLP, and some rates were measured. For IoT environments, where the computational resources are limited, it is mandatory to use lightweight datasets, algorithms, and data structures. Because of this, an ensemble of feature selection methods was adopted and improved for obtaining a

CRediT authorship contribution statement

Lázaro Bustio-Martínez: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Miguel A. Álvarez-Carmona: Conceptualization, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Vitali Herrera-Semenets: Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Claudia Feregrino-Uribe:

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by the project 2017-01-7092 from CONACyT, Mexico.

References (40)

  • R.J. Anderson et al.
  • APWG, Phishing Activity Trends Report – 3rd Quarter 2020, 2020....
  • J. Bergstra et al., Random Search for Hyper-Parameter Optimization, J. Mach. Learn. Res. (2012)
  • T. Berners-Lee, Uniform Resource Locators (URL), 2018. URL:...
  • S. Chan, P. Treleaven, Chapter 5 – Continuous Model Selection for Large-Scale Recommender Systems, in: Govindaraju, V.,...
  • M. Chatterjee, A. Namin, Detecting Phishing Websites through Deep Reinforcement Learning, in: 2019 IEEE 43rd Annual...
  • Q. Cui et al., Phishing Attacks Modifications and Evolutions
  • G. Forman, An Extensive Empirical Study of Feature Selection Metrics for Text Classification, J. Mach. Learn. Res. (2003)
  • M.A. Hall et al., Practical feature subset selection for machine learning
  • Y. Huang et al., Phishing URL Detection Via Capsule-Based Neural Network
