A lightweight data representation for phishing URLs detection in IoT environments
Introduction
With the rise of the Internet in the early ’90s, it became clear that a new technological revolution was happening: the Information Revolution. As part of this revolution, many areas of society evolved from traditional to online forms. Along with the Information Revolution, and taking advantage of the ubiquity of online life, many offenses also moved to the online world (now known as cyber-offenses). One of the most profitable cyber-offenses is phishing. In 2010, the global online banking fraud was million, and of these, million correspond to phishing attacks [4], making it one of the most effective and profitable scams over the Internet [12]. According to the Anti-Phishing Working Group (APWG) [5], the total number of phishing sites detected in the third quarter of 2020 was . This was up from the seen in Q2 of the same year. One disturbing trend reported in [5] is the growth of HTTPS use in phishing attacks. HTTPS is commonly associated with trustworthy and secure sites, and those two assumptions make phishing websites served over HTTPS more difficult for users to detect and, therefore, more dangerous.
During a phishing attack, the victim is persuaded to hand over their data, usually through a Uniform Resource Locator (URL). Generally, the URL involved in a phishing attack is disguised by using long sequences of alphanumeric characters or by introducing characters similar to those of the original URL (e.g., www.yah00.com instead of www.yahoo.com), among other techniques. If the malicious URL is delivered to devices with small screens (e.g., cellphones, tablets, and smartwatches), the attack is even more effective because the screen area must be optimized: the address bar is commonly reduced or even hidden. The devices mentioned above are part of (among others) the so-called Internet of Things (IoT) [6]. Many IoT devices are used to share documents, purchase goods online, chat with friends, and record personal information such as heart rate and sleep quality. IoT devices store more personal information than any other device at any time. Considering those facts, it is expected that the targets of cyber-attacks will move to IoT devices [6] and their users. Also, the cyber-attack expected to grow more quickly than any other is phishing [12], [6], which is very attractive to cyber-offenders due to the physical features and security issues of IoT devices [6].
Considering the phishing taxonomy proposed in [34], the present research focuses on exploring new data representations for improving the performance of Machine Learning algorithms for phishing detection. Although several works have been reported on phishing URL detection, some focused on determining which classifier performs better using pre-defined features obtained from third-party services. Those works also use complex data structures and data representations combined with computationally intensive processes, making them unsuitable for adoption in IoT devices [6]. Besides, some works obtain the features by visiting the suspicious web page, which implies becoming a victim of the attack. IoT devices are characterized by limited computing capabilities and low power consumption. In such cases, algorithms that run on IoT devices must be lightweight, complex data structures must be avoided, and the data sources (and the features) employed must be as simple as possible [6]. Considering the above requirements, this paper describes a lightweight data representation for phishing URL detection in IoT environments that maximizes the detection rate. Selecting the best feature set is crucial for proposing a phishing detection approach applicable in practice. Also, only a few works reported in the literature have focused on selecting the most compelling feature set for detecting phishing URL attacks [40]. The current paper contributes to this end by proposing a new feature set and an optimized feature selection algorithm that improves the classifiers’ detection rate. The URL representation presented is evaluated using several algorithms reported in the literature, demonstrating its validity. Furthermore, the proposed data representation is language-independent, allows real-time and zero-day attack detection, is independent of third-party services, uses feature-rich classifiers, and does not need to inspect the website pointed to by the suspicious URL.
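To make the idea of a URL-only representation concrete, the following minimal sketch computes a handful of cheap lexical features directly from the URL string, without contacting the suspicious site or any third-party service. The function name `lexical_features` and the specific features chosen here are assumptions made for illustration; they are not the feature set proposed in this paper.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few URL-only features.

    ASSUMPTION: these features are illustrative examples, not the
    paper's actual feature set.
    """
    # urlsplit only recognises the host when a scheme is present,
    # so a default scheme is added for bare hosts such as "www.yah00.com".
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),                        # length of the raw string
        "host_length": len(host),                      # length of the host name
        "path_length": len(parts.path),                # length of the path component
        "uses_https": int(parts.scheme.lower() == "https"),
        "dot_count": host.count("."),                  # dots in the host name
        "digit_ratio": digits / len(url) if url else 0.0,
    }


if __name__ == "__main__":
    print(lexical_features("www.yah00.com"))
    print(lexical_features("https://www.yahoo.com/login"))
```

Every value is obtained by simple string operations, which is the kind of computation that suits devices with limited processing power.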
In summary, the key contributions of this paper are:
1. A lightweight data representation for phishing URL detection suitable for IoT environments.
2. An extended feature selection algorithm that gathers and ranks the results of the Information Gain, Chi-Squared, and ReliefF algorithms. The Joint Score is a metric introduced in this feature selection algorithm for selecting the most valuable features.
3. A starting point for researchers and practitioners to develop cyber-security solutions for IoT devices.
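The general idea of combining several feature selection methods into one ranking can be sketched as follows. This excerpt does not give the definition of the Joint Score, so the aggregation rule below (the mean of min-max normalised scores) is an assumed stand-in, and the function names `min_max` and `joint_ranking` as well as the toy scores are hypothetical.

```python
def min_max(scores: dict) -> dict:
    """Rescale one method's scores to [0, 1] so methods are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {f: (s - lo) / span if span else 0.0 for f, s in scores.items()}


def joint_ranking(per_method: dict) -> list:
    """Rank features by the mean of their normalised scores across methods.

    ASSUMPTION: this averaging rule is a placeholder for illustration;
    it is not the Joint Score defined in the paper.
    """
    normalised = [min_max(scores) for scores in per_method.values()]
    features = list(next(iter(per_method.values())))
    joint = {
        f: sum(n[f] for n in normalised) / len(normalised) for f in features
    }
    return sorted(joint.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy scores invented for the example, not experimental results.
    toy_scores = {
        "information_gain": {"url_length": 0.30, "dot_count": 0.10, "uses_https": 0.20},
        "chi_squared": {"url_length": 40.0, "dot_count": 10.0, "uses_https": 25.0},
        "relieff": {"url_length": 0.05, "dot_count": 0.01, "uses_https": 0.03},
    }
    for feature, score in joint_ranking(toy_scores):
        print(feature, round(score, 3))
```

Normalising before averaging keeps a method with large raw values, such as Chi-Squared, from dominating the combined ranking.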
The Precision, Recall, F-Measure, and Accuracy measures were used to evaluate the classification quality obtained by the proposed lightweight URL representation. These metrics are used extensively in the reviewed literature.
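For reference, these four measures can be computed from the counts of a binary confusion matrix using their standard definitions. The sketch below assumes phishing is the positive class and takes the F-Measure to be the balanced F1 score; the function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. ASSUMPTION: phishing is the positive class and the
    F-Measure is the balanced F1 score.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Example counts invented for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

The zero-denominator guards return 0.0 rather than raising an error, which is a design choice made for this sketch.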
The remainder of this paper is organized as follows. Section 2 describes the theoretical background needed to introduce this work. Section 3 reviews representative works on phishing URL detection. The main idea of the present research is presented in Section 4, while experiments demonstrating the validity of the proposed lightweight URL representation are presented in Section 5. Conclusions are drawn in Section 6.
Background
This section describes the main methods used in phishing attacks and provides the concepts and definitions needed for a better understanding.
Related work
Considering the phishing attack taxonomy proposed in [34], this paper focuses on proposing a lightweight URL representation for detecting phishing URLs in IoT environments. During a phishing attack, it is very common to deliver a malicious URL to the victim through some medium (usually an email link, a WhatsApp or Facebook message, or another instant messaging system, among others) and to ask the victim to access it. If the victim accesses the phishing URL, the malicious actions are
Lightweight data representation for URL phishing detection
Some of the reviewed papers inspect the content of the web pages by downloading them and using NLP techniques to analyze their source code and visual features. Considering the aims pursued in this research, accessing the suspicious web page is avoided. This is because of the limitations of IoT devices and for security reasons, since visiting a suspicious web page means, in effect, falling for the scam. Phishing scams have a limited lifetime (around 24 h or less). However, lexical features remain
Experiments
This section evaluates the viability of the proposed strategy for feature reduction and the Lightweight dataset for phishing URLs detection in IoT environments.
Conclusions
This paper proposes a Lightweight dataset for phishing detection in IoT environments. First, several features were measured: features related to the length of some parts of the URL, HTTP/S-related aspects of the URL, some NLP-related features, and some rates. For IoT environments, where computational resources are limited, it is mandatory to use lightweight datasets, algorithms, and data structures. Because of this, an ensemble of feature selection methods was adopted and improved for obtaining a
CRediT authorship contribution statement
Lázaro Bustio-Martínez: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Miguel A. Álvarez-Carmona: Conceptualization, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Vitali Herrera-Semenets: Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Claudia Feregrino-Uribe:
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was partially supported by the project 2017-01-7092 from CONACyT, Mexico.
References (40)
- et al., Understanding the Internet of Things: definition, potentials, and societal role of a fast evolving paradigm, Ad Hoc Netw. (2017)
- et al., A survey of phishing attacks: Their types, vectors and technical approaches, Expert Syst. Appl. (2018)
- et al., A novel approach for phishing urls detection using lexical based machine learning in a real-time environment, Comput. Commun. (2021)
- et al., A data reduction strategy and its application on scan and backscatter detection using rule-based classifiers, Expert Syst. Appl. (2018)
- et al., Machine learning based phishing detection from URLs, Expert Syst. Appl. (2019)
- et al., User experiences of TORPEDO: TOoltip-poweRed Phishing Email DetectiOn, Comput. Secur. (2017)
- et al., Accurate and fast url phishing detector: A convolutional neural network approach, Comput. Netw. (2020)
- et al., Hybrid Rule-Based Model for Phishing URLs Detection
- et al., Phishing attacks detection using machine learning approach
- Amazon, Alexa – The top 500 sites on the web, 2020. URL:...
- Random Search for Hyper-Parameter Optimization, J. Mach. Learn. Res.
- Phishing Attacks Modifications and Evolutions
- An Extensive Empirical Study of Feature Selection Metrics for Text Classification, J. Mach. Learn. Res.
- Practical feature subset selection for machine learning
- Phishing URL Detection Via Capsule-Based Neural Network