Content-aware malicious webpage detection using convolutional neural network

Multimedia Tools and Applications

Abstract

Malicious websites often install malware on user devices to gather user information, disrupt device operations, violate user privacy, or harm company interests. Many commercial tools are available to prevent malicious webpages from accessing devices; however, current versions of these tools may become useless as soon as a new generation of malware is released. In this study, a content-aware malicious webpage detection (CAMD) method was developed. CAMD verifies whether a webpage is malicious through a novel webpage contextual visualization process that retrieves the critical code of a webpage, transforms that code into one-dimensional grayscale images, and applies convolutional neural networks to detect malicious webpages. To verify the feasibility of the proposed CAMD method, 50,000 normal and 50,000 malicious webpages from the VirusTotal website were used. The results indicated that the proposed CAMD method achieved an accuracy exceeding 98%.
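
The pipeline named in the abstract (retrieve code, map it to a one-dimensional grayscale image, convolve) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "critical codes", so the sketch assumes they are the contents of `<script>` elements, assumes one UTF-8 byte per grayscale pixel, and uses a hand-written kernel instead of a trained network. The names `extract_scripts`, `to_gray_1d`, and `conv1d_valid` are hypothetical.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects text inside <script> elements (assumed stand-in for 'critical codes')."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.chunks.append(data)


def extract_scripts(html):
    """Return the concatenated script bodies of a page."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def to_gray_1d(code, length=64):
    """Map code to a fixed-length row of 8-bit pixels (assumption: one byte per pixel)."""
    raw = code.encode("utf-8")[:length]
    # Zero-pad short inputs so every page yields the same image width.
    return list(raw) + [0] * (length - len(raw))


def conv1d_valid(pixels, kernel):
    """1-D cross-correlation without padding, the core operation of a 1-D CNN layer."""
    k = len(kernel)
    return [
        sum(pixels[i + j] * kernel[j] for j in range(k))
        for i in range(len(pixels) - k + 1)
    ]


page = "<html><body><p>hi</p><script>eval(x)</script></body></html>"
pixels = to_gray_1d(extract_scripts(page), length=10)
features = conv1d_valid(pixels, [1, -1])  # difference kernel, chosen only for illustration
```

In a real detector the kernels would be learned from labeled pages, and the feature maps would pass through further layers before a malicious/benign decision; the sketch stops at a single convolution to show the data flow.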


Author information

Corresponding author

Correspondence to Wei-Cheng Jiang.

Ethics declarations

Conflict of Interests

Research support received by the authors is as follows: this study was funded by the National Science and Technology Council, Taiwan, under Grants MOST 110-2634-F-005-006- and 110-2221-E-029-027-. No other author has reported a potential conflict of interest relevant to this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chang, YJ., Tsai, KL., Jiang, WC. et al. Content-aware malicious webpage detection using convolutional neural network. Multimed Tools Appl 83, 8145–8163 (2024). https://doi.org/10.1007/s11042-023-15559-8

  • DOI: https://doi.org/10.1007/s11042-023-15559-8
