Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection

Altay, Betul; Dokeroglu, Tansel; Cosar, Ahmet

doi:10.1007/s00500-018-3066-4

Methodologies and Application
Published: 10 February 2018

Volume 23, pages 4177–4191 (2019)
Soft Computing

Abstract

Conventional malicious webpage detection methods rely on blacklists to decide whether a webpage is malicious. These blacklists are generally maintained by third-party organizations. However, keeping a list of all malicious websites and updating it regularly is difficult given how frequently webpages change and how rapidly their number grows. In this study, we propose a novel context-sensitive and keyword density-based method for classifying webpages using three supervised machine learning techniques: support vector machine, maximum entropy, and extreme learning machine. Features (words) of webpages are obtained from their HTML contents, and information is extracted using three feature extraction methods: existence of words, keyword frequencies, and keyword density. The performance of the proposed machine learning models is evaluated on a benchmark data set of one hundred thousand webpages. Experimental results show that the proposed method can detect malicious webpages with an accuracy of 98.24%, which is a significant improvement over state-of-the-art approaches.
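
The abstract names three feature representations (existence of words, keyword frequencies, and keyword density) without defining them on this page. The following is a minimal Python sketch of how such features might be computed from raw HTML. It is not the authors' implementation: the tokenizer, the `TextExtractor` helper, the `extract_features` function, and the definition of density as a keyword's count divided by the total number of words on the page are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def tokenize(html):
    """Lowercase the page text and split it into alphanumeric tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())


def extract_features(html, vocabulary):
    """Return (existence, frequency, density) vectors over `vocabulary`.

    Assumed definitions, not taken from the paper:
      existence: 1 if the keyword occurs on the page, else 0
      frequency: raw occurrence count of the keyword
      density:   occurrence count divided by the total word count
    """
    words = tokenize(html)
    counts = Counter(words)
    total = len(words)
    existence = [1 if counts[w] > 0 else 0 for w in vocabulary]
    frequency = [counts[w] for w in vocabulary]
    density = [counts[w] / total if total else 0.0 for w in vocabulary]
    return existence, frequency, density
```

Any of the three vectors could then be passed to a classifier such as a support vector machine. Under the definition assumed here, density normalizes for page length, so a keyword repeated on a short page weighs more than the same count on a long one.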

Author information

Authors and Affiliations

HAVELSAN, Hava Elektronik Sanayi, Ankara, Turkey
Betul Altay
Computer Engineering Department, THK University, Ankara, Turkey
Tansel Dokeroglu
Computer Engineering Department, Middle East Technical University, Ankara, Turkey
Ahmet Cosar

Corresponding author

Correspondence to Tansel Dokeroglu.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest.

Ethical standard

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

No individual participants were included in the study.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Altay, B., Dokeroglu, T. & Cosar, A. Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection. Soft Comput 23, 4177–4191 (2019). https://doi.org/10.1007/s00500-018-3066-4

Published: 10 February 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s00500-018-3066-4
