A deep learner model for multi-language webshell detection

Hannousse, Abdelhakim; Nait-Hamoud, Mohamed Cherif; Yahiouche, Salima

doi:10.1007/s10207-022-00615-5

Regular contribution
Published: 18 October 2022

Volume 22, pages 47–61, (2023)
International Journal of Information Security

Abdelhakim Hannousse¹ (ORCID: orcid.org/0000-0002-8895-4161),
Mohamed Cherif Nait-Hamoud² &
Salima Yahiouche³

Abstract

Webshell attacks are becoming more prevalent every year. Webshells are malicious scripts injected into web servers to obtain persistent remote access through simple HTTP requests issued from web browsers. Through webshells, attackers can remotely access confidential data and execute system commands. Threat actors use webshells as an initial foothold to compromise the network infrastructure and cause severe damage. The impact of webshell attacks is enormous, ranging from basic malicious actions, such as exposing sensitive data and uploading more dangerous malware, to causing denial of service and compromising external networks, which puts the whole infrastructure at risk. Webshell attacks are hazardous because they can persist for a long time without being noticed by inexperienced administrators and ordinary malware scanners. In the literature, several machine learning-based models have been proposed for the detection of PHP webshells. In this paper, we propose a simple deep learner model and evaluate its ability to detect multi-language webshells. The aim is to highlight existing challenges in the detection of webshell attacks and outline the way forward. By analyzing source file scripts, the proposed model is designed to distinguish webshells from benign files. Due to the absence of benchmark datasets for webshell detection, we collected a reasonably sized dataset for the validation process. We compared the performance of the proposed model with recent state-of-the-art systems. We also experimented with source-code-based and opcode-based PHP detection models and examined the impact of near-duplicates in datasets.
Experimental results showed that: (1) the proposed deep learner outperforms all the tested systems for the four tested languages, PHP, JSP, ASP and ASPX, with an accuracy of more than 98.27%; (2) source-code-based detection models are more effective than opcode-based detection models for PHP webshells; (3) the presence of near-duplicates leads to higher but biased performance of webshell detection models; and (4) more attention should be paid to the detection of webshells that use advanced coding tricks such as letter slicing and code splitting.
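
Finding (3) concerns near-duplicate files in the dataset. As a rough illustration of how such files might be flagged before training, the sketch below compares files by the Jaccard similarity of their token sets. It is not the procedure used in the paper: the tokenizer, the 0.7 threshold, and the names `tokenize`, `jaccard`, and `near_duplicate_pairs` are all assumptions made for this example.

```python
from __future__ import annotations

import re
from itertools import combinations


def tokenize(source: str) -> set[str]:
    """Return the set of identifier-like and numeric tokens in a script."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", source))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(files: dict[str, str], threshold: float = 0.7):
    """List file-name pairs whose token sets overlap at or above the threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    tokens = {name: tokenize(text) for name, text in files.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(tokens), 2)
        if jaccard(tokens[x], tokens[y]) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical toy files: b.php differs from a.php only by a comment.
    files = {
        "a.php": "<?php $cmd = $_GET['cmd']; system($cmd); ?>",
        "b.php": "<?php $cmd = $_GET['cmd']; system($cmd); /* v2 */ ?>",
        "c.php": '<?php echo "Hello, world"; ?>',
    }
    print(near_duplicate_pairs(files))  # [('a.php', 'b.php')]
```

If flagged pairs end up on both sides of a train/test split, the test score partly measures memorization, which is one way to read the "higher but biased" performance reported above. The pairwise comparison is quadratic in the number of files, so a larger corpus would need a cheaper candidate-selection step.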

Data Availability Statement

The datasets generated and analyzed during the current study are publicly available in the Mendeley repository at https://dx.doi.org/10.17632/wt8m6bcwbr.2.

Notes

DAws Advanced Shell available at: https://github.com/dotcppfile/DAws.
GitHub link: https://github.com.
Note that VLD captures opcode arrays with additional information and parameters; opcodes are described in arrays as capital letters, and this enables their distinction from other information and parameters.
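
The note above says that opcodes can be told apart from the rest of a VLD dump because they are written in capital letters. A minimal sketch of that filtering step is given below. The dump layout in `SAMPLE` is a simplified, hypothetical excerpt rather than verbatim VLD output, and the function name `extract_opcodes`, the regular expressions, and the choice to strip single-quoted operands first are assumptions made for this example, not the authors' implementation.

```python
import re

# An opcode is assumed to be a run of two or more capital letters or
# underscores (e.g. INIT_FCALL), so single-letter column flags are ignored.
OPCODE = re.compile(r"\b[A-Z][A-Z_]+\b")
# Single-quoted operands are removed first so that upper-case string
# literals are not mistaken for opcodes.
QUOTED = re.compile(r"'[^']*'")


def extract_opcodes(vld_dump: str) -> list[str]:
    """Return the opcode names found in a VLD-style dump, in order."""
    opcodes: list[str] = []
    for line in vld_dump.splitlines():
        opcodes.extend(OPCODE.findall(QUOTED.sub("", line)))
    return opcodes


# Hypothetical, simplified dump used only to exercise the function.
SAMPLE = """\
   2     0  E >   ASSIGN                                   !0, 'LS'
   3     1        INIT_FCALL                               'system'
         2        SEND_VAR                                 !0
         3        DO_ICALL
   4     4      > RETURN                                   1
"""

if __name__ == "__main__":
    print(" ".join(extract_opcodes(SAMPLE)))
```

The resulting opcode sequence can then be treated like a sentence of tokens for feature extraction. A real dump also contains header lines and other fields, so the filter would likely need adjusting against actual VLD output.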

Author information

Authors and Affiliations

PI:MIS Laboratory, University of 8 May 1945 Guelma, BP 401, 24000, Guelma, Algeria
Abdelhakim Hannousse
Department of Mathematics and Science Computing, Larbi Tebessi University, BP 289, 12000, Tebessa, Algeria
Mohamed Cherif Nait-Hamoud
LRS Laboratory, Badji Mokhtar University, BP 12, 23000, Annaba, Algeria
Salima Yahiouche

Corresponding author

Correspondence to Abdelhakim Hannousse.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any study with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

1.1 List of sources used for dataset collection

Sources used for collecting Webshells

1. https://github.com/tennc/webshell
2. https://github.com/JohnTroony/php-webshells
3. https://github.com/xl7dev/webshell
4. https://github.com/tutorial0/webshell
5. https://github.com/bartblaze/PHP-backdoors
6. https://github.com/BlackArch/webshells
7. https://github.com/nikicat/web-malware-collection
8. https://github.com/fuzzdb-project/fuzzdb
9. https://github.com/lcatro/PHP-webshell-Bypass-WAF
10. https://github.com/linuxsec/indoxploit-shell
11. https://github.com/b374k/b374k
12. https://github.com/LuciferoO/webshell-collector
13. https://github.com/malwares/webshell
14. https://github.com/tanjiti/webshell-Sample
15. https://github.com/JoyChou93/webshell
16. https://github.com/webshellpub/awsome-webshell
17. https://github.com/xypiie/webshell
18. https://github.com/leett1/Programe/
19. https://github.com/lhlsec/webshell
20. https://github.com/ysrc/webshell-sample
21. https://github.com/feihong-cs/JspMaster-Deprecated
22. https://github.com/threedr3am/JSP-Webshells

Sources used for collecting normal files

1. https://github.com/WordPress/WordPress
2. https://github.com/yiisoft/yii2
3. https://github.com/johnshen/PHPcms
4. https://github.com/joomla/joomla-cms
5. https://github.com/laravel/laravel
6. https://github.com/learnstartup/4tweb
7. https://github.com/phpmyadmin/phpmyadmin
8. https://github.com/rainrocka/xinhu
9. https://github.com/octobercms/october
10. https://github.com/alkacon/opencms-core
11. https://github.com/craftcms/cms
12. https://github.com/croogo/croogo
13. https://github.com/doorgets/CMS
14. https://github.com/smarty-php/smarty
15. https://github.com/source-trace/phpcms
16. https://github.com/symfony/symfony
17. https://github.com/typecho/typecho
18. https://github.com/leett1/Programe/
19. https://github.com/rpeterclark/aspunit
20. https://github.com/dluxem/LiberumASP
21. https://github.com/aspLite/aspLite
22. https://github.com/coldstone/easyasp
23. https://github.com/amasad/sane
24. https://github.com/sextondb/ClassicASPUnit
25. https://github.com/ASP-Ajaxed/asp-ajaxed
26. https://www.codewithc.com
27. https://www.kashipara.com

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hannousse, A., Nait-Hamoud, M.C. & Yahiouche, S. A deep learner model for multi-language webshell detection. Int. J. Inf. Secur. 22, 47–61 (2023). https://doi.org/10.1007/s10207-022-00615-5

Published: 18 October 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10207-022-00615-5
