Abstract
Webshell attacks are becoming more prevalent every year. Webshells are malicious scripts injected into web servers with the aim of gaining persistent remote access through simple HTTP requests issued from web browsers. Through webshells, attackers can remotely access confidential data and execute system commands. Threat actors use webshells as an initial foothold to compromise the network infrastructure and cause severe damage. The impact of webshell attacks ranges from basic malicious actions, such as exposing sensitive data and uploading more dangerous malware, to causing denial of service and compromising external networks, which puts the whole infrastructure at risk. Webshell attacks are hazardous because they can persist for a long time without being noticed by inexperienced administrators and ordinary malware scanners. In the literature, several machine learning-based models have been proposed for the detection of PHP webshells. In this paper, we propose a simple deep learner model and evaluate its ability to detect multi-language webshells. The aim is to highlight existing challenges in the detection of webshell attacks and outline the way forward. By analyzing source file scripts, the proposed model is designed to distinguish webshells from benign files. Due to the absence of benchmark datasets for webshell detection, we collected a reasonably sized dataset for the validation process. We compared the performance of the proposed model with recent state-of-the-art systems. We also experimented with source-code-based and opcode-based PHP detection models and examined the impact of near-duplicates in datasets.
Experimental results showed that: (1) the proposed deep learner outperforms all the experimented systems for the four tested languages (PHP, JSP, ASP and ASPX) with more than 98.27% accuracy, (2) source-code-based detection models are more effective than opcode-based detection models for PHP webshells, (3) the presence of near-duplicates leads to higher but biased performance of webshell detection models, and (4) more attention should be paid to the detection of webshells that use advanced coding tricks such as letter slicing and code splitting.










Data Availability Statement
The datasets generated and analyzed during the current study are publicly available in the Mendeley repository at https://dx.doi.org/10.17632/wt8m6bcwbr.2.
Notes
DAws Advanced Shell available at: https://github.com/dotcppfile/DAws.
GitHub link: https://github.com.
Note that VLD captures opcode arrays together with additional information and parameters; opcodes appear in the arrays in capital letters, which distinguishes them from the other information and parameters.
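As an illustration of the note above, the following minimal Python sketch filters opcode names out of a VLD-style dump by keeping only capitalized tokens. The sample dump, its column layout, the helper name `extract_opcodes` and the three-character minimum length are assumptions made for this example; they are not taken from the paper and may need adjusting for real VLD output.

```python
import re

# Hypothetical excerpt of a VLD dump; the column layout is an assumption
# made for illustration and may differ from real VLD output.
SAMPLE_VLD_DUMP = """\
line     #* E I O op                  fetch   ext  return  operands
---------------------------------------------------------------------
   2     0  E >   INIT_FCALL                               'system'
         1        FETCH_R             global       ~0      '_GET'
         2        FETCH_DIM_R                      ~1      ~0, 'cmd'
         3        SEND_VAL                                 ~1
         4        DO_ICALL
   3     5      > RETURN                                   1
"""

# Assumption: an opcode is a run of capital letters and underscores that is
# at least three characters long, so single-letter flags (E, I, O) are skipped.
OPCODE_PATTERN = re.compile(r"\b[A-Z][A-Z_]{2,}\b")

# Quoted operands may contain capital letters, so they are removed first.
QUOTED_OPERAND = re.compile(r"'[^']*'")


def extract_opcodes(dump: str) -> list[str]:
    """Return the capitalized opcode names found in a VLD-style dump, in order."""
    opcodes = []
    for line in dump.splitlines():
        line = QUOTED_OPERAND.sub("", line)
        opcodes.extend(OPCODE_PATTERN.findall(line))
    return opcodes


if __name__ == "__main__":
    print(" ".join(extract_opcodes(SAMPLE_VLD_DUMP)))
```

The resulting opcode sequence can then be treated as a plain text document and fed to any text vectorizer, in the same way as a source-code file.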
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any study with human participants or animals performed by any of the authors.
Appendices
1.1 List of sources used for dataset collection
Sources used for collecting webshells: 22 sources (the entries are missing from this copy).
Sources used for collecting normal files: 27 sources (the entries are missing from this copy).
About this article
Cite this article
Hannousse, A., Nait-Hamoud, M.C. & Yahiouche, S. A deep learner model for multi-language webshell detection. Int. J. Inf. Secur. 22, 47–61 (2023). https://doi.org/10.1007/s10207-022-00615-5
DOI: https://doi.org/10.1007/s10207-022-00615-5