Abstract
A webshell is a web script containing a malicious code fragment that attackers can use to launch web attacks. Identifying whether a web script contains malicious code fragments is therefore of great significance to web security. However, the flexibility of scripting languages such as PHP gives attackers the opportunity to obfuscate scripts, which makes it difficult for traditional rule-based webshell detectors to find malicious code fragments. Deep learning offers new ideas for webshell detection and improves detector performance, but the performance of deep learning-based detectors depends on both feature engineering and the model. The feature representations and models adopted by existing methods fail to capture the syntactic and semantic features of webshell scripts. To address these problems, we design a new code representation, called script sequence, based on the characteristics of webshells, and we introduce a new pre-training task to improve the model's understanding of the syntactic information in webshell code. These contributions lead to the design and implementation of the Malicious Script Detector (MSDetector). To evaluate MSDetector, we present a new PHP webshell dataset. Experimental results show that MSDetector achieves a higher F1 score and accuracy than other approaches on this dataset.
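To make the obfuscation problem concrete, the following minimal sketch shows how a signature-style check can miss a trivially rewritten script. It is an illustration only and not part of MSDetector: the `SIGNATURES` list, the `rule_based_detect` function, and both PHP snippets are hypothetical examples assumed for this sketch, and real rule-based detectors use far larger rule sets.

```python
import re

# Hypothetical signatures: a dangerous call applied directly to request input.
SIGNATURES = [
    re.compile(r"\beval\s*\(\s*\$_(POST|GET|REQUEST)\b"),
    re.compile(r"\bsystem\s*\(\s*\$_(POST|GET|REQUEST)\b"),
]


def rule_based_detect(script: str) -> bool:
    """Return True if any signature matches the raw script text."""
    return any(sig.search(script) for sig in SIGNATURES)


# Assumed example scripts, not drawn from the paper's dataset.
# The plain form calls the function by its literal name.
plain = "<?php system($_GET['c']); ?>"
# The obfuscated form builds the same function name by string concatenation
# and calls it through a variable, so the literal name never appears.
obfuscated = "<?php $f = 'sys' . 'tem'; $f($_GET['c']); ?>"

print(rule_based_detect(plain))
print(rule_based_detect(obfuscated))
```

Under these assumptions the first script matches a signature and the second does not, although both behave the same when executed. This is the kind of gap that motivates learning syntactic and semantic features instead of matching surface text.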
Supported by the National Natural Science Foundation of China (No.: 61873069).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Cheng, B., Guo, Y., Ren, Y., Yang, G., Xu, G. (2022). MSDetector: A Static PHP Webshell Detection System Based on Deep-Learning. In: Aït-Ameur, Y., Crăciun, F. (eds) Theoretical Aspects of Software Engineering. TASE 2022. Lecture Notes in Computer Science, vol 13299. Springer, Cham. https://doi.org/10.1007/978-3-031-10363-6_11
DOI: https://doi.org/10.1007/978-3-031-10363-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10362-9
Online ISBN: 978-3-031-10363-6
eBook Packages: Computer Science, Computer Science (R0)