SIFAST: An Efficient Unix Shell Embedding Framework for Malicious Detection

  • Conference paper
Information Security (ISC 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14411))

Abstract

Unix Shell is a powerful tool for system developers and engineers, but it poses serious security risks when cybercriminals use it to execute malicious scripts. These scripts can compromise servers, steal confidential data, or cause system crashes. Detecting and preventing malicious scripts is therefore an important task for intrusion detection systems. In this paper, we propose a novel framework, called SIFAST, for embedding and detecting malicious Unix Shell scripts. Our framework combines Smooth Inverse Frequency (SIF) and Abstract Syntax Tree (AST) techniques to rapidly convert Unix Shell commands and scripts into vectors that capture their semantic and syntactic features. These vectors can then be fed to various downstream machine learning models for classification or anomaly detection. In a comparison with other embedding methods across multiple downstream detection models, we demonstrate that SIFAST significantly improves both accuracy and efficiency. We also provide a labeled dataset of normal and abnormal Unix commands and scripts, collected from various open-source data. We hope to make a humble contribution to the field of intrusion detection systems by offering a solution for identifying malicious scripts in Unix Shell.

References

  1. Different Linux Commands and Utilities Commonly Used by Attackers. https://www.uptycs.com/blog/linux-commands-and-utilities-commonly-used-by-attackers

  2. Evasive techniques used by malicious shell scripts on different unix systems. https://www.uptycs.com/blog/evasive-techniques-used-by-malicious-linux-shell-scripts

  3. LOLBAS. https://lolbas-project.github.io/

  4. Tree-sitter Using Parsers. https://tree-sitter.github.io/tree-sitter/using-parsers

  5. What Is a Reverse Shell | Examples & Prevention Techniques | Imperva

  6. GTFOBins (2022). https://gtfobins.github.io/

  7. Living Off the Land: How to Defend Against Malicious Use of Legitimate Utilities (2022). https://threatpost.com/living-off-the-land-malicious-use-legitimate-utilities/177762/

  8. Al-Janabi, M., Altamimi, A.M.: A comparative analysis of machine learning techniques for classification and detection of Malware. In: 2020 21st International Arab Conference on Information Technology (ACIT), pp. 1–9 (2020). https://doi.org/10.1109/ACIT50332.2020.9300081

  9. Alahmadi, A., Alkhraan, N., BinSaeedan, W.: MPSAutodetect: a malicious powershell script detection model based on stacked denoising auto-encoder. Comput. Secur. 116, 102658 (2022). https://doi.org/10.1016/j.cose.2022.102658

  10. Andrew, Y., Lim, C., Budiarto, E.: Mapping Linux shell commands to MITRE ATT&CK using NLP-based approach. In: 2022 International Conference on Electrical Engineering and Informatics (ICELTICs), pp. 37–42 (2022). https://doi.org/10.1109/ICELTICs56128.2022.9932097

  11. Boffa, M., Milan, G., Vassio, L., Drago, I., Mellia, M., Ben Houidi, Z.: Towards NLP-based processing of honeypot logs. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 314–321 (2022). https://doi.org/10.1109/EuroSPW55150.2022.00038

  12. Bohannon, D., Holmes, L.: Revoke-Obfuscation: PowerShell Obfuscation Detection Using Science (2017)

  13. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information (2017)

  14. Chai, H., Ying, L., Duan, H., Zha, D.: Invoke-Deobfuscation: AST-based and semantics-preserving deobfuscation for powershell scripts. In: 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 295–306 (2022). https://doi.org/10.1109/DSN53405.2022.00039

  15. Elmasry, W., Akbulut, A., Zaim, A.H.: Deep learning approaches for predictive masquerade detection. Secur. Commun. Netw. 2018, e9327215 (2018). https://doi.org/10.1155/2018/9327215

  16. Fang, Y., Huang, C., Zeng, M., Zhao, Z., Huang, C.: JStrong: malicious JavaScript detection based on code semantic representation and graph neural network. Comput. Secur. 118, 102715 (2022). https://doi.org/10.1016/j.cose.2022.102715

  17. Fang, Y., Zhou, X., Huang, C.: Effective method for detecting malicious PowerShell scripts based on hybrid features. Neurocomputing 448, 30–39 (2021). https://doi.org/10.1016/j.neucom.2021.03.117

  18. Feng, Z., et al.: CodeBERT: a pre-trained model for programming and natural languages (2020). https://doi.org/10.48550/arXiv.2002.08155

  19. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.552

  20. Goudie, M.: The Rise of “Living off the Land” Attacks | CrowdStrike (2019). https://www.crowdstrike.com/blog/going-beyond-malware-the-rise-of-living-off-the-land-attacks/

  21. Hendler, D., Kels, S., Rubin, A.: Detecting malicious powershell commands using deep neural networks. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 187–197. ASIACCS ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3196494.3196511

  22. Hendler, D., Kels, S., Rubin, A.: AMSI-based detection of malicious powershell code using contextual embeddings. In: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 679–693. ASIA CCS ’20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3320269.3384742

  23. Hussain, Z., Nurminen, J., Mikkonen, T., Kowiel, M.: Command Similarity Measurement Using NLP (2021). https://doi.org/10.4230/OASIcs.SLATE.2021.13

  24. Kidwai, A., et al.: A comparative study on shells in Linux: a review. Mater. Today Proc. 37, 2612–2616 (2021). https://doi.org/10.1016/j.matpr.2020.08.508

  25. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning, vol. 32, pp. II-1188–II-1196. ICML’14, JMLR.org, Beijing, China (2014)

  26. Lin, X.V., Wang, C., Zettlemoyer, L., Ernst, M.D.: NL2Bash: a corpus and semantic parser for natural language interface to the Linux operating system (2018). arXiv:1802.08979 [cs]

  27. Liu, C., et al.: Code execution with pre-trained language models (2023). https://doi.org/10.48550/arXiv.2305.05383

  28. Liu, W., Mao, Y., Ci, L., Zhang, F.: A new approach of user-level intrusion detection with command sequence-to-sequence model. J. Intell. Fuzzy Syst. 38(5), 5707–5716 (2020). https://doi.org/10.3233/JIFS-179659

  29. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv:1301.3781 [cs]

  30. Mimura, M., Tajiri, Y.: Static detection of malicious PowerShell based on word embeddings. Internet Things 15, 100404 (2021). https://doi.org/10.1016/j.iot.2021.100404

  31. Ongun, T., et al.: Living-off-the-land command detection using active learning. In: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 442–455. RAID ’21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3471621.3471858

  32. Peng, H., Mou, L., Li, G., Liu, Y., Zhang, L., Jin, Z.: Building program vector representations for deep learning. In: Zhang, S., Wirsing, M., Zhang, Z. (eds.) KSEM 2015. LNCS (LNAI), vol. 9403, pp. 547–553. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25159-2_49

  33. Rathore, H., Agarwal, S., Sahay, S.K., Sewak, M.: Malware detection using machine learning and deep learning. In: Mondal, A., Gupta, H., Srivastava, J., Reddy, P.K., Somayajulu, D.V.L.N. (eds.) BDA 2018. LNCS, vol. 11297, pp. 402–411. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04780-1_28

  34. Rebootuser: LinEnum (2023)

  35. Rousseau, A.: Hijacking .NET to Defend PowerShell (2017). https://doi.org/10.48550/arXiv.1709.07508

  36. Song, J., Kim, J., Choi, S., Kim, J., Kim, I.: Evaluations of AI-based malicious PowerShell detection with feature optimizations. ETRI J. 43(3), 549–560 (2021). https://doi.org/10.4218/etrij.2020-0215

  37. Swissky: Payloads All The Things (2023)

  38. Trizna, D.: Shell language processing: Unix command parsing for machine learning (2021). arXiv:2107.02438 [cs]

  39. Tsai, M.H., Lin, C.C., He, Z.G., Yang, W.C., Lei, C.L.: PowerDP: de-obfuscating and profiling malicious PowerShell commands with multi-label classifiers. IEEE Access 11, 256–270 (2023). https://doi.org/10.1109/ACCESS.2022.3232505

  40. Zhai, H., et al.: Masquerade detection based on temporal convolutional network. In: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 305–310 (2022). https://doi.org/10.1109/CSCWD54268.2022.9776088

Acknowledgements

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC02030400) and the Scaling Program of the Institute of Information Engineering, CAS (Grant Nos. E3Z0041101 and E3Z0191101).

Author information

Corresponding author

Correspondence to Rong Yang.

Appendix

  • TF-IDF This is the most basic method for generating sentence vectors in natural language processing, and it is also the sentence embedding method mentioned by Trizna [38]. It first computes a TF-IDF representation of each word in a sentence and then sums the word representations to form the TF-IDF representation of the sentence.

  • Doc2Vec [25] This method trains a sentence vector jointly with the vectors of the words in the sentence, so that the sentence vector is produced directly. It is similar to Word2Vec, but adds sentence vectors to the joint training.

  • MPSAutodetect [9] This is a deep learning framework for detecting malicious PowerShell scripts. It uses a character-based embedding, feeds the embedding into a denoising autoencoder to extract features, and finally passes the features to a classifier.

  • SimCSE [19] This is an advanced pretrained sentence embedding model based on contrastive learning. We employ the unsupervised learning part of SimCSE to learn code representations for shell scripts. Although SimCSE demands powerful hardware, which makes it impractical to embed in Unix, we still include it as a comparison baseline to illustrate the gap between our model and conventional deep learning models.
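
The TF-IDF baseline in the list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names (tokenize, fit_idf, tfidf_vector), and the three-command toy corpus are all assumptions made for illustration. Each token contributes its TF-IDF weight at its own vocabulary index, and summing these contributions gives one vector per command.

```python
import math
from collections import Counter


def tokenize(command):
    # Assumption: naive whitespace split; a real shell tokenizer is more involved.
    return command.split()


def fit_idf(corpus):
    """Compute a smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for command in corpus:
        # Count each token at most once per command.
        doc_freq.update(set(tokenize(command)))
    # Assumption: add-one smoothing keeps every weight positive and finite.
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_vector(command, idf, vocab):
    """Sum the per-token TF-IDF contributions into one vocabulary-sized vector."""
    counts = Counter(tokenize(command))
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        if tok in vocab:  # tokens unseen during fitting are ignored
            vec[vocab[tok]] += (count / total) * idf[tok]
    return vec


# Hypothetical toy corpus, for illustration only.
corpus = ["ls -la /tmp", "cat /etc/passwd", "ls /etc"]
idf = fit_idf(corpus)
vocab = {tok: i for i, tok in enumerate(sorted(idf))}
vec = tfidf_vector("ls /etc", idf, vocab)
```

The resulting vector has one dimension per vocabulary token, so its size grows with the corpus, and tokens that appear in fewer commands receive larger weights.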

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, S. et al. (2023). SIFAST: An Efficient Unix Shell Embedding Framework for Malicious Detection. In: Athanasopoulos, E., Mennink, B. (eds) Information Security. ISC 2023. Lecture Notes in Computer Science, vol 14411. Springer, Cham. https://doi.org/10.1007/978-3-031-49187-0_4

  • DOI: https://doi.org/10.1007/978-3-031-49187-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49186-3

  • Online ISBN: 978-3-031-49187-0

  • eBook Packages: Computer Science, Computer Science (R0)
