Abstract
Word documents are among the most widely used document types; millions of people use them every day to share information over the internet, mostly as email attachments. According to Symantec's Internet Security Threat Report 2019, 48% of malicious email attachments in 2018 were MS Office files. There is therefore an urgent need for fast and accurate detection of Word document malware. In this work, we propose a method to detect malicious Office files with high accuracy. We first apply static analysis and achieve a highest detection accuracy of 97.13% using a Random Forest classifier. We then apply dynamic analysis, which yields information vital for detecting obfuscated and packed malware, where the static approach is less effective; with the dynamic approach, the Random Forest classifier achieves a highest detection accuracy of 99.11%. Finally, we combine the static and dynamic approaches into a hybrid method for detecting Word document malware. The hybrid method achieves the highest detection accuracy, 99.57%, again using a Random Forest classifier.
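The abstract does not say how the static and dynamic features are merged in the hybrid method. As a minimal sketch, assuming the two feature sets are simply concatenated per document, the combination and the accuracy metric could look as follows. The function names and the feature names (`macro_count`, `processes_spawned`, `registry_writes`) are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def combine_features(static: Dict[str, float],
                     dynamic: Dict[str, float]) -> Dict[str, float]:
    """Merge static and dynamic features of one document.

    Assumption: the hybrid view is a plain concatenation. Names are
    prefixed so a static and a dynamic feature can never collide.
    """
    merged = {f"static:{name}": value for name, value in static.items()}
    merged.update({f"dynamic:{name}": value for name, value in dynamic.items()})
    return merged


def to_vector(features: Dict[str, float], columns: Sequence[str]) -> List[float]:
    """Lay the features out in a fixed column order; absent features become 0.0."""
    return [float(features.get(column, 0.0)) for column in columns]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical feature values for a single document.
hybrid = combine_features({"macro_count": 2}, {"processes_spawned": 3})
columns = ["static:macro_count", "dynamic:registry_writes", "dynamic:processes_spawned"]
row = to_vector(hybrid, columns)
```

Rows built this way could then be passed to any Random Forest implementation, for example scikit-learn's `RandomForestClassifier`, and the predictions scored with `accuracy`.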
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Khan, R., Kumar, N., Handa, A., Shukla, S.K. (2021). Malware Detection in Word Documents Using Machine Learning. In: Anbar, M., Abdullah, N., Manickam, S. (eds) Advances in Cyber Security. ACeS 2020. Communications in Computer and Information Science, vol 1347. Springer, Singapore. https://doi.org/10.1007/978-981-33-6835-4_22
DOI: https://doi.org/10.1007/978-981-33-6835-4_22
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6834-7
Online ISBN: 978-981-33-6835-4
eBook Packages: Computer Science, Computer Science (R0)