Malware Detection in Word Documents Using Machine Learning

Khan, Riya; Kumar, Nitesh; Handa, Anand; Shukla, Sandeep K.

doi:10.1007/978-981-33-6835-4_22

Conference paper
First Online: 05 February 2021

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1347)

Abstract

Word documents are among the most widely used document types; millions of people use them every day to share information over the internet, mostly as email attachments. According to Symantec's Internet Security Threat Report 2019, 48% of malicious email attachments in 2018 were MS Office files. There is therefore an urgent need for fast and accurate detection of Word document malware. In this work, we propose a method to detect malicious Office files with high accuracy. We first apply static analysis and achieve a highest detection accuracy of 97.13% using a Random Forest classifier. We then apply dynamic analysis, which yields vital information for detecting obfuscated and packed malware, where the static approach is less effective. With the dynamic approach, we achieve a highest detection accuracy of 99.11%, again with the Random Forest classifier. Finally, we combine the static and dynamic approaches into a hybrid method for detecting Word document malware. Our hybrid method achieves the highest detection accuracy, 99.57%, using the Random Forest classifier.
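
The hybrid idea in the abstract, combining static and dynamic evidence about one document into a single input for a classifier, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not list the paper's features, so the keyword list, the API call names, and the helper functions below are hypothetical and are not the authors' feature set or pipeline. The resulting vector is the kind of input that could then be passed to a Random Forest classifier.

```python
from collections import Counter
from typing import List

# Hypothetical vocabularies chosen for illustration only; the abstract does
# not state which static or dynamic features the authors extract.
STATIC_KEYWORDS = ("AutoOpen", "Shell", "CreateObject", "URLDownloadToFile")
DYNAMIC_API_CALLS = ("CreateProcessInternalW", "RegSetValueExA", "InternetOpenUrlA")


def static_features(macro_source: str) -> List[int]:
    """Count case-insensitive occurrences of each keyword in macro source text."""
    lowered = macro_source.lower()
    return [lowered.count(keyword.lower()) for keyword in STATIC_KEYWORDS]


def dynamic_features(api_trace: List[str]) -> List[int]:
    """Count how often each monitored API call appears in a sandbox trace."""
    counts = Counter(api_trace)
    return [counts[name] for name in DYNAMIC_API_CALLS]


def hybrid_features(macro_source: str, api_trace: List[str]) -> List[int]:
    """Concatenate the static and dynamic vectors into one hybrid vector."""
    return static_features(macro_source) + dynamic_features(api_trace)


# Toy inputs: a short macro and the API calls it (hypothetically) triggered.
macro = 'Sub AutoOpen()\n Shell "cmd.exe"\n Set o = CreateObject("WScript.Shell")\nEnd Sub'
trace = ["CreateProcessInternalW", "RegSetValueExA", "CreateProcessInternalW"]
vector = hybrid_features(macro, trace)
```

Concatenation is only one plausible way to build a hybrid representation; it keeps the static and dynamic feature groups separate within the vector, so a tree-based classifier can split on either kind of evidence.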

Author information

Authors and Affiliations

Indian Institute of Information Technology and Management Kerala, Kazhakkoottam, Kerala, India
Riya Khan
C3i Center, Department of CSE, Indian Institute of Technology, Kanpur, India
Nitesh Kumar, Anand Handa & Sandeep K. Shukla

Corresponding author

Correspondence to Anand Handa.

Editor information

Editors and Affiliations

National Advanced IPv6 Centre, Universiti Sains Malaysia, Penang, Malaysia
Mohammed Anbar
Hodeidah University, Hodeidah, Yemen
Nibras Abdullah
National Advanced IPv6 Centre, Universiti Sains Malaysia, Penang, Malaysia
Selvakumar Manickam

About this paper

Cite this paper

Khan, R., Kumar, N., Handa, A., Shukla, S.K. (2021). Malware Detection in Word Documents Using Machine Learning. In: Anbar, M., Abdullah, N., Manickam, S. (eds) Advances in Cyber Security. ACeS 2020. Communications in Computer and Information Science, vol 1347. Springer, Singapore. https://doi.org/10.1007/978-981-33-6835-4_22

DOI: https://doi.org/10.1007/978-981-33-6835-4_22
Published: 05 February 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6834-7
Online ISBN: 978-981-33-6835-4
eBook Packages: Computer Science, Computer Science (R0)