Malware Detection in Word Documents Using Machine Learning

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1347)

Abstract

Word documents are among the most widely used document types; millions of people use them every day to share information over the internet, mostly as email attachments. According to Symantec's Internet Security Threat Report 2019, 48% of malicious email attachments in 2018 were MS Office files. There is therefore an urgent need for fast and accurate detection of Word document malware. In this work, we propose a method to detect malicious Office files with high accuracy. We first apply static analysis and achieve a top detection accuracy of 97.13% using a Random Forest classifier. We then apply dynamic analysis, which provides vital information for detecting obfuscated and packed malware, where the static approach is less effective. The dynamic approach achieves its highest detection accuracy, 99.11%, with the Random Forest classifier. Finally, we combine the static and dynamic approaches into a hybrid method for detecting Word document malware. The hybrid method achieves the highest detection accuracy, 99.57%, again using a Random Forest classifier.
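
The hybrid step described in the abstract, where static and dynamic features are combined before classification, might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two feature sets are joined, and the feature names (`macro_count`, `processes_spawned`, and so on), the document identifiers, and the `merge_features` helper are all hypothetical.

```python
# Hypothetical sketch: fuse static and dynamic feature dicts per document.
# Feature names and values are invented for illustration only.

def merge_features(static, dynamic):
    """Join per-document static and dynamic features into one vector each.

    Only documents present in both inputs are kept. Columns are ordered
    static-first, then dynamic, each group sorted by feature name.
    """
    static_cols = sorted({k for feats in static.values() for k in feats})
    dynamic_cols = sorted({k for feats in dynamic.values() for k in feats})
    columns = (["static:" + c for c in static_cols]
               + ["dynamic:" + c for c in dynamic_cols])
    rows = {}
    for doc_id in sorted(static.keys() & dynamic.keys()):
        s, d = static[doc_id], dynamic[doc_id]
        # Missing features default to 0 so every vector has the same length.
        rows[doc_id] = ([s.get(c, 0) for c in static_cols]
                        + [d.get(c, 0) for c in dynamic_cols])
    return columns, rows


static_feats = {
    "doc_a": {"macro_count": 2, "has_autoopen": 1},
    "doc_b": {"macro_count": 0},
}
dynamic_feats = {
    "doc_a": {"processes_spawned": 3, "registry_writes": 5},
    "doc_b": {"processes_spawned": 0},
    "doc_c": {"processes_spawned": 1},  # no static features, so it is dropped
}

columns, rows = merge_features(static_feats, dynamic_feats)
```

The fused vectors in `rows`, together with benign/malicious labels, could then be passed to any Random Forest implementation, for example scikit-learn's `RandomForestClassifier`; which library the authors used is not stated here.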

Author information

Corresponding author

Correspondence to Anand Handa.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Khan, R., Kumar, N., Handa, A., Shukla, S.K. (2021). Malware Detection in Word Documents Using Machine Learning. In: Anbar, M., Abdullah, N., Manickam, S. (eds) Advances in Cyber Security. ACeS 2020. Communications in Computer and Information Science, vol 1347. Springer, Singapore. https://doi.org/10.1007/978-981-33-6835-4_22

  • DOI: https://doi.org/10.1007/978-981-33-6835-4_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6834-7

  • Online ISBN: 978-981-33-6835-4

  • eBook Packages: Computer Science (R0)
