Applying Natural Language Processing for detecting malicious patterns in Android applications

https://doi.org/10.1016/j.fsidi.2021.301270

Highlights

  • Natural Language Processing techniques, when applied to an intermediate representation, can help find malicious patterns in applications.

  • Malware analysis intermediate language makes the code accessible to Natural Language Processing.

  • Check semantic similarities through control flow patterns of applications to find malicious code.

  • Tools for future researchers interested in discovering and guiding the forensic analysis of malicious codes.

Abstract

As malicious code grows in quantity and sophistication, it is becoming more difficult to discover and analyze. Modern NLP (Natural Language Processing) techniques have improved significantly and are used in practice to accomplish various tasks. Recently, many research works have applied NLP to find malicious patterns in Android and Windows apps. In this paper, we exploit this fact and apply NLP techniques to an intermediate representation (MAIL – Malware analysis intermediate language) of Android apps to build a similarity index model, named SIMP. We use SIMP to find malicious patterns in Android apps. MAIL provides control flow patterns to enhance the malware analysis and makes the code accessible to NLP techniques for checking semantic similarities. To apply NLP, we consider a MAIL program as one document. The control flow patterns in this program, when divided into specific blocks (words), become sentences. We apply TF-IDF and Bag-of-Words over these control flow patterns to build SIMP. Our proposed model, when tested with real malware and benign Android apps using different validation methods, achieved an MCC (Matthews Correlation Coefficient) ≥ 0.94 between the true and predicted values. This indicates that the model can classify a new sample as either malware or benign with a high success rate.
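To make the document, sentence, and word analogy concrete, the following is a minimal Python sketch of a TF-IDF similarity lookup over Bag-of-Words vectors. It is not the SIMP implementation: the block labels (ASSIGN, CALL, and so on), the smoothed IDF formula, the use of cosine similarity as the score, and all function names are assumptions chosen for illustration, and each sample is flattened into a single list of block labels.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF vector (dict token -> weight) per token list."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each token once per doc
    # Smoothed IDF (an assumption) so tokens seen in every doc keep a weight > 0.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # the bag of words for this document
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def vectorize(tokens, idf):
    """Project a new sample onto the indexed vocabulary."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def most_similar(query_tokens, vectors, idf):
    """Return (index, score) of the indexed sample closest to the query."""
    q = vectorize(query_tokens, idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical indexed samples, each a flat list of block labels.
INDEXED = [
    ["ASSIGN", "CALL", "JUMP", "CALL"],
    ["ASSIGN", "ASSIGN", "TEST", "JUMP"],
    ["LOCK", "HALT"],
]
VECTORS, IDF = tfidf_vectors(INDEXED)
```

In a detector built along these lines, the best score for a new sample would be compared against a threshold to decide between malware and benign; how that threshold is chosen is not covered by this sketch.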

Section snippets

Introduction and motivation

Due to the ubiquitous nature of mobile phones, we have recently seen a dramatic increase in mobile malware. Android, being the most popular mobile phone operating system (OS), is host to most of these malicious apps. Mobile malware programs increased by 24 million from 2018 to 2019 (McAfee Mobile Threat Report, 2020), and in 2019 companies spent on average 2.4 million USD defending against malware (The ultimate list of cybe, 2019). We need to develop methods to defend against and minimize these attacks

Related work

In this section, we briefly highlight recent research works that have applied NLP techniques for detecting malicious patterns in Android and Windows apps.

Overview of the system

The system proposed in this paper, after converting an APK into a MAIL program, generates the CFG (control flow graph) of each function in the MAIL program. Each CFG contains one or more execution paths of the MAIL function. We extract these paths from each CFG and call them MAIL CFG Paths, i.e., all the CFG paths in a MAIL program. Extracting these paths can be compared with extracting sentences from a natural-language text. We then build a similarity index with these MAIL CFG
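The path extraction step can be sketched as a depth-first traversal. The Python below is an illustration only, under several assumptions that the text does not confirm: a CFG is given as a dict mapping each block to its successors, a path ends at a block with no successors, and back edges are skipped so that loops do not produce unbounded paths. The block names and the function name are hypothetical.

```python
def cfg_paths(cfg, entry):
    """Return every acyclic path from `entry` to a block with no successors.

    `cfg` maps a block name to the list of its successor block names.
    """
    paths = []
    stack = [(entry, [entry])]  # (current block, path taken so far)
    while stack:
        node, path = stack.pop()
        successors = cfg.get(node, [])
        if not successors:
            paths.append(path)  # reached an exit block: one complete "sentence"
            continue
        for nxt in successors:
            if nxt not in path:  # skip back edges so loops are not unrolled
                stack.append((nxt, path + [nxt]))
    return paths


# Hypothetical function with one branch: B0 goes to B1 or B2, both reach B3.
DIAMOND = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}

# Hypothetical function with a loop: B1 can jump back to B0 or exit through B2.
LOOP = {"B0": ["B1"], "B1": ["B2", "B0"], "B2": []}
```

Each returned path is an ordered list of blocks, which is the form a sentence of words would take in the document analogy. Real CFGs can have exponentially many paths, so a practical extractor would likely need a bound on path count or length.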

Empirical evaluation

We carried out an empirical study to evaluate and validate the performance of our proposed model. All the experiments were carried out on a desktop PC running Windows 8.1 equipped with an Intel Core(TM) i7-4510U @ 2 GHz with 8 GB of RAM. In this section, we present the dataset, evaluation metrics, threshold computation, validation experiments, obtained results, and analysis (comparison with other works and limitations).
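The abstract reports results as an MCC between true and predicted values. As a reminder of how that metric is obtained from a binary confusion matrix, here is a short Python sketch of the standard formula. The function name and the convention of returning 0.0 when the denominator is zero are choices made for this example, not details taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    tp/tn: malware/benign samples classified correctly.
    fp/fn: benign flagged as malware / malware missed.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and unlike plain accuracy it stays informative when the malware and benign classes are unbalanced.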

Conclusion

Modern NLP techniques have greatly improved and are used in practice for various tasks, such as machine translation, summarization of larger texts, and question answering. In this paper, we have exploited this fact and applied NLP techniques to build a similarity index model of MAIL CFG Paths, which is used to find malicious patterns in Android apps. We have demonstrated through experiments that our model outperforms many other such models. Our proposed model, when tested with

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
