Applying Natural Language Processing for detecting malicious patterns in Android applications

doi:10.1016/j.fsidi.2021.301270

Forensic Science International: Digital Investigation

Volume 39, December 2021, 301270

https://doi.org/10.1016/j.fsidi.2021.301270

Highlights

• Natural Language Processing techniques, when applied to an intermediate representation, can help find malicious patterns in applications.
• Malware analysis intermediate language makes the code accessible to Natural Language Processing.
• Check semantic similarities through control flow patterns of applications to find malicious code.
• Tools for future researchers interested in discovering and guiding the forensic analysis of malicious codes.

Abstract

As malicious code grows in quantity and sophistication, it is becoming harder to discover and analyze. Modern NLP (Natural Language Processing) techniques have improved significantly and are used in practice to accomplish various tasks. Recently, many research works have applied NLP to find malicious patterns in Android and Windows apps. In this paper, we exploit this fact and apply NLP techniques to an intermediate representation (MAIL, Malware Analysis Intermediate Language) of Android apps to build a similarity index model, named SIMP. We use SIMP to find malicious patterns in Android apps. MAIL provides control flow patterns to enhance malware analysis and makes the code accessible to NLP techniques for checking semantic similarities. To apply NLP, we treat a MAIL program as one document. The control flow patterns in this program, when divided into specific blocks (words), become sentences. We apply TF-IDF and Bag-of-Words over these control flow patterns to build SIMP. Our proposed model, when tested with real malware and benign Android apps using different validation methods, achieved an MCC (Matthews Correlation Coefficient) ≥ 0.94 between the true and predicted values. This indicates that the model can predict whether a new sample is malware or benign with a high success rate.
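The document/sentence/word analogy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SIMP implementation: the token names (`ASSIGN`, `CALL`, and so on) are placeholders rather than real MAIL output, the smoothed IDF weighting is one common variant, and cosine similarity is assumed as the comparison measure because the abstract does not say which one SIMP uses.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Each document is a list of sentences (control flow paths), and each
    sentence is a list of tokens (blocks). Names are illustrative only.
    """
    # Bag-of-Words: flatten every document into term counts, ignoring order.
    bags = [Counter(tok for sentence in doc for tok in sentence) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for bag in bags for term in bag)

    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed variant).
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in bag.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical programs: each is a list of paths, each path a list of block tokens.
known_sample = [["ASSIGN", "CALL", "JUMP"], ["ASSIGN", "CONTROL", "CALL"]]
variant = [["ASSIGN", "CALL", "JUMP"], ["ASSIGN", "CALL"]]
unrelated = [["TEST", "LIBCALL"], ["TEST", "STACK"]]

vec_known, vec_variant, vec_unrelated = tfidf_vectors([known_sample, variant, unrelated])
score_variant = cosine(vec_known, vec_variant)
score_unrelated = cosine(vec_known, vec_unrelated)
```

In this toy setup the variant shares most of its tokens with the known sample and therefore scores higher than the unrelated program, which shares none. A real index would be built from many labeled samples and compared against a threshold.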

Section snippets

Introduction and motivation

Due to the ubiquitous nature of mobile phones, we have recently seen a dramatic increase in mobile malware. Android, being the most popular mobile phone operating system (OS), is host to most of these malicious apps. Mobile malware programs increased by 24 million from 2018 to 2019 (McAfee Mobile Threat Report, 2020), and in 2019 companies spent on average 2.4 million USD defending against malware (The ultimate list of cybe, 2019). We need to develop methods to defend and minimize these attacks

Related work

In this section, we briefly highlight recent research works that have applied NLP techniques for detecting malicious patterns in Android and Windows apps.

Overview of the system

After converting an APK into a MAIL program, the system proposed in this paper generates the control flow graph (CFG) of each function in the MAIL program. Each CFG contains one or more execution paths of the MAIL function. We extract these paths from each CFG and call them MAIL CFG Paths, i.e., all the CFG paths in a MAIL program. Extracting these paths can be compared with extracting sentences from a natural language text. We then build a similarity index with these MAIL CFG
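As a rough illustration of the path extraction step, the sketch below enumerates entry-to-exit paths in a small graph. It rests on assumptions that are not in the snippet: the CFG is modeled as a plain dictionary of successor lists, the block labels are hypothetical, and loops are cut by never revisiting a block on the same path. The paper may handle loops and path limits differently.

```python
def cfg_paths(cfg, entry):
    """Enumerate acyclic paths from `entry` in a CFG given as {block: [successors]}.

    A path ends at a block with no successors, or where every successor
    would revisit a block already on the path (an assumed way to cut loops).
    """
    paths = []
    stack = [(entry, [entry])]  # (current block, path taken so far)
    while stack:
        block, path = stack.pop()
        successors = [s for s in cfg.get(block, []) if s not in path]
        if not successors:
            paths.append(path)
            continue
        # Push in reverse so the first successor is explored first.
        for succ in reversed(successors):
            stack.append((succ, path + [succ]))
    return paths


# Hypothetical function CFG: a branch at B0 that rejoins at B3.
diamond = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
diamond_paths = cfg_paths(diamond, "B0")

# Hypothetical CFG with a back edge from B1 to B0 (a loop).
looped = {"B0": ["B1"], "B1": ["B0", "B2"], "B2": []}
looped_paths = cfg_paths(looped, "B0")
```

Each returned list plays the role of one sentence in the NLP analogy, with the blocks on the path as its words.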

Empirical evaluation

We carried out an empirical study to evaluate and validate the performance of our proposed model. All the experiments were carried out on a desktop PC running Windows 8.1, equipped with an Intel Core(TM) i7-4510U @ 2 GHz and 8 GB of RAM. In this section, we present the dataset, evaluation metrics, threshold computation, validation experiments, obtained results, and analysis (comparison with other works and limitations).
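The abstract reports results as a Matthews Correlation Coefficient, so a short sketch of how that metric is computed from a confusion matrix may help. The formula is the standard one; the function name and the example counts are hypothetical and are not results from the paper. Returning 0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/tn: correctly classified malware/benign samples.
    fp/fn: benign flagged as malware / malware missed.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, chosen only to show the range of the metric.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and unlike accuracy it stays informative when the malware and benign classes are unbalanced.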

Conclusion

Modern NLP techniques have greatly improved and are used in practice for accomplishing various tasks, such as machine translation, summarization of larger texts, and question answering. In this paper, we have exploited this fact and applied NLP techniques to build a similarity index model of MAIL CFG Paths, which is used to find malicious patterns in Android apps. We have demonstrated through experiments that our model outperforms many other such models. Our proposed model, when tested with

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References (37)

S. Alam et al.
DroidNative: automating and optimizing detection of android native code malware variants
Comput. Secur.
(2017)
T. Fawcett
An introduction to ROC analysis
Pattern Recogn. Lett.
(2006)
B.W. Matthews
Comparison of the predicted and observed secondary structure of t4 phage lysozyme
Biochim. Biophys. Acta Protein Struct.
(1975)
G. Nguyen et al.
A heuristics approach to mine behavioural data logs in mobile malware detection system
Data Knowl. Eng.
(2018)
A.V. Aho et al.
Compilers: Principles, Techniques, and Tools
(2006)
S. Alam et al.
MAIL: malware analysis intermediate language - a step towards automating and optimizing malware detection
S. Anju et al.
Malware detection using assembly code and control flow graph optimization
W.B. Cavnar et al.
N-gram-based text categorization
S. Cesare et al.
WIRE – a formal intermediate language for binary analysis
D. Chicco et al.
The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation
BMC Genom.
(2020)
M. Christodorescu et al.
Semantics-aware malware detection
W.G. Cochran
The χ2 test of goodness of fit
Ann. Math. Stat.
(1952)
S. Deerwester et al.
Improving information-retrieval with latent semantic indexing
dex2jar – Tools to work with Android .dex and java .class files, https://sourceforge.net/projects/dex2jar...
T. Dullien et al.
REIL: a platform-independent intermediate representation of disassembled code for static code analysis
Z.S. Harris
Distributional structure
Word
(1954)
R. Ito et al.
Detecting unknown malware from ascii strings with natural language processing techniques
E.B. Karbab et al.
Design: dynamic fingerprinting for the automatic detection of android malware

