Leveraging ontologies and machine-learning techniques for malware analysis into Android permissions ecosystems

doi:10.1016/j.cose.2018.07.013

Computers & Security

Volume 78, September 2018, Pages 429-453

https://doi.org/10.1016/j.cose.2018.07.013

Abstract

Smartphones form a complex application ecosystem with a myriad of components, properties, and interfaces that produce an intricate relationship network. Given the intrinsic complexity of this system, we make two main contributions. First, we devise a methodology to systematically determine and analyze the complex relationship network among components, properties, and interfaces associated with the permission mechanism in Android ecosystems. Second, we investigate whether it is possible to identify characteristics shared by malware samples at this high level of abstraction that could be leveraged to unveil their presence. We propose an ontology-based framework to model the relationships between application and system elements, together with a machine-learning approach to analyze the complex network that arises therefrom. We represent the ontological model for the considered Android ecosystem, with 4570 apps, as a graph with some 55,000 nodes and 120,000 edges. Experiments show that a classifier operating on top of this complex representation achieves an accuracy of 88% and a precision of 91%, and identifies 24 features that correspond to 70 important graph nodes related to malware activity, a useful result for security analysis.

Introduction

Smartphones have become ubiquitous computing devices worldwide. A recent Ericsson Mobility Report (Carson et al., 2016) indicated that smartphones currently represent 55% of all mobile subscriptions globally. The report further projects that the number of unique mobile subscribers will reach 6.1 billion by 2022, covering roughly 75% of the world’s population. Despite the multitude of device models and the availability of several operating systems for smartphones, the Android operating system currently holds 88% of the market (Sui, 2016).

Mobile devices are increasingly being used for activities that directly impact social, work, and financial environments; as such, they have become a primary target for cyber-criminals. A study published by Lee and Talbot (2016) concluded that, in the United Kingdom, the top ten uses for smartphones include social networking, emailing, banking, and shopping, with similar patterns across other developed countries. In the eyes of a cyber-criminal, social networks are a repository of the smartphone user’s personal information, work-related emails are a potential source of sensitive information, and banking apps are the gateway to the user’s finances (Bojjagani and Sastry, 2016; Chanajitt, Viriyasitavat, and Choo, 2016; Kadir, Stakhanova, Ghorbani, Lee, Zhang, Chen, 2013).

As a preventive security measure against unauthorized use or access, the Android ecosystem provides a permission system for its applications (apps) (Enck et al., 2009). The permission system informs the user of which system resources and information an app uses prior to installation, so that the user can make an informed choice on whether or not to install that app. However, Kelley et al. (2012) and Felt et al. (2012b) have shown flaws in the use of the permission system as a preventive security measure. In particular, users tend not to pay attention to permissions and, more worryingly, permission systems sometimes fail to help users make security-related decisions properly. Furthermore, developers tend to overprivilege applications, requesting more permissions than necessary in anticipation of future releases (Felt, Chin, Hanna, Song, and Wagner, 2011; Felt, Egelman, Finifter, Akhawe, and Wagner, 2012a). Moreover, the Android documentation has flaws in mapping permissions to system calls, as described by the developers of PScout (Au et al., 2012), a tool that intercepts system calls and tracks which permissions the operating system checks, producing documentation of the permissions actually verified on each system-call access.

As a matter of fact, malicious apps can control seemingly harmless system resources to indirectly exploit a vulnerability in another app (Kelley et al., 2012). Given that the Android ecosystem has over 1.7 million apps and 235 different permissions (Olmstead and Atkinson, 2015), the task of mapping and analyzing relationships among permissions, malware, and benign apps is daunting and cannot be performed manually by a human curator. Likewise, any methodology developed for this task must be extensible, automatic, and dynamic, so that new characteristics can be taken into consideration on the fly as apps, malware, and permissions are continuously added to or removed from the ecosystem.

Given the above, application testing in Android devices faces important challenges (Wang and Alshboul, 2015) that must be addressed. Within this context, the present contribution proposes two methods (described in Section 3): the first maps relationships in the Android ecosystem using ontologies, and the second is a machine-learning-based solution that analyzes malware features from the resulting network of relations and dependencies. We validate the effectiveness of these methods in Section 4 and show that they are able to determine the most important nodes related to malware activity, representing an important contribution to smartphone security.

Section snippets

Concepts and related work

Before we move on to the new methods we propose in this paper, we present a brief introduction to Android security, ontologies, and feature engineering using Bags of Graphs as well as the random forests classifier, which are necessary concepts to understand the paper. The expert reader can go directly to Section 3, where the new methods are introduced.
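As a rough illustration of the ontology concept mentioned above, relationships between apps, permissions, and resources can be written as subject-predicate-object triples and read as a directed graph. The sketch below is only a toy under stated assumptions: the predicate names (`usesPermission`, `protects`), the `app:`/`perm:`/`res:` identifiers, and the helper functions are hypothetical and are not taken from the paper's ontology.

```python
from collections import defaultdict

# Hypothetical predicate names; the paper's actual vocabulary is not shown here.
USES_PERMISSION = "usesPermission"
PROTECTS = "protects"


def build_graph(triples):
    """Turn (subject, predicate, object) triples into a node set and adjacency map."""
    nodes = set()
    adjacency = defaultdict(set)
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))
        # Each distinct (predicate, object) pair is one labeled outgoing edge.
        adjacency[subject].add((predicate, obj))
    return nodes, adjacency


def count_edges(adjacency):
    """Count labeled edges across all subjects."""
    return sum(len(targets) for targets in adjacency.values())


# Toy data, invented for illustration only.
triples = [
    ("app:flashlight", USES_PERMISSION, "perm:CAMERA"),
    ("app:flashlight", USES_PERMISSION, "perm:INTERNET"),
    ("app:messenger", USES_PERMISSION, "perm:INTERNET"),
    ("perm:CAMERA", PROTECTS, "res:camera_device"),
]

nodes, adjacency = build_graph(triples)
print(len(nodes), "nodes,", count_edges(adjacency), "edges")
```

Shared nodes such as `perm:INTERNET` are what connect otherwise unrelated apps, which is the kind of structure a graph-level analysis can exploit.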

Proposed method

In this work, our primary goal is to analyze which permissions and resources are related to malicious apps in the Android ecosystem as represented in the Android manifests. We rely solely upon application manifest XML files as our source of information. The reasoning for this choice is that such files are publicly available and do not require any reverse engineering, code execution monitoring, or complicated code-level analysis to detect the presence of malware in a system, as described in
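To make the manifest-only approach concrete, the following minimal sketch lists the permissions declared in a manifest using Python's standard library. It assumes the manifest is already available as plain-text XML (inside an APK the manifest is stored as compiled binary XML and has to be decoded first); the sample manifest, the package name, and the `extract_permissions` helper are hypothetical and do not reproduce the authors' pre-processing.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in Android manifests.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def extract_permissions(manifest_xml):
    """Return the sorted, de-duplicated permission names requested by a manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    return sorted({
        elem.get(name_attr)
        for elem in root.iter("uses-permission")
        if elem.get(name_attr)
    })


# Invented sample manifest, for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.READ_SMS" />
    <application android:label="Demo" />
</manifest>
"""

permissions = extract_permissions(SAMPLE_MANIFEST)
print(permissions)
```

Each returned name could then become one node linked to the app in a relationship graph.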

Experiments and results

In the following sections, we report on the experiments conducted to verify the method proposed in Section 3 with real-world data. In Section 4.1, we describe the metrics used to evaluate the performance of classifiers; in Section 4.2, we explain the Android ecosystem used in the experiments, which was transformed by the pre-processing method described in Sections 3.1 and 3.2 into the feature dataset. The full dataset was broken down into two partitions, one for the fitting process and another
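The abstract reports accuracy and precision; the exact definitions used in Section 4.1 are not visible in this snippet, so the sketch below assumes the standard ones, with malware as the positive class. The label vectors are toy values invented for illustration and are unrelated to the paper's dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Fraction of samples flagged as positive that really are positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
print(f"accuracy={acc:.2f} precision={prec:.2f}")
```

Computing these metrics only on the held-out partition, never on the partition used for fitting, keeps the estimate from being optimistic.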

Conclusion and future work

In this paper, we have introduced two new methods to address the problem of mapping the relationships and characteristics of malicious software in smartphones. We provided an extensible framework for mapping the analyzed elements in the Android system using ontologies, as well as a random forest-based method for automatically extracting meaningful information from the ontological map obtained from the new mapping algorithm. Experimental results in the considered Android ecosystem showed that

Acknowledgment

We gratefully acknowledge the financial support of the Intel Strategic Research Alliance (Grant #440850/2013-4), the National Council for Scientific and Technological Development – CNPq (Grant #302224/2015-7), the São Paulo Research Foundation (FAPESP) (DéjàVu Grant #2017/12646-3), and the Coordination for the Improvement of Higher Education Personnel – Capes (DeepEyes grant), as well as Cambridge Trusts-CAPES grant BEX 9407-11-1.

Luiz C. Navarro is an electronic engineer with specialization in digital systems, graduated in 1982 from Polytechnic School of the University of Sao Paulo, with extensive experience in the market of software development, system integration and software architecture. Currently, he is a master’s student in Computer Science at the Institute of Computing of the University of Campinas (UNICAMP), focusing research in systems security, Android security, digital forensics, ontologies and machine

References (72)

V. Singh et al. Revisiting security ontologies. Int J Comput Sci Issues (2014)
K.A. Talha et al. Apk auditor: permission-based android malware detection system. Digit Investig (2015)
Y. Wang et al. Mobile security testing approaches and challenges. Proceedings of the 2015 first conference on mobile and secure services (MOBISECSERV) (2015)
M. Zhang et al. Semantics-aware android malware classification using weighted contextual api dependency graphs. Proceedings of the 2014 ACM SIGSAC conference on computer and communications security (CCS ’14) (2014)
A. Altmann et al. Permutation importance: a corrected feature importance measure. Bioinformatics (2010)
K. Olmstead et al. Apps permissions in the Google Play Store. Technical Report (2015)
K.W.Y. Au et al. Pscout: analyzing the android permission specification. Proceedings of the 2012 ACM conference on computer and communications security (CCS ’12) (2012)
D. Beckett (W3C). RDF 1.1 N-triples. Technical Report (2014)
D. Beckett (W3C) et al. RDF 1.1 turtle – terse RDF triple language. Technical Report (2014)
S. Bojjagani et al. Stamba: security testing for android mobile banking apps
L. Breiman. Bagging predictors. Mach Learn (1996)
L. Breiman. Out-of-bag estimation. Technical Report (1996)
L. Breiman. Random forests. Mach Learn (2001)
L. Breiman et al. Classification and regression trees (1984)
Carson S., Furuskär A., Jonsson P., Kronander J., Lindberg P., Ludwig R., Öhman K., Sehti J.S.. Ericsson mobility report....
R. Caruana et al. An empirical comparison of supervised learning algorithms. Proceedings of the 23rd international conference on machine learning (ICML ’06) (2006)
R. Chanajitt et al. Forensic analysis and security assessment of android m-banking apps. Aust J Forensic Sci (2016)
Community V.. Apk malware samples acquired from a torrent. 2017a. Accessed:...
Community V.. Virustotal public api v2.0. 2017b. Accessed:...
A. Criminisi et al. Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends® Comput Graph Vis (2012)
S. Das et al. Semantics-based online malware detection: towards efficient real-time protection against malware. IEEE Trans Inf Forensics Secur (2016)
Eddy M.. Mobile threat monday: Android apps hide windows malware. 2014. Accessed:...
K. Eilbeck et al. The sequence ontology: a tool for the unification of genome annotations. Genome Biol (2005)
N. Elenkov. Android security internals: an in-depth guide to Android’s security architecture (2014)
W. Enck et al. Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. Proceedings of the 9th USENIX conference on operating systems design and implementation (OSDI’10) (2010)
W. Enck et al. Understanding android security. IEEE Secur Privacy (2009)
P. Faruki et al. Android security: a survey of issues, malware penetration and defenses. IEEE Commun Surv Tutor (2015)
A.P. Felt et al. Android permissions demystified. Proceedings of the 18th ACM conference on computer and communications security (CCS ’11) (2011)
A.P. Felt et al. How to ask for permission. Proceedings of the 7th USENIX conference on hot topics in security (HotSec’12) (2012)
A.P. Felt et al. Android permissions: user attention, comprehension, and behavior. Proceedings of the eighth symposium on usable privacy and security (SOUPS ’12) (2012)
Fenz S.. Ontology-and Bayesian-based information security risk management....
S. Fenz et al. Formalizing information security knowledge. Proceedings of the 4th international symposium on information, computer, and communications security (ASIACCS ’09) (2009)
Google. Google play. 2017. Accessed:...
T. Gruber. Ontology. Encyclopedia of database systems (2009)
N. Guarino et al. What is an ontology? Handbook on ontologies (2009)
M. Hartung et al. Recent advances in schema and ontology evolution

Alexandre K. W. Navarro is a Machine Learning Ph.D. student at the University of Cambridge Engineering Department. His major academic interests lie in approximate inference, probabilistic graphical models and machine learning. He also holds an M.Sc. and a B.Sc. in Chemical Engineering from the University of Campinas (UNICAMP) with an emphasis in control systems, optimization and simulation.

Andre Gregio is an Assistant Professor at the Federal University of Parana, Brazil (UFPR). His research interests include several aspects of computer and network security, such as countermeasures against malicious code, security data visualization and analysis, and mobile security. Prof. Gregio is funded by the Brazilian National Council for Scientific and Technological Development (CNPq) and the Brazilian Ministry of Health. In 2017, Prof. Gregio was awarded the Google Latin America Research Award for his proposal on automatic detection of concept drift in malware classifiers.

Anderson Rocha is an associate professor at the Institute of Computing, University of Campinas. His main interests include Reasoning for Complex Data, Digital Forensics and Machine Intelligence. He is an IEEE Senior Member, an elected affiliate member of the Brazilian Academy of Sciences (ABC) and of the IEEE Information Forensics and Security Technical Committee. He is a Microsoft Research Faculty Fellow, a Google Research Faculty Fellow and a Tan Chin Tuan Fellow. Finally, he is currently the principal investigator of a number of research projects in partnership with public funding agencies and multinational companies, having already licensed several patents.

Ricardo Dahab is an associate professor at the University of Campinas’ (UNICAMP) Institute of Computing. He holds a Master’s degree in Computer Science from UNICAMP and a Ph.D. in Combinatorics and Optimization from the University of Waterloo. His teaching and research interests are in Cryptography and Information Security. In academic research, his main contributions are in elliptic curve-based cryptographic methods, some of which have become industry standards. Prof. Dahab has also been engaged in several R&D projects in partnership with industry and other research institutions, which have yielded successful products, among them the official HSM (Hardware Security Module) supporting the Brazilian PKI’s root certification authority. He has been an active member in joint efforts by the security community in Brazil and Latin America to promote and consolidate the area in the region, having served on several committees and organized events such as the 2009 Brazilian Symposium on Information and Systems Security (SBSeg), the Advanced School of Cryptography in 2011, the Latincrypt School in 2011 and 2013, the Cryptology and Network Security Symposium (CANS 2013), and PKC 2018, among others. He has also contributed to the creation and expansion in Brazil and Latin America of ACM’s International Collegiate Programming Contest, of which he is Latin America’s Director of Contests. He was one of the recipients of UNICAMP’s 2011 Zeferino Vaz academic excellence award and of UNICAMP’s 2013 Inventors Award.
