Abstract
As the number of Android applications (apps) increases dramatically, users face a serious problem in finding apps relevant to their needs. There is therefore a strong demand for app search engines and recommendation services, for which developing an accurate similarity method is a challenging issue. In contrast to malware detection, far less effort has been devoted to computing the similarity of apps. Furthermore, all existing methods use features obtained only from app stores, such as descriptions and ratings, which can be inaccurate, vary across stores, and be affected by language barriers; they entirely neglect useful information that clearly captures an app’s functionalities and behaviors and can be mined from the apps themselves, such as API calls and manifest information. In this paper, we propose an effective method called SimAndro to compute the similarity of apps. SimAndro extracts features from information obtained only from the apps themselves and the Android platform, without using information from third-party sources such as app stores. SimAndro performs both feature extraction and similarity computation, using API calls, manifest information, package name, and strings as features. To compute the similarity score of an app pair, a separate similarity score is computed for each feature, and a weighted linear combination of these four scores, with weights set by an automatic weighting scheme based on TreeRankSVM, is taken as the final similarity score. The results of extensive experiments with three real-world datasets and a dataset constructed by human experts demonstrate the effectiveness of SimAndro.
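The combination step described in the abstract can be illustrated with a minimal sketch. It is not the SimAndro implementation: the per-feature measure (Jaccard overlap), the set-based feature representation, the function names, and the weight values are all assumptions made for illustration. In the method itself, the weights come from the TreeRankSVM-based weighting scheme, and the per-feature scores may be computed differently.

```python
from typing import Dict, Set

# The four feature names follow the abstract; everything else is hypothetical.
FEATURES = ("api_calls", "manifest", "package_name", "strings")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets (an assumed stand-in for a per-feature score)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(
    app_a: Dict[str, Set[str]],
    app_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted linear combination of the four per-feature similarity scores."""
    return sum(weights[f] * jaccard(app_a[f], app_b[f]) for f in FEATURES)


# Hypothetical weights; in the paper they are learned automatically.
example_weights = {"api_calls": 0.4, "manifest": 0.3, "package_name": 0.1, "strings": 0.2}

# Toy apps with made-up feature values.
app_x = {
    "api_calls": {"sendMessage", "openSocket"},
    "manifest": {"android.permission.INTERNET"},
    "package_name": {"com", "chat"},
    "strings": {"hello"},
}
app_y = {
    "api_calls": {"sendMessage", "readFile"},
    "manifest": {"android.permission.INTERNET"},
    "package_name": {"com", "mail"},
    "strings": {"bye"},
}

score = combined_similarity(app_x, app_y, example_weights)
print(f"similarity = {score:.3f}")
```

Because the final score is linear in the per-feature scores, a weight of zero removes a feature entirely, and when the weights sum to one, two apps with identical features score exactly one.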
Notes
If the total number of methods referenced in an app exceeds 65536, the app is converted to multiple DEX files (Android developers site 2018).
We note that the app recommendation, app clustering, and malware detection topics are out of the scope of our paper. As a part of our future work, we plan to extensively study and evaluate the effectiveness of applying SimAndro to the aforementioned topics.
The translation of the app name is “thehyundai.com.”
As observed in our experimental results in Sect. 5, all similarity methods show lower accuracy with the amazon dataset than with the other datasets, since it contains more duplicate apps.
Similarity is normally defined as a concept between two objects (Lin et al. 2012); therefore, we use only a single app as the query.
The BBM free calls and messaging app is provided by BlackBerry Limited.
The Hike messenger is a free messaging app provided by Hike Ltd.
References
Android developers site. developer.android.com/studio/build/multidex.html, (December 2018)
Aafer Y, Du W, Yin H (2013) Droidapiminer: mining api-level features for robust malware detection in android. In: Proceedings of international conference on security and privacy in communication systems, pp 86–103
Airola A, Pahikkala T, Salakoski T (2011) Training linear ranking svms in linearithmic time using red-black trees. Pattern Recognit Lett 32(9):1328–1336
Arp D, Spreitzenbarth M, Gascon H, Rieck K (2014) Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the 21st international conference on network and distributed system security symposium, pp 1–12
Backurs A, Indyk P (2015) Edit distance cannot be computed in strongly subquadratic time (unless seth is false). In: Proceedings of the 47th annual ACM symposium on theory of computing, pp 51–58
Bhandari U, Sugiyama K, Datta A, Jindal R (2013) Serendipitous recommendation for mobile apps using item-item similarity graph. In: Proceedings of the 10th Asia information retrieval societies conference, pp 440–451
Chae D-K, Kim S-W, Cho S-J, Kim Y (2015) Effective and efficient detection of software theft via dynamic API authority vectors. J Syst Softw 110:1–9
Chen N, Hoi S, Li S, Xiao X (2015) Simapp: a framework for detecting similar mobile applications by online kernel learning. In: Proceedings of the 8th ACM international conference on web search and data mining, pp 305–314
Chen N, Hoi S, Li S, Xiao X (2016) Mobile app tagging. In: Proceedings of the 9th ACM international conference on web search and data mining, pp 63–72
Chiki NF, Rothenburger B, Gilles N (2008) Combining link and content information for scientific topics discovery. In: Proceedings of 20th IEEE international conference on tools with artificial intelligence, ICTAI, pp 211–214
Crussell J, Gibler C, Chen H (2012) Attack of the clones: detecting cloned applications on android markets. In: Proceedings of the European symposium on research in computer security, pp 37–54
Crussell J, Gibler C, Chen H (2016) Andarwin: scalable detection of android application clones based on semantics. IEEE Trans Mobile Comput 14(10):2007–2019
Demontis A, Melis M, Biggio B, Maiorca D, Arp D, Corona I (2017) Yes, machine learning can be more secure! a case study on android malware detection. IEEE Trans Dependable Secure Comput 1–14. https://doi.org/10.1109/TDSC.2017.2700270
Dalvik executable format. https://source.android.com/devices/tech/dalvik/dex-format, (December 2018)
Do Q, Martini B, Choo K-K (2015) Exfiltrating data from android devices. Comput Secur 48(C):74–91
Dutta B, Shinde JV (2017) Intuitionistic fuzzy clustering based segmentation of spine mr image. Int Res J Eng Technol 4(7):790–794
Faruki P, Bharmal A, Laxmi V, Ganmoor V, Gaur M (2015) Android security: a survey of issues, malware penetration, and defenses. IEEE Commun Surv Tutor 17(2):998–1022
Faruki P, Laxmi V, Bharmal A, Gaur MS, Ganmoor V (2015) Androsimilar: Robust signature for detecting variants of android malware. Inf Secur Appl 22:66–80
Feizollah A, Anuar NB, Salleh R, Abdul Wahab A (2015) A review on feature selection in mobile malware detection. Digit Investig 13(C):22–37
Hamedani MR, Kim S-W (2016) Simcc-at: a method to compute similarity of scientific papers with automatic parameter tuning. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, pp 1005–1008
Hamedani MR, Kim S (2017) Jacsim: an accurate and efficient link-based similarity measure in graphs. Inf Sci 414:203–224
Hamedani MR, Kim S-W, Kim D-J (2016) Simcc: a novel method to consider both content and citations for computing similarity of scientific papers. Inf Sci 334–335(C):273–292
Jang J-W, Kang H, Woo J, Aziz M, Kim HK (2015) Andro-autopsy: Anti-malware system based on similarity matching of malware and malware creator-centric information. Digit Investig 14:17–35
Kim Y, Cho S-J, Han S, You I (2018) A software classification scheme using binary level characteristics for efficient software filtering. Soft Comput 22(2):595–606
Ko J, Shim H, Kim D, Jeong Y-S, Cho S-j, Park M, Han S, Kim SB (2013) Measuring similarity of android applications via reversing and k-gram birthmarking. In: Proceedings of research in adaptive and convergent systems, pp 336–341
Lee K, Ban Y, Lee S (2017) Efficient depth enhancement using a combination of color and depth information. Sensors 17(7):1–27
Lee S, Dolby J, Ryu S (2016) Hybridroid: static analysis framework for android hybrid applications. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, pp 250–261
Levin J (2015) Android internals—a confectioner’s cookbook. vol I. Cambridge, MA, USA
Li M, Li Q, Long Y (2017) Representation learning of multiword expressions with compositionality constraint. In: Proceedings of the international conference on knowledge science, engineering and management, pp 507–519
Lin Z, Lyu MR, King I (2012) Matchsim: a novel similarity measure based on maximum neighborhood matching. Knowl Inf Syst 32(1):141–166
Magdy W, Jones GJF (2010) Pres: A score metric for evaluating recall-oriented information retrieval applications. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, pp 611–618
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Motta JM, Ladouceur J (2017) A CRF machine learning model reinforced by ontological knowledge for document summarization. In: Proceedings of the international conference artificial intelligence, pp 127–135
Narudin F, Feizollah A, Anuar N, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Comput Fusion Found Methodol Appl 20(1):343–357
Ng T (2016) Prefix distance between regular languages. In: Proceedings of the international conference on implementation and application of automata, pp 224–235
Rastogi V, Chen Y, Jiang X (2014) Catch me if you can: evaluating android anti-malware against transformation attacks. IEEE Trans Inf Forensics Secur 9(1):99–108
Sanz B, Santos I, Laorden C, Ugarte-Pedrero X, Bringas PGa (2012) On the automatic categorisation of android applications. In: Proceedings of the 9th annual IEEE consumer communications and networking conference-security and content protection, pp 149–153
Sarma B, Li N, Gates C, Potharaju R, Nita-Rotaru C, Molloy I (2012) Android permissions: a perspective combining risks and benefits. In: Proceedings of the 17th ACM symposium on access control models and technologies, pp 13–22
Sugiyama K, Kan M-Y (2013) Exploiting potential citation papers in scholarly paper recommendation. In: Proceedings of the 13th ACM/IEEE joint conference on digital libraries, pp 153–162
Wei J, He J, Kai C, Zhou Y, Tang Z (2017) Collaborative filtering and deep learning based recommendation system for cold start items. Expert Syst Appl 69(1):29–39
Wei T-E, Tyan H-R, Jeng A, Lee H-M, Liao H-Y, Wang J-C (2015) Droidexec: root exploit malware recognition against wide variability via folding redundant function-relation graph. In: Proceedings of the 17th international conference on advanced communication technology, pp 161–169
Wu D-J, Mao C-H, Wei T-E, Lee H-M, Wu K-P (2012) Droidmat: android malware detection through manifest and API calls tracing. In: Proceedings of the 7th Asia joint conference on information security, pp 62–96
Yerima S, Sezer S, McWilliams G, Igor M (2013) A new android malware detection approach using bayesian classification. In: Proceedings of the 27th IEEE international conference on advanced information networking and applications, pp 121–128
Yin P, Luo P, Lee W-C, Wang M (2013) App recommendation: a contest between satisfaction and temptation. In: Proceedings of the 6th ACM international conference on web search and data mining, pp 395–404
Zhang M, Duan Y, Yin H, Zhao Z (2014) Semantics-aware android malware classification using weighted contextual API dependency graphs. In: Proceedings of the ACM SIGSAC conference on computer and communications security, pp 1105–1116
Zheng M, Sun M, Lui J (2013) Droid analytics: a signature based analytic system to collect, extract, analyze and associate android malware. In: Proceedings of the 12th IEEE international conference on trust, security and privacy in computing and communications, pp 163–171
Zhou W, Zhou Y, Grace M, Jian X, Zou S (2013) Fast, scalable detection of piggybacked mobile applications. In: Proceedings of the 3rd ACM conference on data and application security and privacy, pp 185–196
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2015R1D1A1A02061946), and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (no. 2018R1A2B2004830).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by A. Di Nola.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Hamedani, M.R., Kim, G. & Cho, Sj. SimAndro: an effective method to compute similarity of Android applications. Soft Comput 23, 7569–7590 (2019). https://doi.org/10.1007/s00500-019-03755-4
DOI: https://doi.org/10.1007/s00500-019-03755-4