SimAndro: an effective method to compute similarity of Android applications

  • Foundations
Soft Computing

Abstract

As the number of Android applications (apps) increases dramatically, users find it increasingly difficult to locate apps relevant to their needs. There is therefore strong demand for app search engines and recommendation services, for which developing an accurate similarity method is a challenging issue. In contrast to malware detection, far less effort has been devoted to computing the similarity of apps. Furthermore, all existing methods use features obtained only from app stores, such as descriptions and ratings, which can be inaccurate, vary across stores, and be affected by language barriers; they entirely neglect useful information that captures an app’s functionalities and behaviors and can be mined from the app itself, such as API calls and manifest information. In this paper, we propose an effective method called SimAndro to compute the similarity of apps. It extracts features based only on information obtained from the apps themselves and the Android platform, without using information from third-party sources such as app stores. SimAndro performs both feature extraction and similarity computation, using API calls, manifest information, package name, and strings as features. To compute the similarity score of an app pair, a separate similarity score is computed for each feature, and a weighted linear combination of these four scores, with weights set by an automatic weighting scheme based on TreeRankSVM, is taken as the final similarity score. The results of extensive experiments with three real-world datasets and a dataset constructed by human experts demonstrate the effectiveness of SimAndro.
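
The weighted linear combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which per-feature similarity measure SimAndro uses, so Jaccard similarity over feature sets is assumed here, and the weights are fixed, hypothetical values rather than weights learned with TreeRankSVM. The names `FEATURES`, `jaccard`, and `combined_similarity` are likewise illustrative.

```python
from typing import Dict, Set

# The four feature types named in the abstract.
FEATURES = ("api_calls", "manifest", "package_name", "strings")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (an assumed per-feature measure)."""
    if not a and not b:
        return 0.0  # assumption: two empty feature sets carry no evidence
    return len(a & b) / len(a | b)


def combined_similarity(
    app_x: Dict[str, Set[str]],
    app_y: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted linear combination of the four per-feature scores."""
    return sum(weights[f] * jaccard(app_x[f], app_y[f]) for f in FEATURES)


# Hypothetical weights; in the paper they are set automatically.
weights = {"api_calls": 0.4, "manifest": 0.3, "package_name": 0.1, "strings": 0.2}

# Two made-up apps, each described by one set per feature.
app_a = {
    "api_calls": {"sendTextMessage", "getDeviceId", "openConnection"},
    "manifest": {"INTERNET", "CAMERA"},
    "package_name": {"com", "example", "chat"},
    "strings": {"hello", "send"},
}
app_b = {
    "api_calls": {"getDeviceId", "openConnection", "getLastKnownLocation"},
    "manifest": {"INTERNET"},
    "package_name": {"com", "example", "talk"},
    "strings": {"send"},
}

score = combined_similarity(app_a, app_b, weights)
print(f"similarity(app_a, app_b) = {score:.3f}")
```

Because the weights in this sketch sum to 1 and each per-feature score lies in [0, 1], the combined score also lies in [0, 1], which makes scores for different app pairs directly comparable.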

Notes

  1. http://play.google.com/apps.

  2. https://www.amazon.com.

  3. https://apkpure.com.

  4. If the total number of methods referenced in an app exceeds 65536, the app is converted to multiple DEX files (Android developers site 2018).

  5. https://github.com/androguard.

  6. http://dedexer.sourceforge.net/.

  7. We note that the app recommendation, app clustering, and malware detection topics are out of the scope of our paper. As a part of our future work, we plan to extensively study and evaluate the effectiveness of applying SimAndro to the aforementioned topics.

  8. https://github.com/JesusFreke/smali.

  9. https://github.com/Sable/soot.

  10. https://theory.stanford.edu/~aiken/moss/.

  11. https://developer.android.com.

  12. Except for the size and developer, which are used as features for similarity computation in Chen et al. (2015) and Chen et al. (2016).

  13. The translation of the app name is “thehyundai.com.”

  14. As observed in our experimental results in Sect. 5, all similarity methods show lower accuracy with the amazon dataset than with the other datasets, since it contains more duplicate apps.

  15. Similarity is normally defined as a concept between two objects (Lin et al. 2012); therefore, we use only a single app as the query.

  16. The BBM free calls and messaging app is provided by BlackBerry Limited.

  17. The Hike messenger is a free messaging app provided by Hike Ltd.

References

  • Android developers site. developer.android.com/studio/build/multidex.html, (December 2018)

  • Aafer Y, Du W, Yin H (2013) Droidapiminer: mining api-level features for robust malware detection in android. In: Proceedings of international conference on security and privacy in communication systems, pp 86–103

  • Airola A, Pahikkala T, Salakoski T (2011) Training linear ranking svms in linearithmic time using red-black trees. Pattern Recognit Lett 32(9):1328–1336

  • Arp D, Spreitzenbarth M, Gascon H, Rieck K (2014) Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the 21st international conference on network and distributed system security symposium, pp 1–12

  • Backurs A, Indyk P (2015) Edit distance cannot be computed in strongly subquadratic time (unless seth is false). In: Proceedings of the 47th annual ACM symposium on theory of computing, pp 51–58

  • Bhandari U, Sugiyama K, Datta A, Jindal R (2013) Serendipitous recommendation for mobile apps using item-item similarity graph. In: Proceedings of the 10th Asia information retrieval societies conference, pp 440–451

  • Chae D-K, Kim S-W, Cho S-J, Kim Y (2015) Effective and efficient detection of software theft via dynamic API authority vectors. J Syst Softw 110:1–9

  • Chen N, Hoi S, Li S, Xiao X (2015) Simapp: a framework for detecting similar mobile applications by online kernel learning. In: Proceedings of the 8th ACM international conference on web search and data mining, pp 305–314

  • Chen N, Hoi S, Li S, Xiao X (2016) Mobile app tagging. In: Proceedings of the 9th ACM international conference on web search and data mining, pp 63–72

  • Chiki NF, Rothenburger B, Gilles N (2008) Combining link and content information for scientific topics discovery. In: Proceedings of 20th IEEE international conference on tools with artificial intelligence, ICTAI, pp 211–214

  • Crussell J, Gibler C, Chen H (2012) Attack of the clones: detecting cloned applications on android markets. In: Proceedings of the European symposium on research in computer security, pp 37–54

  • Crussell J, Gibler C, Chen H (2016) Andarwin: scalable detection of android application clones based on semantics. IEEE Trans Mobile Comput 14(10):2007–2019

  • Demontis A, Melis M, Biggio B, Maiorca D, Arp D, Corona I (2017) Yes, machine learning can be more secure! a case study on android malware detection. IEEE Trans Dependable Secure Comput 1–14. https://doi.org/10.1109/TDSC.2017.2700270

  • Dalvik executable format. https://source.android.com/devices/tech/dalvik/dex-format, (December 2018)

  • Do Q, Martini B, Choo K-K (2015) Exfiltrating data from android devices. Comput Secur 48(C):74–91

  • Dutta B, Shinde JV (2017) Intuitionistic fuzzy clustering based segmentation of spine mr image. Int Res J Eng Technol 4(7):790–794

  • Faruki P, Bharmal A, Laxmi V, Ganmoor V, Gaur M (2015) Android security: a survey of issues, malware penetration, and defenses. IEEE Commun Surv Tutor 17(2):998–1022

  • Faruki P, Laxmi V, Bharmal A, Gaur MS, Ganmoor V (2015) Androsimilar: robust signature for detecting variants of android malware. Inf Secur Appl 22:66–80

  • Feizollah A, Anuar NB, Salleh R, Abdul Wahab A (2015) A review on feature selection in mobile malware detection. Digit Investig 13(C):22–37

  • Hamedani MR, Kim S-W (2016) Simcc-at: a method to compute similarity of scientific papers with automatic parameter tuning. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, pp 1005–1008

  • Hamedani MR, Kim S (2017) Jacsim: an accurate and efficient link-based similarity measure in graphs. Inf Sci 414:203–224

  • Hamedani MR, Kim S-W, Kim D-J (2016) Simcc: a novel method to consider both content and citations for computing similarity of scientific papers. Inf Sci 334–335(C):273–292

  • Jang J-W, Kang H, Woo J, Aziz M, Kim HK (2015) Andro-autopsy: Anti-malware system based on similarity matching of malware and malware creator-centric information. Digit Investig 14:17–35

  • Kim Y, Cho S-J, Han S, You I (2018) A software classification scheme using binary level characteristics for efficient software filtering. Soft Comput 22(2):595–606

  • Ko J, Shim H, Kim D, Jeong Y-S, Cho S-j, Park M, Han S, Kim SB (2013) Measuring similarity of android applications via reversing and k-gram birthmarking. In: Proceedings of research in adaptive and convergent systems, pp 336–341

  • Lee K, Ban Y, Lee S (2017) Efficient depth enhancement using a combination of color and depth information. Sensors 17(7):1–27

  • Lee S, Dolby J, Ryu S (2016) Hybridroid: static analysis framework for android hybrid applications. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, pp 250–261

  • Levin J (2015) Android internals—a confectioner’s cookbook. vol I. Cambridge, MA, USA

  • Li M, Li Q, Long Y (2017) Representation learning of multiword expressions with compositionality constraint. In: Proceedings of the international conference on knowledge science, engineering and management, pp 507–519

  • Lin Z, Lyu MR, King I (2012) Matchsim: a novel similarity measure based on maximum neighborhood matching. Knowl Inf Syst 32(1):141–166

  • Magdy W, Jones GJF (2010) Pres: A score metric for evaluating recall-oriented information retrieval applications. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, pp 611–618

  • Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  • Motta JM, Ladouceur J (2017) A CRF machine learning model reinforced by ontological knowledge for document summarization. In: Proceedings of the international conference artificial intelligence, pp 127–135

  • Narudin F, Feizollah A, Anuar N, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Comput Fusion Found Methodol Appl 20(1):343–357

  • Ng T (2016) Prefix distance between regular languages. In: Proceedings of the international conference on implementation and application of automata, pp 224–235

  • Rastogi V, Chen Y, Jiang X (2014) Catch me if you can: evaluating android anti-malware against transformation attacks. IEEE Trans Inf Forensics Secur 9(1):99–108

  • Sanz B, Santos I, Laorden C, Ugarte-Pedrero X, Bringas PGa (2012) On the automatic categorisation of android applications. In: Proceedings of the 9th annual IEEE consumer communications and networking conference-security and content protection, pp 149–153

  • Sarma B, Li N, Gates C, Potharaju R, Nita-Rotaru C, Molloy I (2012) Android permissions: a perspective combining risks and benefits. In: Proceedings of the 17th ACM symposium on access control models and technologies, pp 13–22

  • Sugiyama K, Kan M-Y (2013) Exploiting potential citation papers in scholarly paper recommendation. In: Proceedings of the 13th ACM/IEEE joint conference on digital libraries, pp 153–162

  • Wei J, He J, Kai C, Zhou Y, Tang Z (2017) Collaborative filtering and deep learning based recommendation system for cold start items. Expert Syst Appl 69(1):29–39

  • Wei T-E, Tyan H-R, Jeng A, Lee H-M, Liao H-Y, Wang J-C (2015) Droidexec: root exploit malware recognition against wide variability via folding redundant function-relation graph. In: Proceedings of the 17th international conference on advanced communication technology, pp 161–169

  • Wu D-J, Mao C-H, Wei T-E, Lee H-M, Wu K-P (2012) Droidmat: android malware detection through manifest and API calls tracing. In: Proceedings of the 7th Asia joint conference on information security, pp 62–96

  • Yerima S, Sezer S, McWilliams G, Igor M (2013) A new android malware detection approach using bayesian classification. In: Proceedings of the 27th IEEE international conference on advanced information networking and applications, pp 121–128

  • Yin P, Luo P, Lee W-C, Wang M (2013) App recommendation: a contest between satisfaction and temptation. In: Proceedings of the 6th ACM international conference on web search and data mining, pp 395–404

  • Zhang M, Duan Y, Yin H, Zhao Z (2014) Semantics-aware android malware classification using weighted contextual API dependency graphs. In: Proceedings of the ACM SIGSAC conference on computer and communications security, pp 1105–1116

  • Zheng M, Sun M, Lui J (2013) Droid analytics: a signature based analytic system to collect, extract, analyze and associate android malware. In: Proceedings of the 12th IEEE international conference on trust, security and privacy in computing and communications, pp 163–171

  • Zhou W, Zhou Y, Grace M, Jian X, Zou S (2013) Fast, scalable detection of piggybacked mobile applications. In: Proceedings of the 3rd ACM conference on data and application security and privacy, pp 185–196

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2015R1D1A1A02061946), and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (no. 2018R1A2B2004830).

Author information

Corresponding author

Correspondence to Seong-je Cho.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by A. Di Nola.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hamedani, M.R., Kim, G. & Cho, Sj. SimAndro: an effective method to compute similarity of Android applications. Soft Comput 23, 7569–7590 (2019). https://doi.org/10.1007/s00500-019-03755-4

  • DOI: https://doi.org/10.1007/s00500-019-03755-4
