
Analysing Android Apps Classification and Categories Validation by Using Latent Dirichlet Allocation

  • Conference paper
Computational Collective Intelligence (ICCCI 2023)

Abstract

A key step in publishing an app on Google Play Store (GPS) is the manual selection of its category. The category is highly relevant for users searching for a suitable app. To prevent misclassification, existing work has focused on automating the identification of app categories through different learning methods. However, most existing approaches do not consider validating the categories. This research proposes Latent Dirichlet Allocation (LDA) for validating categories and for identifying similar apps. LDA can provide human-understandable topics (mixtures of words), and it can help in discovering Android apps’ categories based on their descriptions. For diversity, the 5,940 most popular apps in the US and Romania were considered. LDA performance is evaluated under different scenarios defined by data set processing methods. The evaluation relies on the user-defined categories from GPS. LDA topics are labeled with category names based on these categories and by applying cosine similarity and human interpretation. Results show that a model with a corpus containing various parts of speech (i.e., nouns, adjectives, verbs) and improved with phrases can achieve a precision of 0.69. Moreover, the analysis hints that there might be discrepancies between the GPS guidelines regarding the categories’ content and their actual content, but further studies are required.
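The topic-labeling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy topics, the app records, and the helper names (`cosine`, `category_vectors`, `label_topics`) are all hypothetical, and the topic-word weights are written by hand where a real pipeline would take them from a trained LDA model. The sketch only shows how cosine similarity could match a topic's word distribution to the term counts of each store category.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def category_vectors(apps):
    """Sum the description tokens of all apps in each store category."""
    vectors = {}
    for category, tokens in apps:
        vectors.setdefault(category, Counter()).update(tokens)
    return vectors


def label_topics(topics, cat_vecs):
    """Give each topic the name of its most similar category."""
    return {
        topic_id: max(cat_vecs, key=lambda c: cosine(words, cat_vecs[c]))
        for topic_id, words in topics.items()
    }


# Hypothetical topic-word weights, standing in for LDA output.
topics = {
    0: {"puzzle": 0.5, "level": 0.3, "play": 0.2},
    1: {"bank": 0.6, "payment": 0.3, "card": 0.1},
}

# Hypothetical (store category, tokenized description) pairs.
apps = [
    ("Puzzle", ["puzzle", "level", "pipe", "play"]),
    ("Finance", ["bank", "card", "payment", "transfer"]),
    ("Puzzle", ["level", "play", "brain"]),
]

labels = label_topics(topics, category_vectors(apps))
print(labels)
```

On this toy input, topic 0 is labeled "Puzzle" and topic 1 "Finance". The abstract also mentions human interpretation as part of the labeling, which a purely automatic match like this one does not capture.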

Notes

  1. https://www.kaggle.com/datasets/elenaflondor/most-popular-applications-from-google-play-store

Corresponding author

Correspondence to Elena Flondor.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Flondor, E., Frincu, M. (2023). Analysing Android Apps Classification and Categories Validation by Using Latent Dirichlet Allocation. In: Nguyen, N.T., et al. Computational Collective Intelligence. ICCCI 2023. Lecture Notes in Computer Science, vol 14162. Springer, Cham. https://doi.org/10.1007/978-3-031-41456-5_22

  • DOI: https://doi.org/10.1007/978-3-031-41456-5_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41455-8

  • Online ISBN: 978-3-031-41456-5

  • eBook Packages: Computer Science, Computer Science (R0)
