Abstract
A key step in publishing on the Google Play Store (GPS) is the manual selection of the app category. The category is highly relevant for users searching for a suitable app. To prevent misclassification, existing work has focused on automating the identification of app categories through different learning methods. However, most existing approaches do not validate the categories themselves. This research proposes Latent Dirichlet Allocation (LDA) for validating categories and for identifying similar apps. LDA provides human-understandable topics (mixtures of words) and can help discover Android app categories from the app descriptions. For diversity, the 5,940 most popular apps in the US and Romania were considered. LDA performance is evaluated under different scenarios defined by the data set processing methods. The evaluation relies on the user-defined categories from GPS. LDA topics are labeled with category names based on these categories, using cosine similarity and human interpretation. Results show that a model whose corpus contains various parts of speech (i.e., nouns, adjectives, verbs) and is enriched with phrases can achieve a precision of 0.69. Moreover, the analysis hints that there may be discrepancies between the GPS guidelines on category content and the actual content of the categories, but further studies are required.
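The topic-labeling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each LDA topic is available as a word-to-weight mapping and that each GPS category is represented by a plain term-frequency vector built from its app descriptions. The function names (`cosine`, `category_vectors`, `label_topics`), the category names, and the toy data are all hypothetical, and the human-interpretation step is left out.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def category_vectors(descriptions):
    """Build one term-frequency vector per category from (category, tokens) pairs."""
    vectors = {}
    for category, tokens in descriptions:
        vectors.setdefault(category, Counter()).update(tokens)
    return vectors


def label_topics(topics, cat_vectors):
    """Give each topic the name of the category whose vector is most similar."""
    labels = {}
    for topic_id, word_weights in topics.items():
        labels[topic_id] = max(
            cat_vectors, key=lambda c: cosine(word_weights, cat_vectors[c])
        )
    return labels


# Hypothetical, already tokenized app descriptions with their store categories.
descriptions = [
    ("PUZZLE", ["pipe", "puzzle", "level", "connect"]),
    ("PUZZLE", ["puzzle", "brain", "level"]),
    ("FINANCE", ["bank", "payment", "account", "transfer"]),
    ("FINANCE", ["budget", "account", "payment"]),
]

# Hypothetical topics, standing in for the top words of a trained LDA model.
topics = {
    0: {"puzzle": 0.4, "level": 0.3, "pipe": 0.2, "account": 0.1},
    1: {"payment": 0.5, "account": 0.3, "bank": 0.2},
}

labels = label_topics(topics, category_vectors(descriptions))
print(labels)
```

In practice the topic weights would come from a trained topic model rather than being written by hand, and ambiguous topics would still need manual review.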
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Flondor, E., Frincu, M. (2023). Analysing Android Apps Classification and Categories Validation by Using Latent Dirichlet Allocation. In: Nguyen, N.T., et al. Computational Collective Intelligence. ICCCI 2023. Lecture Notes in Computer Science, vol 14162. Springer, Cham. https://doi.org/10.1007/978-3-031-41456-5_22
DOI: https://doi.org/10.1007/978-3-031-41456-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41455-8
Online ISBN: 978-3-031-41456-5
eBook Packages: Computer Science (R0)