Abstract
This paper presents a Hybrid Latent Dirichlet Allocation-Kmeans (HLDA-Kmeans) algorithm for document clustering. Information overload has become a challenge for users because of the abundance of information on the Web and its heterogeneous nature. Researchers, including academics and others involved in text analytics, have difficulty analyzing documents because of ambiguity in keywords and keyphrases. The objective is therefore to perform document clustering analysis using the HLDA-Kmeans algorithm in order to discover clusters in unlabelled text data, classify keyphrases by topic, and visualize the clustering results. Online oil and gas news is used as the dataset, with a 70%–30% split for training and testing. The performance of the proposed HLDA-Kmeans algorithm was assessed using precision, recall, and F-score. Experimental results show that the proposed HLDA-Kmeans algorithm achieved satisfactory clustering results.
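The abstract names precision, recall, and F-score as the evaluation measures but does not reproduce the formulas. The sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-score as their harmonic mean). The function name and the counts are hypothetical and are not taken from the paper, and how the authors map clusters to reference topics to obtain these counts is not stated here.

```python
def precision_recall_fscore(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-score) from raw counts.

    Assumes the standard definitions; the paper's exact formulation
    is not given in the abstract.
    """
    predicted = true_pos + false_pos   # documents assigned to the cluster
    relevant = true_pos + false_neg    # documents that belong to the topic
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical counts for a single cluster (illustrative only).
p, r, f = precision_recall_fscore(true_pos=8, false_pos=2, false_neg=4)
print(f"precision={p:.3f} recall={r:.3f} f_score={f:.3f}")
```

With these illustrative counts, 8 of the 10 documents placed in the cluster are correct and 8 of the 12 documents that belong to the topic are recovered, so precision is higher than recall and the F-score falls between the two.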
Acknowledgement
The authors would like to thank YUTP (Cost Center 015LC0-173) for financial support of this research.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Foong, OM., Ismail, A.N. (2020). Document Clustering Using Hybrid LDA- Kmeans. In: Silhavy, R. (eds) Applied Informatics and Cybernetics in Intelligent Systems. CSOC 2020. Advances in Intelligent Systems and Computing, vol 1226. Springer, Cham. https://doi.org/10.1007/978-3-030-51974-2_12
DOI: https://doi.org/10.1007/978-3-030-51974-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-51973-5
Online ISBN: 978-3-030-51974-2
eBook Packages: Intelligent Technologies and Robotics (R0)