Document Clustering Using Hybrid LDA-Kmeans

Foong, Oi-Mean; Ismail, Alia Nabila

doi:10.1007/978-3-030-51974-2_12

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1226)

Included in the following conference series:

Computer Science On-line Conference

Abstract

This paper presents a Hybrid Latent Dirichlet Allocation-Kmeans (HLDA-Kmeans) algorithm for document clustering. Information overload has become a challenge for users because of the abundance of information on the Web and its heterogeneous nature. Researchers, such as academics and others involved in text analytics, find documents difficult to analyze because keywords and keyphrases are ambiguous. The objective is therefore to perform document clustering analysis using the HLDA-Kmeans algorithm in order to discover clusters in unlabelled text data, classify keyphrases by topic, and visualize the clustering results. Online news from the oil and gas sector is used as the dataset, with a 70%–30% split for training and testing. The performance of the proposed HLDA-Kmeans algorithm was assessed using precision, recall and F-score. Experimental results show that the proposed HLDA-Kmeans algorithm achieved satisfactory clustering results.
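
The abstract names precision, recall and F-score as the evaluation measures but does not define them. As a minimal sketch, assuming the standard definitions and assuming that each cluster is compared against a reference set of documents for one topic (the chapter's exact evaluation protocol is not stated in the abstract), the three scores could be computed as follows. The function names and document IDs are hypothetical and serve only as illustration.

```python
from typing import Set, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cluster_scores(cluster_docs: Set[str], topic_docs: Set[str]) -> Tuple[float, float, float]:
    """Score one cluster against a reference topic (assumed evaluation setup).

    cluster_docs: IDs of the documents the algorithm placed in the cluster.
    topic_docs:   IDs of the documents that truly belong to the topic.
    """
    tp = len(cluster_docs & topic_docs)   # correctly clustered documents
    fp = len(cluster_docs - topic_docs)   # in the cluster, but off-topic
    fn = len(topic_docs - cluster_docs)   # on-topic, but missing from the cluster
    return precision_recall_f1(tp, fp, fn)


if __name__ == "__main__":
    # Hypothetical document IDs, not taken from the paper's dataset.
    cluster = {"d1", "d2", "d3", "d4"}
    topic = {"d2", "d3", "d4", "d5", "d6"}
    p, r, f = cluster_scores(cluster, topic)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case three of the four clustered documents are on-topic, and three of the five on-topic documents are recovered, so precision is higher than recall; the F-score is their harmonic mean.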

Acknowledgement

The authors would like to thank YUTP (Cost Center 015LC0-173) for financial support of this research.

Author information

Authors and Affiliations

Computer and Information Sciences Department, Universiti Teknologi PETRONAS, Perak, Malaysia
Oi-Mean Foong & Alia Nabila Ismail

Corresponding author

Correspondence to Oi-Mean Foong.

Editor information

Editors and Affiliations

Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Radek Silhavy

Cite this paper

Foong, OM., Ismail, A.N. (2020). Document Clustering Using Hybrid LDA- Kmeans. In: Silhavy, R. (eds) Applied Informatics and Cybernetics in Intelligent Systems. CSOC 2020. Advances in Intelligent Systems and Computing, vol 1226. Springer, Cham. https://doi.org/10.1007/978-3-030-51974-2_12

DOI: https://doi.org/10.1007/978-3-030-51974-2_12
Published: 08 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-51973-5
Online ISBN: 978-3-030-51974-2
eBook Packages: Intelligent Technologies and Robotics (R0)
