Combining Latent Dirichlet Allocation and K-Means for Documents Clustering: Effect of Probabilistic Based Distance Measures

Bui, Quang Vu; Sayadi, Karim; Amor, Soufian Ben; Bui, Marc

doi:10.1007/978-3-319-54472-4_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10191)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

This paper presents an empirical study of eight distance measures used with the LDA + K-means model. We performed our analysis on two commonly used miscellaneous datasets. Our experimental results indicate that probabilistic-based distance measures outperform vector-based distance measures, including Euclidean distance, when clustering a set of documents in the topic space. Moreover, we investigate the effect of the number of topics and show that K-means combined with the output of the Latent Dirichlet Allocation model gives better results than both LDA + Naive and the Vector Space Model.
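The idea of clustering documents in the topic space can be illustrated with a minimal sketch. It is not the authors' implementation: the document-topic matrix `theta` is synthetic (drawn from Dirichlet distributions) and only stands in for the output of an LDA model, the Hellinger distance is chosen here as one example of a probabilistic measure, and the function names `hellinger`, `euclidean`, and `kmeans_topic_space` are hypothetical. The centroid update uses the plain mean of the member distributions, which is a simplification when the distance is not Euclidean.

```python
import numpy as np


def hellinger(p, q):
    """Pairwise Hellinger distance between rows of p (n, T) and rows of q (k, T)."""
    diff = np.sqrt(p)[:, None, :] - np.sqrt(q)[None, :, :]
    return np.sqrt(0.5 * np.sum(diff ** 2, axis=2))


def euclidean(p, q):
    """Pairwise Euclidean distance between rows of p (n, T) and rows of q (k, T)."""
    diff = p[:, None, :] - q[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))


def kmeans_topic_space(theta, k, distance, n_iter=50, seed=0):
    """K-means over document-topic distributions with a pluggable distance.

    Assumption: centroids are updated as the mean of their members, so each
    centroid stays a valid probability distribution over topics.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct documents (fancy indexing copies).
    centroids = theta[rng.choice(len(theta), size=k, replace=False)]
    labels = np.full(len(theta), -1)
    for _ in range(n_iter):
        new_labels = np.argmin(distance(theta, centroids), axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = theta[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Synthetic stand-in for LDA output: 60 documents over 3 topics,
# each group of 20 concentrated on a different topic.
rng = np.random.default_rng(42)
theta = np.vstack([
    rng.dirichlet([8.0, 1.0, 1.0], size=20),
    rng.dirichlet([1.0, 8.0, 1.0], size=20),
    rng.dirichlet([1.0, 1.0, 8.0], size=20),
])

labels_h, centroids_h = kmeans_topic_space(theta, k=3, distance=hellinger)
labels_e, centroids_e = kmeans_topic_space(theta, k=3, distance=euclidean)
print("Hellinger cluster sizes:", np.bincount(labels_h, minlength=3))
print("Euclidean cluster sizes:", np.bincount(labels_e, minlength=3))
```

Swapping the `distance` argument is the only change needed to compare measures on the same topic-space representation, which mirrors the kind of comparison the abstract describes.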

Notes

1. https://cran.r-project.org/web/packages/topicmodels/index.html

Author information

Authors and Affiliations

Hue University of Sciences, Hue, Vietnam
Quang Vu Bui
CHArt Laboratory EA 4004, EPHE, PSL Research University, Paris, France
Quang Vu Bui, Karim Sayadi & Marc Bui
LI-PARAD Laboratory, University of Versailles-Saint-Quentin-en-Yvelines, Versailles, France
Soufian Ben Amor
Paris-Saclay University, Paris, France
Soufian Ben Amor

Corresponding author

Correspondence to Karim Sayadi.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Japan Advanced Institute of Science and Technology, Nomi, Japan
Satoshi Tojo
Japan Advanced Institute of Science and Technology, Nomi, Japan
Le Minh Nguyen
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński

About this paper

Cite this paper

Bui, Q.V., Sayadi, K., Amor, S.B., Bui, M. (2017). Combining Latent Dirichlet Allocation and K-Means for Documents Clustering: Effect of Probabilistic Based Distance Measures. In: Nguyen, N., Tojo, S., Nguyen, L., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2017. Lecture Notes in Computer Science, vol 10191. Springer, Cham. https://doi.org/10.1007/978-3-319-54472-4_24

DOI: https://doi.org/10.1007/978-3-319-54472-4_24
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54471-7
Online ISBN: 978-3-319-54472-4
eBook Packages: Computer Science (R0)