Document Clustering Using Local and Universal Knowledge

Qazanfari, Kazem; Youssef, Abdou

doi:10.1007/978-3-319-96136-1_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10934)

Included in the following conference series:

International Conference on Machine Learning and Data Mining in Pattern Recognition

Abstract

In real-world text clustering problems, the distribution of the repository samples rarely matches the true distribution of the clusters’ concepts, which reduces the accuracy of document clustering methods. Let U(f) and L(f) be the distribution functions of the features extracted from universal knowledge and from local (repository) knowledge, respectively. Ideally, U(f) and L(f) would be identical; in practice, the two distribution functions are not equal and may differ considerably. In this paper, we show how the difference between these two distribution functions can decrease the accuracy of document clustering algorithms. To address this issue, we propose two methods that combine information from the local and the universal knowledge efficiently. In the first method, a special transform T is introduced to combine the similarities of each pair of documents derived from the local and the universal knowledge. In the second method, the local and the universal knowledge are combined per document, by concatenating each document’s feature vector derived from the local knowledge with its feature vector derived from the universal knowledge. The impact of the proposed methods on clustering is tested on two well-known datasets, Reuters and 20-Newsgroups. Experimental results show that when the feature vectors are generated from either local or universal knowledge alone, some documents may be assigned to the wrong cluster. In contrast, the proposed methods significantly improve document clustering performance, demonstrating the benefit of enhancing local knowledge with universal knowledge in an efficient way.
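The two combination strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not define the transform T, so the convex combination below is only a hypothetical stand-in, and the function names, the `alpha` weight, the per-view L2 normalization, and the toy matrices are all assumptions made for illustration rather than the authors' method.

```python
import numpy as np


def l2_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)


def cosine_similarity_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows (documents) of X."""
    Xn = l2_normalize(X)
    return Xn @ Xn.T


def combine_similarities(S_local: np.ndarray, S_universal: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Hypothetical stand-in for the transform T: a convex combination
    of the local and universal pairwise similarity matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * S_local + (1.0 - alpha) * S_universal


def concatenate_features(X_local: np.ndarray,
                         X_universal: np.ndarray) -> np.ndarray:
    """Per-document concatenation of local and universal feature vectors.
    Normalizing each view first is an assumption, made so that neither
    view dominates because of its scale."""
    return np.hstack([l2_normalize(X_local), l2_normalize(X_universal)])


# Toy data (invented): 3 documents, 3 local features, 2 universal features.
X_local = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 0.0],
                    [1.0, 0.0, 1.0]])
X_universal = np.array([[0.5, 0.5],
                        [1.0, 0.0],
                        [0.4, 0.6]])

# First strategy: combine pairwise similarities from the two views.
S_combined = combine_similarities(cosine_similarity_matrix(X_local),
                                  cosine_similarity_matrix(X_universal))

# Second strategy: concatenate the two feature vectors of each document.
X_concat = concatenate_features(X_local, X_universal)
```

Either output could then be passed to a standard clustering algorithm: `S_combined` to one that accepts a precomputed similarity matrix, `X_concat` to one that works on feature vectors. Under the assumptions of this sketch only, the cosine similarity of the concatenated vectors equals the combined similarity at `alpha = 0.5`; nothing in the abstract implies that the paper's two methods coincide in this way.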

Author information

Authors and Affiliations

The George Washington University, Washington, DC, 20052, USA
Kazem Qazanfari & Abdou Youssef

Corresponding author

Correspondence to Kazem Qazanfari.

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Leipzig, Germany
Petra Perner

Cite this paper

Qazanfari, K., Youssef, A. (2018). Document Clustering Using Local and Universal Knowledge. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10934. Springer, Cham. https://doi.org/10.1007/978-3-319-96136-1_14

DOI: https://doi.org/10.1007/978-3-319-96136-1_14
Published: 08 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96135-4
Online ISBN: 978-3-319-96136-1
eBook Packages: Computer Science, Computer Science (R0)
