Abstract
In almost all real-world text clustering problems, the distribution of the samples in the repository rarely matches the true distribution of the concepts underlying the clusters, which reduces the accuracy of document clustering methods. Let U(f) and L(f) be the distribution functions of the extracted features based on universal knowledge and on local (repository) knowledge, respectively. Ideally, U(f) and L(f) would be identical; in practice, they are not equal and may differ substantially. In this paper, we show how the difference between these two distribution functions can decrease the accuracy of document clustering algorithms. To address this issue, we propose two methods that combine information from local and universal knowledge efficiently. In the first method, a special transform T is introduced to combine the similarities of each pair of documents derived from the local and the universal knowledge. In the second method, local and universal knowledge are combined per document by concatenating each document's feature vector derived from local knowledge with its feature vector derived from universal knowledge. The impact of the proposed methods on clustering is tested on two well-known datasets, Reuters and 20-Newsgroups. Experimental results show that when the feature vectors are generated from local knowledge alone or from universal knowledge alone, some documents can be assigned to the wrong cluster. In contrast, the proposed methods significantly improve document clustering performance, demonstrating the benefit of efficiently enhancing local knowledge with universal knowledge.
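The two combination strategies described in the abstract can be illustrated with a small sketch. The abstract does not define the transform T, so `combine_similarities` below uses a plain convex combination with a hypothetical weight `alpha` purely as a stand-in; it is not the authors' transform. The toy vectors, the function names, and the choice of cosine similarity are likewise assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combine_similarities(sim_local, sim_universal, alpha=0.5):
    """Method 1 (similarity level). ASSUMPTION: a convex combination stands in
    for the paper's transform T, which the abstract does not specify."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_local + (1.0 - alpha) * sim_universal


def concatenate_features(local_vec, universal_vec):
    """Method 2 (feature level): join the local and universal vectors of one document."""
    return list(local_vec) + list(universal_vec)


# Toy feature vectors (hypothetical): the two documents look unrelated
# under local knowledge but identical under universal knowledge.
local = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
universal = {"d1": [1.0, 1.0], "d2": [1.0, 1.0]}

sim_local = cosine(local["d1"], local["d2"])
sim_universal = cosine(universal["d1"], universal["d2"])

# Method 1: combine the two pairwise similarities.
combined_sim = combine_similarities(sim_local, sim_universal, alpha=0.5)

# Method 2: compare the concatenated vectors directly.
concat_sim = cosine(
    concatenate_features(local["d1"], universal["d1"]),
    concatenate_features(local["d2"], universal["d2"]),
)

print(f"local={sim_local:.3f} universal={sim_universal:.3f}")
print(f"combined={combined_sim:.3f} concatenated={concat_sim:.3f}")
```

In this toy case the two sources disagree completely about the pair, and both combined scores fall between the two extremes. Either the combined similarities or the concatenated vectors could then be passed to a standard clustering algorithm.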
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Qazanfari, K., Youssef, A. (2018). Document Clustering Using Local and Universal Knowledge. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10934. Springer, Cham. https://doi.org/10.1007/978-3-319-96136-1_14
DOI: https://doi.org/10.1007/978-3-319-96136-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96135-4
Online ISBN: 978-3-319-96136-1
eBook Packages: Computer Science (R0)