Fusion Matrix–Based Text Similarity Measures for Clustering of Retrieval Results

Zhao, Yueyang; Cui, Lei

doi:10.1007/s11192-022-04596-z

Published: 07 December 2022

Volume 128, pages 1163–1186 (2023)
Scientometrics

Abstract

To address the deficiency in semantic representations of medical texts and achieve the clustering of PubMed database retrieval results, this study presented a method to construct a fusion matrix using text similarity measures. Similarity relations between phrases, texts, and the content of phrases and texts were combined to create a fusion matrix, and several clustering algorithms were trained to group a collection of texts from the PubMed database. Category annotations were then created to describe the meaning of each category of clustered texts. Experimental results showed that the fusion matrix-based clustering was superior in grouping the text sets, and clustering the training set was not necessary to improve clustering performance. Moreover, the extracted high-frequency words in the category descriptions distinguished the meanings of the categories well; therefore, the fusion matrix design was effective for clustering descriptions of academic texts. As only the PubMed database was used in this study, future research should extend the fusion matrix to other text repositories.
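
The abstract does not specify how the fusion matrix is built, so the following is only a minimal sketch of the general idea under stated assumptions: a word-level similarity matrix between texts (Jaccard, chosen here for illustration) and a phrase-level similarity matrix (Dice, also an illustrative choice) are combined by a weighted element-wise average, the fused matrix is clustered with a simple average-linkage procedure, and each cluster is described by its most frequent words. The function names, the equal weights, the toy documents, and the stop-word list are all hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice similarity of two sets (1.0 when both are empty)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def similarity_matrix(items, measure):
    """Pairwise similarity matrix for a list of sets."""
    n = len(items)
    return [[measure(items[i], items[j]) for j in range(n)] for i in range(n)]


def fuse(matrices, weights):
    """Assumed fusion rule: weighted element-wise average of the matrices."""
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
         for j in range(n)]
        for i in range(n)
    ]


def average_link_clusters(sim, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(sim))]

    def link(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


def describe(cluster, docs, stop_words, top=4):
    """Label a cluster with its most frequent non-stop words."""
    counts = Counter(
        w for i in cluster for w in docs[i].split() if w not in stop_words
    )
    return [w for w, _ in counts.most_common(top)]


# Hypothetical toy data standing in for retrieved titles and their key phrases.
docs = [
    "insulin therapy in type 2 diabetes",
    "type 2 diabetes and insulin resistance",
    "deep learning for tumor image segmentation",
    "tumor segmentation with deep learning networks",
]
phrases = [
    {"insulin therapy", "type 2 diabetes"},
    {"type 2 diabetes", "insulin resistance"},
    {"deep learning", "tumor segmentation"},
    {"tumor segmentation", "deep learning"},
]
stop_words = {"in", "and", "for", "with"}

text_sim = similarity_matrix([set(d.split()) for d in docs], jaccard)
phrase_sim = similarity_matrix(phrases, dice)
fused = fuse([text_sim, phrase_sim], weights=[0.5, 0.5])

clusters = average_link_clusters(fused, k=2)
labels = [describe(c, docs, stop_words) for c in clusters]
```

On this toy input the two diabetes texts and the two tumor-segmentation texts end up in separate clusters. Other similarity measures, weights, or clustering algorithms can be swapped in without changing the fusion step.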

Funding

This study was funded by the Liaoning Social Science Planning Fund project (Grant No. L20BTQ003).

Author information

Authors and Affiliations

Library, Shengjing Hospital of China Medical University, No. 36, Sanhao St., Heping Dist., Shenyang, 110004, Liaoning, China
Yueyang Zhao
Institute of Health Sciences, China Medical University, Shenyang, Liaoning, China
Lei Cui

Corresponding author

Correspondence to Yueyang Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest to disclose.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, Y., Cui, L. Fusion Matrix–Based Text Similarity Measures for Clustering of Retrieval Results. Scientometrics 128, 1163–1186 (2023). https://doi.org/10.1007/s11192-022-04596-z

Received: 31 May 2022
Accepted: 16 November 2022
Published: 07 December 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s11192-022-04596-z
