Combining semantic and term frequency similarities for text clustering

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

A key challenge in document clustering is finding a proper similarity measure for text documents that enables the generation of cohesive groups. Measures based on the classic bag-of-words model take into account solely the presence (and frequency) of words in documents. In doing so, semantically similar documents that use different vocabularies may end up in different clusters. For this reason, semantic similarity measures that use external knowledge, such as word n-gram corpora or thesauri, have been proposed in the literature. In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents as well as on the Google n-gram corpus as an additional semantic similarity source. Clustering algorithms are applied to several real datasets in order to experimentally evaluate the quality of the clusters obtained with the proposed measure and compare it with a number of state-of-the-art measures from the literature. The experimental results, supported by statistical tests, demonstrate that the proposed measure significantly improves the quality of document clustering. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.

Notes

  1. For the sake of compactness, Table 2 assumes the relaxed implementation of FGTM (Method 2). If the greedy implementation (Method 1) is used instead, simply replace the terms \(T'^2\) with \(T'^2\log (T')\) in the respective complexities.

  2. As a practical example, it is worth reporting that the WR method [39] took about 18 days to compute all the pairwise document similarity values for the largest dataset used in our experiments (20ng-whole, with \(\approx \)19K documents), on a Dell machine with an i7 3.60 GHz CPU and 16 GB of RAM. This is about 300 times as much time as required by our proposed measure, FGTM.

  3. https://radimrehurek.com/gensim/ (accessed on December 18, 2017).

  4. https://github.com/jhlau/doc2vec (accessed on December 18, 2017).

  5. https://github.com/PrincetonML/SIF (accessed on December 18, 2017).

  6. http://qwone.com/~jason/20Newsgroups/ (accessed on August 23, 2016).

  7. https://vhasoares.github.io/downloads.html (accessed on January 08, 2018).

  8. http://vis.icmc.usp.br/vicg/tool/7/data-sets-from-papers (accessed on January 08, 2018).

  9. http://www.ncbi.nlm.nih.gov/pubmed (accessed on January 08, 2018).

  10. http://www.daviddlewis.com/resources/testcollections/reuters21578/ (accessed on August 23, 2016).

  11. http://www.scopus.com/.

  12. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ (accessed on August 23, 2016).

  13. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/ (accessed on August 23, 2016).

  14. For example, the stopword list used in this work contains 318 words.

  15. The number of fuzzy c-means iterations is defined as 50 in this work.

  16. Although the consensus method is not part of the original LDC algorithm, it may be implemented as an additional step. This is done in the present work.

References

  1. Aggarwal CC, Zhai CX (2012) Mining text data. Springer, Berlin

  2. Arora S, Liang Y, Ma T (2017) A simple but tough-to-beat baseline for sentence embeddings. In: International conference on learning representations

  3. Banerjee S, Pedersen T (2003) Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the 18th international joint conference on artificial intelligence, IJCAI’03. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 805–810. http://dl.acm.org/citation.cfm?id=1630659.1630775. Accessed 24 Sept 2018

  4. Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the 7th international conference on database theory, ICDT ’99. Springer, London, UK, pp 217–235

  5. Bishop CM (2006) Pattern recognition and machine learning. No. 4 in Information science and statistics, Springer. https://doi.org/10.1117/1.2819119. http://www.library.wisc.edu/selectedtocs/bg0137.pdf. ISBN: 0-387-31073-8. Accessed 24 Sept 2018

  6. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  7. Brants T, Franz A (2006) Web 1T 5-gram corpus version 1. Technical Report, Google Research

  8. Cai D, He X, Han J (2008) Training linear discriminant analysis in linear time. In: IEEE 24th international conference on data engineering, 2008, ICDE 2008. IEEE, pp 209–217

  9. Carpenter GA, Grossberg S (1987) A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process 37(1):54–115. https://doi.org/10.1016/S0734-189X(87)80014-2

  10. Cormack GV, Hidalgo JMG, Sánz EP (2007) Feature engineering for mobile (SMS) spam filtering. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, ACM, New York, NY, USA, SIGIR ’07, pp 871–872. https://doi.org/10.1145/1277741.1277951

  11. Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge

  12. Ferreira R, Lins RD, Freitas F, Simske SJ, Riss M (2014) A new sentence similarity assessment measure based on a three-layer sentence representation. In: Proceedings of the 2014 acm symposium on document engineering, ACM, New York, NY, USA, DocEng ’14, pp 25–34. https://doi.org/10.1145/2644866.2644881

  13. Ho C, Murad MAA, Kadir RA, Doraisamy SC (2010) Word sense disambiguation-based sentence similarity. In: Proceedings of the 23rd international conference on computational linguistics: posters, association for computational linguistics, Stroudsburg, PA, USA, COLING ’10, pp 418–426. http://dl.acm.org/citation.cfm?id=1944566.1944614. Accessed 24 Sept 2018

  14. Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802

  15. Hochberg Y, Tamhane AC (1987) Multiple comparison procedures. Wiley, New York

  16. Hollander M, Wolfe DA (1999) Nonparametric statistical methods. Wiley series in probability and statistics, Wiley, New York. A Wiley-Interscience publication. http://opac.inria.fr/record=b1095753. Accessed 24 Sept 2018

  17. Horn RA, Johnson CRCR (2012) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781139020411

  18. Hotho A, Nürnberger A, Paaß G (2005) A brief survey of text mining. LDV Forum GLDV J Comput Linguist Lang Technol 20(1):19–62

  19. Hu J, Fang L, Cao Y, Zeng HJ, Li H, Yang Q, Chen Z (2008) Enhancing text clustering by leveraging wikipedia semantics. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, SIGIR ’08, pp 179–186. https://doi.org/10.1145/1390334.1390367

  20. Huang A, Milne D, Frank E, Witten IH (2008) Clustering documents with active learning using wikipedia. In: Proceedings of the 2008 Eighth IEEE international conference on data mining , ICDM ’08. IEEE Computer Society, Washington, DC, USA, pp 839–844. https://doi.org/10.1109/ICDM.2008.80

  21. Huang L, Milne D, Frank E, Witten IH (2012) Learning a concept-based document similarity measure. J Am Soc Inf Sci Technol 63(8):1593–1608. https://doi.org/10.1002/asi.22689

  22. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218. https://doi.org/10.1007/bf01908075

  23. Islam A, Inkpen D (2008) Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans Knowl Discov Data 2(2):10:1–10:25. https://doi.org/10.1145/1376815.1376819

  24. Islam A, Milios E, Keselj V (2012) Text similarity using Google tri-grams. In: Proceedings of the 25th Canadian conference on advances in artificial intelligence, Canadian AI’12. Springer, Berlin, Heidelberg, pp 312–317

  25. Kaplan A (1955) An experimental study of ambiguity and context. Mech Transl 2:39–46. http://www.mt-archive.info/MT-1955-Kaplan.pdf

  26. Kogan J, Nicholas C, Volkovich V (2003) Text mining with information-theoretic clustering. Comput Sci Eng 5(6):52–59. https://doi.org/10.1109/MCISE.2003.1238704

  27. Krishnapuram R, Joshi A, Nasraoui O, Yi L (2001) Low-complexity fuzzy relational clustering algorithms for web mining. Trans Fuzzy Syst 9(4):595–607. https://doi.org/10.1109/91.940971

  28. Lafore R (2002) Data structures and algorithms in Java, 2nd edn. Sams, Indianapolis

  29. Lau JH, Baldwin T (2016) An empirical evaluation of doc2vec with practical insights into document embedding generation. CoRR arXiv:1607.05368

  30. Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on international conference on machine learning, ICML’14, Vol 32. JMLR.org, pp II–1188–II–1196. http://dl.acm.org/citation.cfm?id=3044805.3045025. Accessed 24 Sept 2018

  31. Lee MD, Welsh M (2005) An empirical evaluation of models of text document similarity. In: CogSci2005, Erlbaum, pp 1254–1259

  32. Li Y, McLean D, Bandar ZA, O’Shea JD, Crockett K (2006) Sentence similarity based on semantic nets and corpus statistics. IEEE Trans Knowl Data Eng 18(8):1138–1150. https://doi.org/10.1109/TKDE.2006.130

  33. Liu L, Shell D (2010) Assessing optimal assignment under uncertainty: an interval-based algorithm. In: Proceedings of robotics: science and systems, Zaragoza, Spain. https://doi.org/10.15607/RSS.2010.VI.016

  34. Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

  35. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York

  36. Meng L, Tan AH, Xu D (2014) Semi-supervised heterogeneous fusion for multimedia data co-clustering. IEEE Trans Knowl Data Eng 26(9):2293–2306

  37. Meng L, Tan AH, Wunsch DC (2015) Adaptive scaling of cluster boundaries for large-scale social media data clustering. IEEE Trans Neural Netw Learn Syst 27(12):2656–2669

  38. Mihalcea R, Corley C, Strapparava C (2006) Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st national conference on artificial intelligence, Vol 1, AAAI’06. AAAI Press, pp 775–780. http://dl.acm.org/citation.cfm?id=1597538.1597662. Accessed 24 Sept 2018

  39. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781

  40. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41. https://doi.org/10.1145/219717.219748

  41. Milne D, Witten IH (2013) An open-source toolkit for mining wikipedia. Artif Intell 194:222–239. https://doi.org/10.1016/j.artint.2012.06.007

  42. Naldi MC, Campello RJGB, Hruschka ER, Carvalho ACPLF (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952

  43. Nourashrafeddin S (2014) Interactive user-supervised text document clustering. Ph.D. thesis, Dalhousie University

  44. Nourashrafeddin S, Milios E, Arnold D (2013) Interactive text document clustering using feature labeling. In: Proceedings of the 2013 ACM symposium on document engineering, DocEng ’13. ACM, New York, NY, USA, pp 61–70. https://doi.org/10.1145/2494266.2494279

  45. Nourashrafeddin S, Milios E, Arnold DV (2014) An ensemble approach for text document clustering using wikipedia concepts. In: Proceedings of the 2014 ACM symposium on document engineering, DocEng ’14. ACM, New York, NY, USA, pp 107–116. https://doi.org/10.1145/2644866.2644868

  46. Paulovich FV, Nonato LG, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575. https://doi.org/10.1109/TVCG.2007.70443

  47. Rakib M, Islam A, Milios E (2015) TrWP: text relatedness using word and phrase relatedness. In: Proceedings of the SemEval-2015. ACM, New York, NY, USA, pp 90–95. https://doi.org/10.1145/2644866.2644881

  48. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. https://doi.org/10.1038/323533a0

  49. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0

  50. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc., New York

  51. Tang B, Shepherd M, Milios E, Heywood M (2005) Comparing and combining dimension reduction techniques for efficient text clustering. In: SIAM international workshop on feature selection for data mining - interfacing machine learning and statistics, Newport Beach, California, in conjunction with 2005 SIAM international conference on data mining, pp 1–10

  52. Walpole RE, Myers RH, Myers SL, Ye K (2007) Probability and statistics for engineers and scientists, 8th edn. Pearson Education, Upper Saddle River

  53. Wei T, Lu Y, Chang H, Zhou Q, Bao X (2015) A semantic approach for text clustering using wordnet and lexical chains. Expert Syst Appl 42(4):2264–2275. https://doi.org/10.1016/j.eswa.2014.10.023

  54. Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics, ACL ’94. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 133–138. https://doi.org/10.3115/981732.981751

Acknowledgements

The authors acknowledge the Brazilian Research Agencies Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001, CNPq, FAPEMIG and FAPESP, the Natural Sciences and Engineering Research Council of Canada, the Boeing Company, CALDO, and the International Development Research Centre, Ottawa, Canada, for their financial support to this work.

Author information

Correspondence to Victor Hugo Andrade Soares.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs: Propositions

The proofs of Proposition 1 (\(S(P, R) \in [0, 1]\)) and Proposition 2 (the Greedy and Relaxed methods preserve \(S(P, R) \in [0, 1]\)), stated in Sect. 3.2, are presented here. In both cases, we need to show that \(0 \le S(P, R) \le 1\) holds. Based on the definition in Eq. 6, we must prove that:

$$\begin{aligned} 0\le & {} S(P, R) \le 1\nonumber \\ 0\le & {} \frac{S_{\mathrm{SF}}(P,R) \cdot (p_{\mathrm{Total}} + r_{\mathrm{Total}})}{2 \cdot p_{\mathrm{Total}} \cdot r_{\mathrm{Total}}} \le 1 \end{aligned}$$
(10)

In order to prove Inequality 10, we split it into Inequalities 11 and 13 and develop them separately, for the sake of clarity. Starting with Inequality 11:

$$\begin{aligned} 0 \le \frac{S_{\mathrm{SF}}(P,R) \cdot (p_{\mathrm{Total}} + r_{\mathrm{Total}})}{2 \cdot p_{\mathrm{Total}} \cdot r_{\mathrm{Total}}} \end{aligned}$$
(11)

We assume that both documents are not empty, i.e., \(p_{\mathrm{Total}} > 0\) and \(r_{\mathrm{Total}} > 0\). Thus, proving Inequality 11 is equivalent to proving that \(S_{\mathrm{SF}}(P,R) \ge 0\). Based on the definition of \(S_{\mathrm{SF}}(P,R)\) in Eq. 4, this amounts to proving that the following statement is true:

$$\begin{aligned} \sum _{i=1}^m \sum _{j=1}^n \hbox {Sim}(p_i, r_j) \cdot \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \cdot \delta (i,j) \ge 0 \end{aligned}$$
(12)

where the similarity \(\hbox {Sim}(p_i, r_j)\) is in the range [0, 1] (as defined in Eq. (1)), the frequency of a given document term is nonnegative (\(\hbox {freq}(\cdot ) \ge 0\)), and \(\delta (i,j) = 1\) if the term \(p_i\) is associated with the term \(r_j\), \(\delta (i,j) = 0\) otherwise. Thus, every summand on the left side of Inequality 12 is nonnegative, which makes \(S_{\mathrm{SF}}(P,R) \ge 0\) true and, consequently, \(S(P, R) \ge 0\) as well. The lower bound occurs for \(S_{\mathrm{SF}}(P,R) = 0\) (when there is no semantic relation between the terms of the documents), which results in \(S(P, R) = 0\). This holds for Propositions 1 and 2 of Sect. 3.2.

Using the definition in Eq. 6, proving \(S(P, R) \le 1\) is equivalent to demonstrating that:

$$\begin{aligned} \frac{S_{\mathrm{SF}}(P,R) \cdot (p_{\mathrm{Total}} + r_{\mathrm{Total}})}{2 \cdot p_{\mathrm{Total}} \cdot r_{\mathrm{Total}}} \le 1 \end{aligned}$$
(13)

which is equivalent to:

$$\begin{aligned} S_{\mathrm{SF}}(P,R) \cdot (p_{\mathrm{Total}} + r_{\mathrm{Total}}) \le 2 \cdot p_{\mathrm{Total}} \cdot r_{\mathrm{Total}} \end{aligned}$$

Inequality 13 is true if \(S_{\mathrm{SF}}(P,R) \le \hbox {min}(p_{\mathrm{Total}}, r_{\mathrm{Total}})\). Thus, replacing \(S_{\mathrm{SF}}(P,R)\) with its definition in Eq. 4, we have to prove that:

$$\begin{aligned} \sum _{i=1}^m \sum _{j=1}^n \hbox {Sim}(p_i, r_j) \cdot \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \cdot \delta (i,j) \le \hbox {min}(p_{\mathrm{Total}}, r_{\mathrm{Total}}) \end{aligned}$$
(14)

Once more, note that \(\hbox {Sim}(p_i, r_j)\) is in the range [0, 1] (as defined in Eq. (1)), and thus the following inequality is true:

$$\begin{aligned} \sum _{i=1}^m\sum _{j=1}^n \hbox {Sim}(p_i, r_j) \cdot \delta (i,j) \le \sum _{i=1}^m \sum _{j=1}^n \delta (i,j) \end{aligned}$$
(15)

Inequality 14 can be simplified based on Inequality 15, which results in Inequality 16.

$$\begin{aligned} \sum _{i=1}^m{\sum _{j=1}^n{ \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \cdot \delta (i,j) \le \hbox {min}(p_{\mathrm{Total}}, r_{\mathrm{Total}}) }} \end{aligned}$$
(16)

Replacing \(p_{\mathrm{Total}}\) and \(r_{\mathrm{Total}}\) in Inequality 16 by their definitions results in Inequality 17.

$$\begin{aligned} \sum _{i=1}^m{\sum _{j=1}^n{ \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \cdot \delta (i,j) \le \hbox {min}\left( \sum _{i=1}^m{\hbox {freq}(p_i)}, \sum _{j=1}^n{\hbox {freq}(r_j)}\right) }} \end{aligned}$$
(17)

The \(\delta (i,j)\) values result from the assignment Problem (2), described in Sect. 3.2, i.e., \(\delta (i, j)= 1\) if \(p_i\) is optimally associated with \(r_j\), and \(\delta (i, j) = 0\) otherwise. The FGTM assignment requires that each term \(p_i \in P\) be associated with at most one term \(r_j \in R\) and vice versa, which is a constraint of the assignment problem. Like the exact solution, the Greedy Strategy proposed in Sect. 3.2 satisfies this constraint. Thus, \(\sum _{i=1}^m \sum _{j=1}^n \delta (i,j) \le \hbox {min}(m,n)\) is true, which means that the maximum number of associated pairs for FGTM is the number of terms in the document with fewer terms. Only associated terms, with \(\delta (i,j) = 1\), contribute to the sum on the left side of Inequality 17. Thus, unassociated terms, with \(\delta (i,j) = 0\), can be removed from the sum in Inequality 17 without loss, which results in Inequality 18.

$$\begin{aligned} \sum _{i,j \,:\, \delta (i,j) = 1} \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \le \hbox {min}\left( \sum _{i=1}^m{\hbox {freq}(p_i)}, \sum _{j=1}^n{\hbox {freq}(r_j)}\right) \end{aligned}$$
(18)

Analyzing Inequality 18: on the left side, we have the sum, over the at most \(\hbox {min}(m,n)\) associated pairs, of the smaller of the two frequencies in each pair; on the right side, we have the smaller of the total frequency sums of the two documents. For each associated pair, the value \(\hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j))\) added to the left side is matched by frequencies of equal or greater value (\(\hbox {freq}(p_i)\) and \(\hbox {freq}(r_j)\), respectively) added to each of the sums on the right side, and each term contributes to at most one pair. Thus, Inequality 18 is true, which means that Inequality 13 is also true.

The previous proofs suffice to show that Inequality 10 is true when the Exact Method (Proposition 1) or the Greedy Strategy (Proposition 2) is used for the FGTM calculation, but not for the Relaxed Strategy, as the latter relaxes the assignment constraint of the association problem (that a term cannot be associated with more than one term). As described in Sect. 3.2, the Relaxed Strategy computes the similarities from P to R and vice versa and selects the lower result. The two passes of this strategy satisfy Inequalities 19 and 20 when computing \(S_{\mathrm{SF}}(P, R)\) and \(S_{\mathrm{SF}}(R, P)\), respectively:

$$\begin{aligned} S_{\mathrm{SF}}(P, R) = \sum _{i=1}^m \hbox {Sim}(p_i, r_j) \cdot \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \le p_{\mathrm{Total}} \end{aligned}$$
(19)

where \(r_j = \mathrm{arg} \max _{r_j} \hbox {Sim}(p_i, r_j)\). By definition, \(\hbox {Sim}(p_i, r_j) \in [0,1]\) and \(p_{\mathrm{Total}} = \sum _{i=1}^m \hbox {freq}(p_i)\); thus, Inequality 19 is true. Likewise, \(S_{\mathrm{SF}}(R, P)\) is computed as follows:

$$\begin{aligned} S_{\mathrm{SF}}(R, P) = \sum _{j=1}^n \hbox {Sim}(p_i, r_j) \cdot \hbox {min}(\hbox {freq}(p_i), \hbox {freq}(r_j)) \le r_{\mathrm{Total}} \end{aligned}$$
(20)

where \(p_i = \mathrm{arg} \max _{p_i} \hbox {Sim}(p_i, r_j)\). Analogously, \(\hbox {Sim}(p_i, r_j) \in [0,1]\) and \(r_{\mathrm{Total}} = \sum _{j=1}^n \hbox {freq}(r_j)\); thus, Inequality 20 is also true.

The result of the Relaxed Strategy is the lower of the two values, i.e., \(\hbox {min}\{S_{\mathrm{SF}}(P,R), S_{\mathrm{SF}}(R,P)\}\). Therefore, as \(S_{\mathrm{SF}}(P, R) \le p_{\mathrm{Total}}\) and \(S_{\mathrm{SF}}(R, P) \le r_{\mathrm{Total}}\), it follows that \(\hbox {min}\{S_{\mathrm{SF}}(P,R), S_{\mathrm{SF}}(R,P)\} \le \hbox {min}(p_{\mathrm{Total}}, r_{\mathrm{Total}})\), which satisfies Inequality 13. Thus, Proposition 2 is also true for the Relaxed Strategy.
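As an illustration, the two-pass relaxed computation of Eqs. 19 and 20 combined with Eq. 6 can be sketched in Python. This is a minimal sketch under stated assumptions: `sim` is a stand-in for the tri-gram-based term similarity \(\hbox {Sim}(p_i, r_j)\) (the real measure queries the Google n-gram corpus), and the function names are ours, not taken from the paper's implementation.

```python
# Minimal sketch of the relaxed FGTM strategy (Eqs. 19-20 and Eq. 6).
# `sim(p, r)` is a hypothetical term-similarity function returning values in [0, 1].

def s_sf_relaxed(freq_p, freq_r, sim):
    """One directional pass: each term of P picks its most similar term of R
    (the unique-assignment constraint is relaxed)."""
    total = 0.0
    for p, fp in freq_p.items():
        # Best-matching term of R for p.
        r, s = max(((r, sim(p, r)) for r in freq_r), key=lambda x: x[1])
        total += s * min(fp, freq_r[r])
    return total

def fgtm_relaxed(freq_p, freq_r, sim):
    """Relaxed FGTM similarity S(P, R): the lower of the two passes,
    normalized as in Eq. 6."""
    p_total = sum(freq_p.values())
    r_total = sum(freq_r.values())
    s_sf = min(s_sf_relaxed(freq_p, freq_r, sim),
               s_sf_relaxed(freq_r, freq_p, lambda r, p: sim(p, r)))
    return s_sf * (p_total + r_total) / (2 * p_total * r_total)
```

With a toy similarity that is 1 for identical terms and 0 otherwise, two identical documents yield \(S(P,R) = 1\) and two disjoint documents yield 0, matching the bounds proved above.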

For all strategies, the upper bound of \(S(P,R)\) is attained when \(S_{\mathrm{SF}}(P,R) = p_{\mathrm{Total}} = r_{\mathrm{Total}}\), which means that all the terms of both documents have maximum similarity, i.e., \(\hbox {Sim}(p_i, r_j) = 1\) for each associated pair and all terms in both documents are associated. This scenario happens when both documents are equivalent, syntactically or semantically. In this case, we can set \(x = S_{\mathrm{SF}}(P,R) = p_{\mathrm{Total}} = r_{\mathrm{Total}}\) and substitute into Eq. 6, which results in:

$$\begin{aligned} S(P,R) = \frac{x \cdot (x + x)}{2\cdot x \cdot x} = \frac{x \cdot (2 \cdot x)}{2\cdot x \cdot x} = 1. \end{aligned}$$

Appendix B: Detailed computational complexity analysis

This appendix presents the estimation of the asymptotic complexities of the algorithms in Table 2.

B.1 Preprocessing complexity analysis

Two preprocessing methods are considered in this paper: BOW and BOC. In BOW, the preprocessing performs \(\sum _i^N{T_i}\) operations to separate and count the terms, where N is the number of documents and \(T_i\) is the number of terms (including repetitions) contained in the \(i^{th}\) document. To remove the stopwords, \(S \cdot \log {T'}\) operations are performed, where S is the length of the stopword listFootnote 14 and \(T'\) is the total number of terms, without repetitions, in the document set. The computational complexity of TF–IDF normalization is \(O(NT')\). The VAR-TFIDF feature selection [26] also has complexity \(O(NT')\). Thus, the total complexity of the BOW preprocessing is \(O((T_1 + T_2 + \cdots + T_N) + (S \cdot \log {T'}) + (NT') + (NT'))\), which is equivalent to \(O(N\bar{T} + NT')\), where \(\bar{T}\) is the average number of terms per document. As the search and removal of stopwords are performed in logarithmic time and the number N of documents in a collection is usually greater than the number S of stopwords, the \(S \cdot \log {T'}\) cost is omitted from the O notation. Since the values of \(\bar{T}\) and \(T'\) depend on the characteristics of the document set, neither can be assumed to be greater than the other. In a set of short documents with a high diversity of subjects and vocabulary, the number of terms \(T'\) is usually much greater than the average document size \(\bar{T}\). Conversely, in a set of long documents with little diversity of subjects and a small vocabulary, the average document size \(\bar{T}\) may be greater than the number of terms \(T'\).
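The BOW pipeline described above (term counting, stopword removal, TF–IDF normalization) can be sketched as follows. This is a simplified illustration, not the paper's exact implementation: the whitespace tokenization and the particular TF–IDF weighting (raw frequency times \(\log (N/\hbox {df})\)) are our assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter

def bow_preprocess(documents, stopwords):
    """Sketch of the BOW pipeline: term counting, stopword removal, TF-IDF."""
    # Term separation and counting: O(T_1 + ... + T_N) over all documents.
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Stopword removal (the paper performs the lookups in logarithmic time).
    for c in counts:
        for w in stopwords:
            c.pop(w, None)
    # TF-IDF normalization over the N x T' document-term table: O(N * T').
    n = len(documents)
    vocab = set().union(*counts)
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    return [{w: c[w] * math.log(n / df[w]) for w in c} for c in counts]
```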

In the BOC method, the Wikipedia Miner toolkit [41] is used to identify concepts in the documents. Given as input the \(i^{th}\) document with length \(T_i\), the Wikipedia Miner toolkit builds a list with all possible n-grams in the document. An n-gram is a sequence of words found in the text that is not separated by punctuation marks or line breaks. For example, from the sentence “I love cookies,” the following n-grams are obtained:

  • “I love cookies”

  • “I love”

  • “love cookies”

  • “I”

  • “love”

  • “cookies”

As there is no maximum value of n, in the worst case, where the document contains no punctuation mark or line break, the n-gram list will contain the following number of n-grams:

$$\begin{aligned} \sum _{j=1}^{T_i}{j} = \frac{T_i(T_i + 1)}{2} = O(T_i^2) . \end{aligned}$$
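The enumeration above can be reproduced with a short sketch; `all_ngrams` is a hypothetical helper written for illustration, not part of the Wikipedia Miner toolkit.

```python
def all_ngrams(sentence):
    """Enumerate every contiguous word n-gram of a punctuation-free sentence,
    longest first: T(T+1)/2 n-grams in total for T words."""
    words = sentence.split()
    t = len(words)
    return [" ".join(words[i:i + n])
            for n in range(t, 0, -1)   # n-gram length, from T down to 1
            for i in range(t - n + 1)] # start position of each n-gram
```

For “I love cookies” this yields exactly the six n-grams listed above, and a five-word sentence yields \(5 \cdot 6 / 2 = 15\) n-grams.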

After constructing the n-gram list, the Wikipedia Miner Toolkit searches for concepts related to each of the n-grams found. The tool uses a hash map structure to query the concepts. Assuming that the complexity of querying concepts approaches O(1) [28] and that the tool measures the relatedness among all concepts found, the complexity of measuring the relatedness among all concepts in the \(i^{th}\) document is \(O(C_i^2)\), where \(C_i\) is the number of concepts found in the \(i^{th}\) document. After computing the relatedness, the tool iterates over the concepts and removes those with lower weight, using a user-defined threshold. The removal of concepts has complexity \(O(C_i)\).

After applying the Wikipedia Miner Toolkit to all N documents, the document-concept table is constructed and normalized by the TF–IDF method. This operation has complexity \(O(NC')\), where \(C'\) is the number of relevant concepts in the document set. Thus, the total complexity of preprocessing by the BOC method is estimated as \(O((T_1^2 + T_2^2 + \cdots + T_N^2) + (C_1^2 + C_2^2 + \cdots + C_N^2) + (C_1 + C_2 + \cdots + C_N) + (NC'))\). As the size of each document may vary, it is not possible to simplify this expression. If we assume that all documents have the same size and number of concepts, the complexity can be represented as \(O((N\bar{T}^2) + (N\bar{C}^2) + (N\bar{C}) + (NC'))\), where \(\bar{T}\) and \(\bar{C}\) are the average numbers of words and concepts, respectively, in the document set. As the number of concepts in a document is lower than its number of words, it follows that \(N\bar{T}^2 \ge N\bar{C}^2 \ge N\bar{C}\); thus, \(N\bar{C}^2\) and \(N\bar{C}\) are omitted from the O notation. As with the BOW model, it is not possible to state that \(\bar{T}^2 \ge C'\); therefore, the BOC preprocessing complexity is estimated as \(O(N\bar{T}^2 + NC')\).

B.2 Similarity measures complexity analysis

Considering that m and n are the numbers of terms in documents P and R, respectively, and \(T = |P \cup R|\), i.e., T is the number of terms in the union of P and R, the similarity between two documents using the Euclidean distance is calculated with \(2 \cdot T\) operations; therefore, the asymptotic complexity is O(T). The cosine similarity calculation performs \(3 \cdot T\) operations, and thus its asymptotic complexity is also O(T).
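As a sketch of the O(T) cosine computation, assuming documents are represented as term-frequency dictionaries (a simplification of the weighted vectors used in the paper):

```python
import math

def cosine_similarity(p, r):
    """Cosine similarity between two term-frequency dicts.
    Each sum iterates over one vocabulary, so the total work is
    linear in T, the size of the union of the two vocabularies."""
    dot = sum(f * r.get(w, 0) for w, f in p.items())
    norm_p = math.sqrt(sum(f * f for f in p.values()))
    norm_r = math.sqrt(sum(f * f for f in r.values()))
    return dot / (norm_p * norm_r) if norm_p and norm_r else 0.0
```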

For both the GTM and FGTM measures, the first step is to construct a table with the similarities among all terms. To calculate the similarity between terms, uni-gram and tri-gram queries are made to the Google corpus for each pair of words. The corpus is structured as a hash map, and therefore the computational complexity of each query approaches O(1) [28]. The complexity of constructing the term-similarity table is \(O(T^2)\). Once the term-similarity matrix is created, the measures compute the document similarity between P and R. For the GTM measure, the worst case is when \(m = n = \frac{T}{2}\), that is, when P and R have the same number of terms and all of them are different. In this case, the GTM calculation between documents is performed in \(4 \cdot \left( \frac{T}{2}\right) ^2 + 3 \cdot \frac{T}{2}\) operations; thus, the asymptotic complexity can be reduced to \(O(T^2)\). Unlike GTM, the FGTM's worst case is when \(m = n = T\), that is, when the documents are identical. When using the greedy implementation strategy (Method 1), FGTM performs \(T^2\log {T} + T^2\) operations; thus, the asymptotic complexity is \(O(T^2\log {T})\). When using the relaxed implementation strategy (Method 2), FGTM performs \(2 \cdot T^2 + T\) operations; thus, its asymptotic complexity is \(O(T^2)\).
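The \(O(T^2\log {T})\) cost of the greedy strategy comes from sorting the \(m \cdot n\) candidate term pairs once and then scanning them, accepting a pair only when both terms are still unassigned. A minimal sketch, assuming a generic term-similarity function (the function and variable names are ours):

```python
def greedy_assignment(terms_p, terms_r, sim):
    """Greedy term assignment: sort all m*n candidate pairs by similarity
    (the O(T^2 log T) step), then accept each pair only if both terms
    are still free, so every term is assigned at most once."""
    pairs = sorted(((sim(p, r), p, r) for p in terms_p for r in terms_r),
                   reverse=True)
    used_p, used_r, assignment = set(), set(), {}
    for s, p, r in pairs:
        if p not in used_p and r not in used_r:
            used_p.add(p)
            used_r.add(r)
            assignment[p] = r
    return assignment
```

The scan after sorting is \(O(T^2)\), so the sort dominates, giving \(O(T^2\log {T})\) overall.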

B.3 Clustering algorithms complexity analysis

In the document clustering phase of this paper, we use the LDC and ELSDC algorithms. We also perform experiments using the consensus method in the LDC algorithm, in order to merge clusterings obtained by two different similarity measures.

In the LDC algorithm, the term-clustering and key-term selection phases have complexity \(O(NT'K^2I)\), where N and \(T'\) are the numbers of documents and terms in the dataset, respectively, K is the number of clusters, and I is the number of iterationsFootnote 15 of the fuzzy c-means algorithm [43]. Finding seed documents has complexity \(O(NT'K + NKI)\). Document clustering is performed in two steps: first, the document-centroids are computed, which has complexity \(O(NT'K)\); second, the distances from all N documents to all K document-centroids are measured. When Euclidean distance or cosine similarity is adopted, this step has complexity \(O(NT'K)\), which results in a total complexity of \(O(NT'K)\) for this phase [43]. Alternatively, when using GTM or the relaxed implementation of FGTM, the complexity of this phase rises to \(O(NT'^2K)\) in the worst case. For the greedy implementation of FGTM, the complexity is \(O(NT'^2\log {(T')}K)\) in the worst case.

In the ELSDC algorithm, the term-clustering and key-term selection phases remain the same as in LDC and have complexity \(O(NT'K^2I)\). Then, for each extracted key-term, the algorithm queries and extracts relevant Wikipedia concepts. This step has complexity \(O(T')\). Once done, the seed documents of the BOW and BOC models are found. In this phase, the BOW model has complexity \(O(NT'K + NKI)\) and the BOC model has complexity \(O(NC'K + NKI)\). In the next phase, the centroids for both models, BOW and BOC, are computed, and the distances from all N documents to the K centroids of BOW and BOC are measured. Considering the BOW model, the complexity is \(O(NT'K)\) with the Euclidean or cosine measures and \(O(NT'^2K)\) for GTM and the relaxed implementation of FGTM. For the greedy implementation of FGTM, the complexity is \(O(NT'^2\log {(T')}K)\). Considering the BOC model, the complexity is \(O(NC'K)\), where \(C'\) is the total number of concepts in the document set.

The consensus method is implemented as the last step of ELSDC.Footnote 16 Firstly, the consensus method finds the documents with the same label in both clusterings. This task has complexity O(N). After that, these documents are considered a training set for a Naive Bayes classifier, which is used to classify the remaining documents. The training and classification of the Naive Bayes have linear complexity, estimated as \(O(NT')\) [35]. Therefore, the asymptotic complexity of the consensus method is \(O(NT' + N)\), which is equivalent to \(O(NT')\).
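The consensus step can be sketched compactly. This is an illustration under assumptions, not the paper's code: the labels of the two clusterings are assumed to be already aligned (same label means same cluster), and the multinomial Naive Bayes is hand-rolled over term-count vectors to keep the example self-contained; all variable names are ours.

```python
import numpy as np

def consensus(labels_a, labels_b, X):
    """Consensus between two clusterings of the same N documents.
    X is the N x T' term-count matrix. Documents labeled identically
    in both clusterings (found in O(N)) form a training set for a
    multinomial Naive Bayes classifier, which relabels the rest;
    training and classification are O(NT')."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    agree = labels_a == labels_b                      # O(N) agreement scan
    train, rest = np.where(agree)[0], np.where(~agree)[0]
    classes = np.unique(labels_a[train])
    # Training: per-class term counts with Laplace smoothing -- O(NT').
    log_prior, log_like = {}, {}
    for c in classes:
        Xc = X[train][labels_a[train] == c]
        log_prior[c] = np.log(len(Xc) / len(train))
        counts = Xc.sum(axis=0) + 1.0                 # add-one smoothing
        log_like[c] = np.log(counts / counts.sum())
    labels = labels_a.copy()
    for i in rest:                                    # classification: O(NT')
        labels[i] = max(classes,
                        key=lambda c: log_prior[c] + X[i] @ log_like[c])
    return labels
```

For example, with two clusterings that agree on four of six documents, the two disputed documents are assigned to whichever cluster their term counts best fit.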

Based on these asymptotic complexities, the variants implemented in this work are analyzed as follows:

  1. Euclidean: this variant uses BOW preprocessing and clustering by Euclidean distance in the LDC algorithm. Complexity: \(N\bar{T} + NT' + NT'K^2I + 2NT'K + NKI = O(N\bar{T} + NT'K^2I)\), because \(NT'K^2I\) dominates \(NT'K\), NKI and \(NT'\).

  2. Cosine: this variant uses BOW preprocessing and clustering by cosine similarity in the LDC algorithm. It has the same complexity as the "Euclidean" variant, therefore: \(O(N\bar{T} + NT'K^2I)\).

  3. GTM: this variant uses BOW preprocessing and semantic clustering (GTM) in the LDC algorithm. Complexity: \(O(N \bar{T} + NT' + NT'K^2I + NT'K + NKI + NT'^2K) = O(N \bar{T} + NT'^2K)\), because \(T'\) is typically much larger than KI, so \(NT'^2K\) dominates \(NT'K^2I\), \(NT'K\), NKI and \(NT'\).

  4. FGTM (relaxed): this variant uses BOW preprocessing and semantic clustering (relaxed implementation of the FGTM) in the LDC algorithm. It has the same complexity as "GTM," therefore: \(O(N \bar{T} + NT'^2K)\).

  5. FGTM (greedy): this variant uses BOW preprocessing and semantic clustering (greedy implementation of the FGTM) in the LDC algorithm. Its complexity is obtained from that of the relaxed implementation by replacing the term \(T'^2\) with \(T'^2\log{T'}\), therefore: \(O(N \bar{T} + NT'^2\log{(T')}K)\).

  6. Euc \(+\) Wik: this variant uses BOW and BOC preprocessing. The clustering is done by Euclidean distance in the ELSDC algorithm. Complexity: \(O(N\bar{T} + NT' + N\bar{T}^2 + NC'+ NT'K^2I + T' + 2NT'K + 2NKI + 2NC'K + NT') = O(N\bar{T}^2 + NT'K^2I)\), because usually \(T' > C'\) and \(T' > I\).

  7. Euc \(+\) FGTM: this variant uses BOW preprocessing, clustering by Euclidean distance and by the semantic FGTM measure (considering the relaxed implementation) in the LDC algorithm. The results are merged by the consensus method. Complexity: \(O(N\bar{T} + NT' + NT'K^2I + 2NT'K + NKI + NT'^2K + NT') = O(N\bar{T} + NT'^2K)\).

  8. Cos \(+\) FGTM: this variant uses BOW preprocessing, clustering by cosine similarity and by the semantic FGTM measure (considering the relaxed implementation) in the LDC algorithm. The results are merged by the consensus method. The complexity is the same as that of "Euc \(+\) FGTM," therefore: \(O(N\bar{T} + NT'^2K)\).

  9. Wik \(+\) FGTM: this variant uses BOW and BOC preprocessing. The clustering is done by the semantic FGTM measure (considering the relaxed implementation) in the ELSDC algorithm. Complexity: \(O(N\bar{T} + NT' + N\bar{T}^2 + NC' + NT'K^2I + T' + NT'K + 2NKI + 2NC'K + NT'^2K + NT') = O(N\bar{T}^2 + NT'^2K)\).
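The dominance arguments above can be checked numerically. The dataset sizes below are hypothetical, chosen only for illustration: with them, the semantic \(NT'^2K\) term dwarfs the term-clustering cost \(NT'K^2I\) whenever \(T' > KI\), and the greedy FGTM adds a further \(\log T'\) factor.

```python
import math

# Hypothetical sizes (illustrative only): documents, distinct terms,
# clusters, and fuzzy c-means iterations.
N, T, K, I = 2000, 10_000, 5, 30

# Dominant cost term of each variant family.
terms = {
    "NT'K (Euclidean / cosine)":      N * T * K,
    "NT'K^2I (term clustering)":      N * T * K**2 * I,
    "NT'^2K (GTM / relaxed FGTM)":    N * T**2 * K,
    "NT'^2 log(T') K (greedy FGTM)":  N * T**2 * math.log(T) * K,
}
for name, cost in sorted(terms.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {cost:.2e}")
```

With these numbers the ranking is greedy FGTM, then relaxed FGTM/GTM, then term clustering, then Euclidean/cosine, consistent with the asymptotic analysis.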

Fig. 2: Comparison of the average NMI results obtained by the Greedy and Relaxed implementations of FGTM for all eighteen document sets

Appendix C: Supplementary results: relaxed strategy

Experimental results using FGTM were obtained with the two strategies described in Sect. 3.2: Greedy and Relaxed. Both methods produce very similar results, as illustrated in Fig. 2, where the NMI mean values and standard deviations for stand-alone FGTM are presented. Considering these values for each document set, the Greedy implementation outperforms the Relaxed implementation only for the cbr-ilp-ir-son and cbr-ilp-ir-son-int sets. For all other 16 document sets, the results are not significantly different. Thus, for the sake of simplicity and compactness of the paper, we reported only the clustering results obtained by the Greedy Strategy (Method 1) in Sect. 6.

Here, the results obtained by the Relaxed Strategy (Method 2) are presented. Table 5 shows the NMI mean and standard deviation values for the relaxed stand-alone FGTM and its combinations with Euclidean, cosine and BOC (using Wikipedia concepts) via the consensus method. The t-test was applied with \(95\%\) confidence to compare the results of the Relaxed Strategy (Table 5) with those of the Greedy Strategy (Table 4). When the null hypothesis is rejected, there is statistical evidence that the compared results differ, and the result is presented in bold in Table 5.
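A comparison of this kind can be reproduced from summary statistics alone. The sketch below assumes a two-sample test on the mean and standard deviation of NMI over 50 runs per strategy, using Welch's statistic with a normal approximation (adequate at this sample size); the numbers in the test are made up, not taken from Table 5.

```python
from statistics import NormalDist

def welch_t_significant(m1, s1, m2, s2, n=50, alpha=0.05):
    """Two-sided Welch's t-test from summary statistics: means m1, m2
    and standard deviations s1, s2 of two samples of n runs each.
    With n = 50 the t distribution is close to normal, so the normal
    critical value is used instead of the t quantile."""
    se = (s1**2 / n + s2**2 / n) ** 0.5      # standard error of the difference
    t = (m1 - m2) / se                       # Welch's statistic
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value (~1.96)
    return abs(t) > z
```

A large mean gap relative to the run-to-run spread is flagged as significant; a gap within the noise is not.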

Table 5 Mean and standard deviation of NMI values over 50 clustering runs for the Relaxed implementation of the FGTM measure and its combinations with Euclidean, cosine and Wikipedia concepts for each document set

The absolute NMI values are very similar for the vast majority of the datasets. For four datasets, shown in bold in Table 5, the Relaxed Strategy performed significantly worse than the Greedy Strategy according to the statistical test. Even so, the values obtained by the two strategies are close.


Cite this article

Soares, V.H.A., Campello, R.J.G.B., Nourashrafeddin, S. et al. Combining semantic and term frequency similarities for text clustering. Knowl Inf Syst 61, 1485–1516 (2019). https://doi.org/10.1007/s10115-018-1278-7
