
1 Introduction

The goal of text corpus clustering is to partition a collection of text documents into several groups, such that texts inside the same group (or cluster) are similar and share common themes or a common style, while documents in different clusters are very distinct in nature. To achieve this goal, text documents must first be transformed, using models such as the Vector Space Model (VSM) [20], into numerical vectors that can be used by clustering algorithms such as K-Means or hierarchical clustering. One difficulty with the VSM is the large number of existing methods to transform text documents into vector representations. Many representation models exist: some are topic oriented, others focus on word embeddings, while others are purely statistical. This abundance of methods in the literature allows for multiple vector representations of the same texts, all with different strengths and weaknesses. Applying clustering algorithms to these multiple representations can be seen as a multi-view clustering problem where the goal is to find a consensus between the clustering partitions produced under the various Vector Space Models [3]. Within this context, we propose in this paper a new method, inspired by collaborative clustering, which relies on the notion of Kolmogorov complexity to merge the clustering partitions obtained by applying clustering algorithms to different vector representations of text documents. Our proposed method is compared with state-of-the-art methods on common text corpora from the literature.
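To make this multi-view setup concrete, the sketch below (a minimal example assuming scikit-learn is available; the variable `documents` and the chosen parameters are ours for illustration) builds two vector representations of the same corpus and clusters each view independently, producing the kind of multiple partitions that this paper aims to merge.

```python
# Minimal sketch of the multi-view setup: two vector representations of the same
# corpus, each clustered independently with K-Means. Assumes scikit-learn;
# `documents` is a hypothetical list of raw texts.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

documents = [
    "the team won the championship final",
    "stocks fell sharply on monday morning",
    "the match ended in a goalless draw",
]
k = 2  # number of clusters per view

# View 1: TF-IDF representation
tfidf = TfidfVectorizer().fit_transform(documents)

# View 2: LDA topic proportions built on raw term counts
counts = CountVectorizer().fit_transform(documents)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)

# One partition per view; these partitions are the inputs of the merging step
partition_tfidf = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)
partition_lda = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(lda)
```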

The remainder of this paper is organized as follows: Sect. 2 reviews related work on both text mining and the information theory model used in this paper. Section 3 presents our algorithm. Section 4 features our experimental results and comparisons with other methods. Finally, in Sect. 5 we draw some conclusions and give some ideas on possible extensions of this work.

2 State of the Art

Cluster ensembles constitute a general framework in which multiple partitions are combined in order to obtain a consensus clustering. Multi-view clustering is one of the specific problems covered in this area [5].

The problem of combining multiple data partitions into a single one has been tackled by at least two approaches, namely Clustering Ensembles [4, 9, 10, 16, 21, 22, 24, 26] and Multi-View Clustering [3, 6,7,8, 11,12,13, 18, 25, 27], also known as data fusion.

In ensemble learning and ensemble clustering, several algorithms work on the same data set with the goal of achieving a single result that should be better than the partitions learned by the individual algorithms. As one can see, in ensemble clustering, several algorithms work on the same data and therefore on the same view. However, in the case of multi-view clustering, as in the present work, we have several algorithms and each of them works on a different view of the same data. Since we are dealing with several views, the goal of multi-view clustering is to merge them while taking into consideration that there might be multiple truths [30].

It is worth mentioning that the distinction between ensemble clustering and multi-view clustering is not always obvious in the literature, and some confusion may exist with different naming conventions depending on the field of application. In the following subsection, we give a quick review of the literature on both multi-view and ensemble clustering, with a particular focus on text mining applications and methods that are close to the one presented in this work.

2.1 State of the Art on Combining Multiple Clustering Partitions

There are many different applications that require combining multiple clustering partitions. In [9], the authors propose a music clustering method using partitions obtained from different music feature sets. Among these sets, they employ several word-level features. They pose the ensemble clustering problem as a binary clustering problem in a space induced by the multiple partitions. Additionally, they explore various optimization criteria for finding consensus partitions and propose a strategy for determining the final number of clusters. It is interesting to note that they apply this proposal for

In [7], the authors work specifically on text clustering. They propose to generate several partitions from each view by using different feature representations and then applying a clustering algorithm to each one. Then, similarity matrices are computed in three different ways, namely two based on partition memberships and one based on feature similarity. Finally, a combined similarity matrix is obtained from these three, and a standard clustering technique is applied to produce the consensus partition. In the same direction, [3] use more diverse text representations as views, more specifically LDA [1], Word2Vec [14] and TF-IDF [19], and then apply the same idea as the former work.

It is worth noting that multi-view text clustering should not be confused with distributed clustering of texts [28], which mainly consists in distributing the clustering task without consideration for whether or not it is a multi-view task.

Another common application of multi-view clustering is multilingual clustering. In [18], for this specific application, the authors pose the multi-view clustering problem as a tensor decomposition, an approach that had earlier been proven to be theoretically efficient [11, 12].

2.2 Methods to Combine Multiple Partitions

In [21], the authors state that one application of Cluster Ensembles is to combine partitions obtained from partial sets of features. As we have seen earlier, this is a case of multi-view clustering. Additionally, they argue that a motivation for using a cluster ensemble is to build a more robust solution that performs well over a wide range of data sets. Since the diversity of the base partitions has a positive impact on the final consensus solution, it can be introduced mainly by using different sets of features in each partition, different parameter configurations of the same algorithm (e.g. the value of k for k-means), or different and complementary base techniques. The authors also formulate consensus clustering as a hyper-graph cutting problem and solve it in three different ways.

Co-association matrices are based on the relative co-occurrence of two data points in the same cluster. They are another very common tool to tackle multi-view clustering, and several works exploit them in order to produce final partitions from combinations of different data representations. [4] explore two strategies for producing cluster ensembles: using different views, and using different clustering algorithms or parameter configurations. [26] address the problem as a similarity matrix completion problem in which missing values are associated with uncertain data pairs, that is, pairs of data points whose common membership is not consistent across partitions. Along the same lines, [16] propose to weight the contribution of each co-association matrix based on a novel reliability measure of each partition within the ensemble.

Some other contributions employ a utility function to measure the similarity between partitions and then directly maximize an objective function to obtain the consensus [10, 22, 24].

In [8], a hybrid clustering method based on a weighted linear combination of distance matrices for textual and bibliometric information is proposed.

2.3 Multi-view Clustering Applications and Kolmogorov Complexity

In the works of [17, 23], the notion of minimum description length (MDL) is introduced, the description length being the minimal number of bits needed by a Turing Machine to describe an object. This measure of the minimal number of bits is also known as the Kolmogorov complexity.

If \(\mathcal {M}\) is a fixed Turing machine, the complexity of an object x given another object y using the machine \(\mathcal {M}\) is defined as \(K_{\mathcal {M}}(\mathbf {x} | \mathbf {y}) = \min _{p \in \mathcal {P}_\mathcal {M}} \left\{ l(p) : p(\mathbf {y}) = \mathbf {x} \right\} \) where \(\mathcal {P}_\mathcal {M}\) is the set of programs on \(\mathcal {M}\), \(p(\mathbf {y})\) designates the output of program p with argument y and l measures the length (in bits) of a program. When the argument \(\mathbf {y}\) is empty, we use the notation \(K_{\mathcal {M}}(\mathbf {x})\) and call this quantity the complexity of \(\mathbf {x}\). The main problem with this definition is that the complexity depends on a fixed Turing machine \(\mathcal {M}\). Furthermore, the universal complexity is not computable, since it is defined as a minimum over all programs of all machines.

In relation with this work, in [15] the authors solved the aforementioned problem by fixing the Turing Machine before applying this notion of Kolmogorov complexity to collaborative clustering, which is a specific case of multi-view clustering where several clustering algorithms work together in a multi-view context but aim at improving each other's partitions rather than merging them [2]. While collaborative clustering does not aim at a consensus, this application is still very close to what we try to achieve in this paper, where we merge partitions of the same objects under multiple representations. For these reasons, we decided to use the same tool.

In the rest of this paper, just as the authors did in [15], we consider that the Turing Machine \(\mathcal {M}\) is fixed, and to simplify the equations we denote by \(K(\mathbf {x})\) the complexity of \(\mathbf {x}\) on the chosen machine. We then adapt the equations of the original paper to our multi-view text mining context and use Kolmogorov complexity as a tool to compute the complexity of one partition given another. The corresponding algorithm and how we use it are described in the next section.

3 Proposed Merging Method

3.1 Problem Definition

Let us consider a data set \(\mathcal {X}\) of n data points and a measure of similarity \(\mathsf {S}\) that quantifies the strength of the connection, or closeness, between any pair of data points in \(\mathcal {X}\). The problem of data clustering can be stated as inducing an equivalence relation on \(\mathcal {X}\) such that points \(\mathsf {a},\mathsf {b}\) in the same equivalence class (that is, the same cluster) have a larger similarity value \(\mathsf {S}(\mathsf {a},\mathsf {b})\) than \(\mathsf {S}(\mathsf {a},\mathsf {c})\) or \(\mathsf {S}(\mathsf {b},\mathsf {c})\) for any other point \(\mathsf {c}\) in a different equivalence class.

The multi-view clustering task considers that the information regarding each data point in \(\mathcal {X}\) comes from multiple sources called views. After applying a clustering algorithm to each view, several partitions are generated. Let us define this set of partitions as \(\mathcal {P}\), and denote each of them with a capital letter (e.g. \( A \)).

A partition \( A \) is a set of \(| A |\) disjoint subsets of \(\mathcal {X}\), called clusters. Let us define an agreement function \(\varOmega \) between two clusters as a mapping which attains lower values for clusters having a smaller overlap and higher values for clusters sharing more elements of \(\mathcal {X}\). In this work we employ the Jaccard similarity to measure the agreement between two clusters.

For a point \(\mathsf {p}\in \mathcal {X}\), its cluster in any partition \( A \in \mathcal {P}\) is denoted by \(\mathcal {N}_{\mathsf {p}}^{ A }\) and it is defined as:

$$\begin{aligned} \mathcal {N}_{\mathsf {p}}^{ A }=\{\mathsf {x}\in \mathcal {X} \mid \exists \mathbf {c}\in A : \mathsf {p}\in \mathbf {c}\wedge \mathsf {x}\in \mathbf {c}\} \end{aligned}$$

Given a cluster \(\mathbf {c}\) and a partition \( B \), the function that maps \(\mathbf {c}\) to the cluster in \( B \) with the largest overlap is called the maximum agreement function and is defined as follows:

$$\begin{aligned} \varPhi _ B (\mathbf {c})&=\mathop {\mathbf {argmax}}\limits _{\mathbf {e}\in B }\varOmega (\mathbf {c},\mathbf {e}) \end{aligned}$$
(1)
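As a minimal sketch of these definitions, assuming clusters are represented as Python sets of document identifiers (the helper names `jaccard` and `max_agreement` are ours):

```python
# Clusters are plain sets of document identifiers; a partition is a list of such sets.
def jaccard(c, e):
    """Agreement function Omega: Jaccard similarity between two clusters."""
    if not c and not e:
        return 0.0
    return len(c & e) / len(c | e)

def max_agreement(c, partition_b):
    """Phi_B(c): the cluster of partition B with the largest overlap with c (Eq. 1)."""
    return max(partition_b, key=lambda e: jaccard(c, e))

# Example: Phi_B maps {1, 2, 3} to the cluster of B sharing most of its points.
A = [{1, 2, 3}, {4, 5}]
B = [{1, 2}, {3, 4, 5}]
print(max_agreement(A[0], B))  # -> {1, 2}
```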

3.2 The Algorithm

Our goal in this paper is to combine several partitions in order to build a final consensus. To this end, our method performs successive pairwise fusions between partitions, following a bottom-up strategy, until a single partition remains. This procedure is depicted in Algorithm 1.

Without loss of generality, when a fusion step is performed between two partitions \( A \) and \( B \), a new partition \( C \) is created. Since the successive partition fusions follow the maximum agreement criterion between clusters, as stated in Eq. (1), it is possible that some data points do not fit this rule and are hence marked as exceptions during the merge operation. The set of data points marked as exceptions before the creation of partition \( C \) is denoted by \(\xi _{ C }\); formally,

$$\begin{aligned} \xi _{ C }=\{\mathsf {p}\in \mathcal {X} \mid \mathcal {N}_{\mathsf {p}}^{ A }\cap \varPhi _ B (\mathcal {N}_{\mathsf {p}}^{ A })=\emptyset \;\vee \; \mathcal {N}_{\mathsf {p}}^{ B }\cap \varPhi _ A (\mathcal {N}_{\mathsf {p}}^{ B })=\emptyset \} \end{aligned}$$
(2)
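A direct, illustrative transcription of Eq. (2) could look as follows, reusing the helpers from the previous sketch and assuming a hypothetical `cluster_of` helper that returns \(\mathcal {N}_{\mathsf {p}}^{ A }\):

```python
def cluster_of(p, partition):
    """N_p^A: the cluster of the given partition that contains point p."""
    return next(c for c in partition if p in c)

def exceptions(A, B, points):
    """Xi_C (Eq. 2): points whose cluster in one partition does not overlap
    with that cluster's maximum-agreement counterpart in the other partition."""
    xi = set()
    for p in points:
        n_a, n_b = cluster_of(p, A), cluster_of(p, B)
        if not (n_a & max_agreement(n_a, B)) or not (n_b & max_agreement(n_b, A)):
            xi.add(p)
    return xi
```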

Thus, when partition \( C \) is created, each point \(\mathsf {p}\in \xi _{ C }\) receives a weight \(W_ C (\mathsf {p},\mathbf {c})\) for every cluster \(\mathbf {c}\in C \). This weight is made up of the relative weights contributed by both source partitions \( A \) and \( B \), namely \(\omega _ A (\mathsf {p},\mathbf {c})\) and \(\omega _ B (\mathsf {p},\mathbf {c})\). Without loss of generality, the contribution of source partition \( A \) is given by:

$$\begin{aligned} \omega _ A (\mathsf {p},\mathbf {c}) = {\left\{ \begin{array}{ll} \varOmega (\mathbf {c},\mathcal {N}_{\mathsf {p}}^{ A }) &{}\quad \text {if }\mathsf {p}\notin \xi _{A} \\ \varOmega (\mathbf {c},\varPhi _ A (\mathbf {c}))\cdot W_ A (\mathsf {p},\varPhi _ A (\mathbf {c})) &{}\quad \text {if } \mathsf {p}\in \xi _{A}\\ \end{array}\right. } \end{aligned}$$
(3)

Thus, the final weight \(W_ C (\mathsf {p},\mathbf {c})\) for each point \(\mathsf {p}\in \xi _{C}\) in each cluster \(\mathbf {c}\in C \) is given by:

$$\begin{aligned} W_ C (\mathsf {p},\mathbf {c})=\frac{\omega _ A (\mathsf {p},\mathbf {c})}{2}+\frac{\omega _ B (\mathsf {p},\mathbf {c})}{2} \end{aligned}$$
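The sketch below mirrors Eq. (3) and the averaging rule above, under the assumption that the weights computed during earlier fusions are kept in dictionaries indexed by a point and a (frozen) cluster; these data structures and names are ours, not part of the original algorithm description:

```python
def omega(p, c, partition, xi, W):
    """Contribution of one source partition to the weight of point p for cluster c (Eq. 3).
    `xi` is that partition's exception set; `W` maps (point, frozenset(cluster)) to the
    weights computed when that partition was itself produced by an earlier fusion."""
    if p not in xi:
        return jaccard(c, cluster_of(p, partition))
    best = max_agreement(c, partition)                 # Phi_A(c)
    return jaccard(c, best) * W[(p, frozenset(best))]

def final_weight(p, c, A, B, xi_A, xi_B, W_A, W_B):
    """W_C(p, c): average of the contributions of the two source partitions."""
    return 0.5 * omega(p, c, A, xi_A, W_A) + 0.5 * omega(p, c, B, xi_B, W_B)
```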

A more detailed description of this merging process is given in Algorithm 2. It is important to note that once a point is marked as an exception, it remains so through all subsequent fusions. After the last fusion, each of these exception data points is assigned to one of the final clusters by picking the one for which its membership weight is the highest. This exception resolution is described in lines 7–9 of Algorithm 1, where K(A|B) is the Kolmogorov complexity of partition A knowing partition B [15]:

$$\begin{aligned} K(A|B) = K_B \times (\log K_A + \log K_B) + |\xi _{ C }| \times (\log n + \log K_A) \end{aligned}$$
(4)

with n the total number of points, \(K_A\) the number of clusters in partition A, \(K_B\) the number of clusters in partition B and \(\xi _{ C }\) the set of exceptions between partitions A and B as defined in Eq. (2).
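A one-line implementation of Eq. (4) is shown below; we assume base-2 logarithms, since the quantity is meant to count bits, and the function name is ours:

```python
import math

def conditional_complexity(k_a, k_b, n_exceptions, n):
    """K(A|B) as in Eq. (4): cost, in bits, of describing partition A given partition B,
    with k_a and k_b clusters, n data points and n_exceptions = |xi_C| exceptions."""
    return k_b * (math.log2(k_a) + math.log2(k_b)) + n_exceptions * (math.log2(n) + math.log2(k_a))
```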

Algorithm 1. Bottom-up pairwise fusion of the input partitions (overall procedure).
Algorithm 2. Merging two partitions into a new one, with exception handling.

4 Experimental Results

4.1 Experimental Settings

Since external class labels are available for each data set, let us denote the true clustering by \( T \) and the final partition obtained by the clustering algorithm by \( C \). Two measures are then employed to assess the quality of a clustering solution, namely Entropy and Purity. Entropy is defined in two parts: the former measures the Entropy of a single cluster and is characterized for any cluster \(\mathbf {c}\in C \) in Eq. (5); the latter is a weighted average of the entropy computed over all the clusters of the final solution and is defined in Eq. (6). Purity is defined in a similar way: first, the Purity of a single cluster is defined in Eq. (7), and then the overall Purity of the partition is given in Eq. (8).

$$\begin{aligned} \mathsf {E}(\mathbf {c}) = -\frac{1}{\log | T |}\sum _{\mathbf {t}\in T }\frac{|\mathbf {c}\cap \mathbf {t}|}{|\mathbf {c}|}\log \frac{|\mathbf {c}\cap \mathbf {t}|}{|\mathbf {c}|} \end{aligned}$$
(5)
$$\begin{aligned} \mathsf {Entropy}( C ) = \sum _{\mathbf {c}\in C }\frac{|\mathbf {c}|}{n} \mathsf {E}(\mathbf {c}) \end{aligned}$$
(6)
$$\begin{aligned} \mathsf {P}(\mathbf {c}) = \frac{1}{|\mathbf {c}|}\max _{\mathbf {t}\in T }|\mathbf {c}\cap \mathbf {t}| \end{aligned}$$
(7)
$$\begin{aligned} \mathsf {Purity}( C ) = \sum _{\mathbf {c}\in C }\frac{|\mathbf {c}|}{n} \mathsf {P}(\mathbf {c}) \end{aligned}$$
(8)

Entropy measures the degree to which the true classes are dispersed within each cluster. A good solution is one in which each cluster does not mix too many of the true classes. Purity measures the extent to which each cluster contains documents from mostly a single true class. Thus, a good solution should present clusters that are homogeneous in terms of the true classes of the documents they contain.
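The two measures can be computed as in the following sketch (assuming NumPy, class labels encoded as integers starting at 0, and at least two distinct true classes; the function name is ours):

```python
import numpy as np

def entropy_and_purity(clusters, classes):
    """Overall Entropy (Eq. 6) and Purity (Eq. 8) of a partition, given true class labels.
    `clusters` and `classes` are integer label arrays of equal length."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    n = len(clusters)
    n_classes = len(np.unique(classes))
    entropy, purity = 0.0, 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        counts = np.bincount(members, minlength=n_classes)
        probs = counts[counts > 0] / len(members)
        cluster_entropy = -(probs * np.log(probs)).sum() / np.log(n_classes)  # Eq. (5)
        cluster_purity = counts.max() / len(members)                          # Eq. (7)
        entropy += len(members) / n * cluster_entropy                         # Eq. (6)
        purity += len(members) / n * cluster_purity                           # Eq. (8)
    return entropy, purity
```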

Since the quality of the overall solution depends on the initial source k-Means clusterings, which have a random nature, we follow the scheme presented in [29] to reduce this sensitivity in the performance assessment. That is, we use several values of k and, for each value, the overall clustering procedure is repeated 10 times and the best-performing solution is kept. Additionally, since partition quality improves as the number of clusters increases, relative performances are reported for each clustering solution. To compute the relative entropy, we divide the entropy attained by a particular solution by the smallest entropy obtained for that particular data set and value of k. For the relative purity, and in order to allow the same interpretation as for the relative entropy, we divide the best purity attained for that particular data set and value of k by the purity obtained by the clustering solution under evaluation. Since these two ratios represent the extent to which a specific algorithm performed worse than the best one, for each data set better solutions are closer to 1.0, and solutions get worse as the ratios grow beyond 1.0. Finally, as a performance summary, the average relative performance across all data sets is reported for each clustering solution.
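A short sketch of this relative-performance computation, assuming the best entropy and purity over the 10 runs are already stored per solution in plain dictionaries (the names are ours):

```python
def relative_scores(entropies, purities):
    """Relative entropy and purity for one data set and one value of k.
    `entropies` / `purities` map solution names to their best score over 10 runs.
    Values close to 1.0 are best; larger values mean worse performance."""
    best_entropy = min(entropies.values())
    best_purity = max(purities.values())
    rel_entropy = {name: e / best_entropy for name, e in entropies.items()}
    rel_purity = {name: best_purity / p for name, p in purities.items()}
    return rel_entropy, rel_purity
```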

4.2 Results and Interpretations

Tables 1, 2, 3 and 4 show the relative performances attained by our proposal, by each source clustering, and by another ensemble method recently proposed in [3].

Table 1. Average relative entropy
Table 2. Relative entropy

As we can see from Tables 1 and 2, the results on the relative entropy show that our proposed method achieves significantly better results than the method of Fraj et al. [3] on the same data sets.

Table 3. Average relative purity

Going into more detail, Table 2 shows that overall the TF-IDF view first and the LDA view second achieve the best results in terms of entropy and are used as baselines for the relative entropy. We can also see that for many data sets our proposed method is not only close to the best entropy result, but achieves better results on average than the three original LDA, Skip-gram and TF-IDF views, and always better results than the method of Fraj et al.

Since each view may hold its own truth, it is only logical that we rarely achieve fusion results that are better than all the original views. This is a common problem in multi-view clustering [30] and should be considered normal. Nevertheless, it is worth mentioning that our proposed method still achieves the best results in terms of relative entropy in the case of the BBCSport data set with 15 clusters.

Table 4. Relative purity

From Tables 3 and 4, we can see that the results in terms of purity follow the same pattern as those obtained with entropy, which allows us to affirm that our proposed method proved superior to the one of Fraj et al. on all data sets regardless of the number of clusters.

As with entropy, we rarely achieve the best results among the views, but we still do better than the average of the three original views, and Table 3 shows that our algorithm remains very competitive even when compared to the best view.

The best performances of our proposed algorithm in terms of relative purity are obtained on the BBCSport data set with 15 clusters, Reuters-R8 with 15 clusters and WebKB with 10 clusters. In all three cases, we not only get better results than the other methods from the literature, but we also do better than the best views in terms of relative purity.

5 Conclusion and Future Works

We have presented a new clustering fusion method applied to the case of multi-view text corpus clustering. Our method was applied to 4 data sets that are very common in the literature (20Newsgroup, Reuters-R8, WebKB and BBCSport) and proved to be competitive with state-of-the-art methods. Unlike previously proposed methods, our algorithm relies on the notion of Kolmogorov complexity and information compression, thus giving it a solid theoretical background on how to best merge the clustering partitions.

In future work, we plan to couple our proposed method with existing collaborative methods, so that we first have a collaborative step and then a merging step. We hope that doing so may help detect incompatible or noisy views, and could also ease the merging process by producing closer partitions with collaborative clustering beforehand. Other possible extensions of this work include applications to merging multi-view clustering partitions in fields other than text mining and natural language processing.