Pattern Recognition Letters

Volume 31, Issue 6, 15 April 2010, Pages 469-477
Dynamic hierarchical algorithms for document clustering

https://doi.org/10.1016/j.patrec.2009.11.011

Abstract

In this paper, two clustering algorithms called dynamic hierarchical compact and dynamic hierarchical star are presented. Both methods aim to construct a cluster hierarchy while dealing with dynamic data sets. The first creates disjoint hierarchies of clusters, while the second obtains overlapped hierarchies. The experimental results on several benchmark text collections show that these methods are not only suitable for producing hierarchical clustering solutions in dynamic environments effectively and efficiently, but also yield hierarchies that are easier to browse than those of traditional algorithms. Therefore, we advocate their use for tasks that require dynamic clustering, such as information organization, creation of document taxonomies and hierarchical topic detection.

Introduction

The World Wide Web and the number of text documents managed in organizational intranets continue to grow at an amazing speed. Managing, accessing, searching and browsing large repositories of text documents require efficient organization of the information. In dynamic information environments, such as the World Wide Web or the stream of newspaper articles, it is usually desirable to apply adaptive methods for document organization such as clustering.

Static clustering methods mainly rely on having the whole collection ready before applying the algorithm. Unlike them, incremental methods are able to process new data as they are added to the collection. In addition, dynamic algorithms have the ability to update the clustering when data are added to or removed from the collection. These algorithms allow us to dynamically track the ever-changing, large-scale information being put on or removed from the web every day, without having to perform a complete reclustering.
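As a rough illustration of what updating without complete reclustering can mean, the sketch below maintains a thresholded similarity graph under insertions and deletions, touching only the edges of the affected document. The class name `DynamicSimilarityGraph`, the cosine measure over term-weight dictionaries and the fixed threshold are assumptions made for this example; they are not taken from the algorithms presented in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DynamicSimilarityGraph:
    """Hypothetical helper: an undirected graph linking documents whose
    similarity reaches a threshold, updated one document at a time."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.docs = {}  # doc_id -> sparse vector
        self.adj = {}   # doc_id -> set of neighbouring doc_ids

    def add(self, doc_id, vector):
        # Only similarities involving the new document are computed.
        neighbours = {
            other for other, vec in self.docs.items()
            if cosine(vector, vec) >= self.threshold
        }
        self.docs[doc_id] = vector
        self.adj[doc_id] = neighbours
        for other in neighbours:
            self.adj[other].add(doc_id)

    def remove(self, doc_id):
        # Only the edges incident to the removed document are deleted.
        for other in self.adj.pop(doc_id):
            self.adj[other].discard(doc_id)
        del self.docs[doc_id]
```

A static method would rebuild the whole graph from scratch after every change, and a purely incremental one would offer `add` but not `remove`.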

Hierarchical clustering algorithms have an additional interest, because they provide data-views at different levels of abstraction, making them ideal for people to visualize and interactively explore large document collections. Besides, clusters very often include sub-clusters, and the hierarchical structure is indeed a natural constraint on the underlying application domain.

In the context of hierarchical document clustering, six major challenges must be addressed: (1) Very high dimensionality of the data: the computational complexity should be linear with respect to the number of dimensions (terms). (2) Very large size of text collections: the algorithms must be efficient and scalable to large data sets. (3) Documents often cover several topics: it is important to avoid confining each document to a single cluster, so overlapping between document clusters should be allowed. (4) Dynamic data sets: the algorithms must be able to update the hierarchy when documents arrive (or are removed). (5) Insensitivity to the input order: the generated set of clusters must be unique, independent of the arrival order of the documents. This is one of the major issues in incremental and dynamic algorithms. (6) The number of clusters is unknown prior to clustering: it is difficult to specify a reasonable level of the hierarchy in advance, so it makes more sense to let the clustering algorithm determine it by itself.
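Challenge (5) can be made concrete with a small check. The sketch below is illustrative only: `leader_clustering` is a deliberately naive incremental scheme on one-dimensional points (not one of the algorithms discussed in this paper), and `is_order_insensitive` is a hypothetical helper that runs a clustering function on every permutation of a small input and compares the resulting sets of clusters.

```python
from itertools import permutations


def leader_clustering(items, radius):
    """Naive incremental clustering: each item joins the first cluster
    whose leader lies within `radius`, otherwise it starts a new cluster."""
    clusters = []  # list of (leader, members) pairs
    for x in items:
        for leader, members in clusters:
            if abs(x - leader) <= radius:
                members.add(x)
                break
        else:
            clusters.append((x, {x}))
    return [members for _, members in clusters]


def canonical(clusters):
    """Order-free representation of a clustering."""
    return frozenset(frozenset(c) for c in clusters)


def is_order_insensitive(cluster_fn, items):
    """True if every arrival order of `items` yields the same clusters.
    Exhaustive, so only practical for very small inputs."""
    results = {canonical(cluster_fn(list(p))) for p in permutations(items)}
    return len(results) == 1
```

For the points 0, 1 and 2 with radius 1, the arrival order 0, 1, 2 yields the clusters {0, 1} and {2}, whereas the order 1, 0, 2 yields the single cluster {0, 1, 2}, so `is_order_insensitive(lambda xs: leader_clustering(xs, 1), [0, 1, 2])` returns `False`.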

Hierarchical clustering algorithms fall into two general categories, agglomerative and divisive, and both have been applied to document clustering. UPGMA (Jain and Dubes, 1988) among the agglomerative algorithms and Bisecting K-Means (BKM) (Steinbach et al., 2000) among the divisive methods are reported to be the most accurate in their respective categories (Li et al., 2008). However, these hierarchical methods can neither deal with dynamic data sets nor allow overlapping between clusters.

There are some incremental algorithms that update the cluster hierarchy when new documents arrive, such as DC-tree (Wai-chiu and Wai-chee Fu, 2000) and IHC (Widyantoro and Yen, 2002). They are based on a tree structure and obtain disjoint document hierarchies. In DC-tree the assignment of documents to clusters is irrevocable, whereas IHC is relatively insensitive to the input order. DC-tree also defines several parameters, which makes its tuning problematic.

On the other hand, several static hierarchical algorithms have been proposed for overlapped clustering of documents, including HFTC (Beil et al., 2002), Malik’s method (Malik and Kender, 2006) and HSTC (Maslowska, 2003). HFTC and Malik’s algorithm address hierarchical document clustering using the notion of frequent itemsets: each cluster consists of the set of documents containing all terms of a frequent term set. The HSTC algorithm provides a methodology for organizing the base clusters identified by the STC algorithm (Zamir and Etzioni, 1998) into a navigable hierarchy, where a base cluster consists of a set of documents that share a common phrase. Like STC, HSTC has a quite high time complexity with respect to the number of terms. In our previous work (Gil-García et al., 2006), we presented a static framework for agglomerative hierarchical clustering based on graphs. From this framework we derived the hierarchical star algorithm, which obtains overlapped cluster hierarchies. This method uses a cover routine based on a greedy heuristic that takes into account the number of non-covered neighbors of each document, but it is not able to deal with dynamic data.
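The idea of a greedy cover driven by the number of non-covered neighbors can be sketched as follows. This is a simplified illustration rather than the cover routine of Gil-García et al. (2006): the function name `greedy_star_cover`, the choice to count the candidate center itself among the non-covered vertices, and the tie-breaking by smallest identifier are all assumptions made here. The input is a symmetric adjacency mapping in which every document appears as a key.

```python
def greedy_star_cover(adj):
    """Cover the graph with stars (a center plus all of its neighbours).

    At each step the vertex whose star contains the most still-uncovered
    vertices becomes a center. Stars may overlap, because the neighbours
    of a new center can already belong to earlier stars.
    """
    uncovered = set(adj)
    stars = []
    while uncovered:
        def gain(v):
            # Number of uncovered vertices the star of v would cover.
            return len((adj[v] | {v}) & uncovered)

        # Largest gain wins; ties go to the smallest identifier so the
        # result does not depend on dictionary insertion order.
        center = min(sorted(adj), key=lambda v: -gain(v))
        star = adj[center] | {center}
        stars.append((center, star))
        uncovered -= star
    return stars
```

On the path graph a-b, a-c, c-d, d-e this selects the centers `a` and `d`, and document `c` ends up in both stars, which is the kind of overlap that a partition into disjoint clusters cannot express.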

To the best of our knowledge, there are no hierarchical algorithms for document clustering that both process dynamic data and produce overlapped clusters.

In this paper, we present a dynamic hierarchical agglomerative framework for document clustering. This approach attempts to address the challenges mentioned above. We also present two specific algorithms obtained from the proposed framework: dynamic hierarchical compact and dynamic hierarchical star. The first creates disjoint hierarchies of clusters, while the second produces overlapped hierarchies. The experimental results on several benchmark text collections show that these methods are not only suitable for producing hierarchical clustering solutions in dynamic environments effectively and efficiently, but also yield hierarchies that are easier to browse than those of traditional algorithms.

The remainder of the paper is organized as follows: Section 2 describes the dynamic hierarchical framework. Section 3 presents the two methods derived from the framework. The comparison with traditional hierarchical algorithms is shown in Section 4. Finally, conclusions are presented in Section 5.

Section snippets

Dynamic hierarchical agglomerative framework

Our framework is an agglomerative method based on graphs. It is a dynamic version of the static hierarchical framework introduced in (Gil-García et al., 2006). It uses a multi-layered clustering to update the hierarchy when new documents arrive (or are removed). The granularity increases with the layer of the hierarchy, with the top layer being the most general and the leaf nodes being the most specific. The process in each layer involves two steps: construction of a graph and obtaining a cover

Specific algorithms

In this paper, we present two specific algorithms obtained from the above-mentioned framework. The first method is dynamic hierarchical compact (DHC). It uses the connected component cover and therefore obtains disjoint hierarchies. The second method is dynamic hierarchical star (DHS). In this case, a star cover is proposed for obtaining overlapped hierarchies.
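To see why a connected component cover yields disjoint clusters, consider the minimal sketch below. It computes the connected components of an undirected graph given as a symmetric adjacency mapping; the function name `connected_component_cover` and the input representation are assumptions for illustration, and the sketch covers a single layer only, not the full DHC algorithm or its dynamic updates.

```python
from collections import deque


def connected_component_cover(adj):
    """Return the connected components of the graph as a list of sets.

    Every vertex is reached from exactly one starting point, so the
    resulting clusters are pairwise disjoint by construction.
    """
    seen = set()
    clusters = []
    for start in sorted(adj):
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for n in adj[v]:
                if n not in seen:
                    seen.add(n)
                    component.add(n)
                    queue.append(n)
        clusters.append(component)
    return clusters
```

A star cover, by contrast, may assign a vertex that is adjacent to several centers to more than one cluster, which is how overlapped hierarchies arise.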

Experimental results

The performance of the proposed algorithms has been evaluated using 15 benchmark text collections, whose general characteristics are summarized in Table 1. They are heterogeneous in terms of document size, number of topics and document distribution. Human annotators identified the topics in each collection. Notice that the topics of the first five collections are overlapped. The overlapping degree is defined as the average number of topics in which a document is included. The manually

Conclusion

In this paper, two clustering algorithms called dynamic hierarchical compact and dynamic hierarchical star have been proposed. Their most important novelty is the capability to handle dynamic data sets. DHS also builds overlapped cluster hierarchies. Other key features of the proposed algorithms are the insensitivity to the input order, a well-defined stop condition and linear computational complexity w.r.t. the number of dimensions.

The experiments were conducted on 15 benchmark text collections.

References (21)

  • Y. Li et al.

    Text document clustering based on frequent word meaning sequences

    Data Knowl. Eng.

    (2008)
  • E. Amigó et al.

    A comparison of extrinsic clustering evaluation metrics based on formal constraints

    Inform. Retrieval.

    (2009)
  • J. Aslam et al.

    Static and dynamic information organization with star clusters

  • Beil, F., Ester, M., Xu, X., 2002. Frequent term-based text clustering. In: KDD 2002. ACM Press, pp....
  • M. Bruynooghe

    Classification ascendante hiérarchique, des grands ensembles de données: Un algorithme rapide fondé sur la construction des voisinages réductibles

    Les Cahiers de l’Analyse de Données

    (1978)
  • Fung, B., Wang, K., Ester, M., 2003. Hierarchical document clustering using frequent itemsets. In: Third SIAM Internat....
  • Gil-García, R., Badía-Contelles, J.M., Pons-Porrata, A., 2003. Extended star clustering algorithm. In: CIARP 2003....
  • R. Gil-García et al.

    A general framework for agglomerative hierarchical clustering algorithms

  • A. Jain et al.

    Algorithms for Clustering Data

    (1988)
  • Karypis, G., 2002. Cluto 2.0 clustering toolkit....
There are more references available in the full text version of this article.
