An overlapping cluster algorithm to provide non-exhaustive clustering

https://doi.org/10.1016/j.ejor.2005.06.056

Abstract

Partitioning clustering is a technique for classifying n objects into k disjoint clusters; it has been developed over many years and is widely used in many applications. In this paper, a new overlapping cluster algorithm is defined. It differs from traditional clustering algorithms in three respects. First, the new clustering is overlapping, because clusters are allowed to overlap with one another. Second, the clustering is non-exhaustive, because an object is permitted to belong to no cluster. Third, the goals considered in this research are the maximization of the average number of objects contained in a cluster and the maximization of the distances among cluster centers, whereas the goals in previous research are the maximization of the similarities of objects in the same cluster and the minimization of the similarities of objects in different clusters. Furthermore, the new clustering also differs from traditional fuzzy clustering, because the object–cluster relationship in the new clustering is represented by a crisp value rather than by a fuzzy membership degree. Accordingly, a new overlapping partitioning cluster (OPC) algorithm is proposed to provide overlapping and non-exhaustive clustering of objects. Finally, several simulated and real-world data sets are used to evaluate the effectiveness and efficiency of the OPC algorithm, and the outcomes indicate that the algorithm can generate satisfactory clustering results.

Introduction

Clustering is an important technique for information retrieval, data mining, pattern recognition, and image segmentation [8]. Because of its widespread usage, many variants of clustering methods have been proposed. A recent study [4] classified clustering methods into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Each category has its own constraints and features that suit particular scenarios. This paper focuses on partitioning cluster methods; hence the basic assumptions and constraints of this line of work are discussed.

The partitioning methods cluster n objects into k clusters, where k is specified by the user. Each object can be defined by multiple attributes. Many definitions of distance have been proposed to measure the similarity between two objects, of which the Euclidean distance is perhaps the most popular. The greater the distance between two objects, the more dissimilar they are.
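As a minimal sketch of the distance measure described above, the following computes the Euclidean distance between two objects represented as equal-length tuples of numerical attributes. The function name and the sample customers are hypothetical and chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two objects given as equal-length attribute sequences."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by two numerical attributes each.
alice = (30.0, 4.0)
bob = (33.0, 8.0)
carol = (60.0, 1.0)

print(euclidean_distance(alice, bob))  # 5.0
# carol is farther from alice than bob is, so she is the more dissimilar customer.
print(euclidean_distance(alice, carol) > euclidean_distance(alice, bob))  # True
```

In practice, attributes measured on very different scales would usually be normalized before computing distances; that step is omitted here.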

The following are the basic assumptions of the traditional partitioning cluster methods:

  • One object can only belong to one cluster.

  • Each object is assigned to the nearest cluster.

The well-known K-Means and K-Medoids algorithms were developed based on the assumptions above. Since one object can only belong to one cluster, the K-Means and K-Medoids methods assign an object to the nearest cluster. These algorithms are designed to find clusters of objects such that the sum of all distances from objects to their cluster centers is as small as possible. The K-Means and K-Medoids algorithms have been successfully used in numerous applications [5], [11], [12], [16].
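The two traditional assumptions and the objective just described can be illustrated with a short sketch. It shows only the assignment step and the objective value for fixed centers, not a full K-Means or K-Medoids implementation; the function names and the data are hypothetical.

```python
import math

def assign_to_nearest(objects, centers):
    """Return, for each object, the index of its nearest center (one cluster per object)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(obj, centers[j]))
        for obj in objects
    ]

def total_distance(objects, centers, labels):
    """Sum of distances from each object to the center of its assigned cluster."""
    return sum(math.dist(obj, centers[lab]) for obj, lab in zip(objects, labels))

objects = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

labels = assign_to_nearest(objects, centers)
print(labels)                                    # [0, 0, 1, 1]
print(total_distance(objects, centers, labels))  # 4.0
```

Every object receives exactly one label here, which is the property the overlapping formulation relaxes.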

In this paper, the fundamental assumption of the traditional partitioning clustering algorithm is extended so that each object can now belong to multiple clusters rather than a single cluster. In other words, these clusters may overlap with one another. The reason behind this extension is that, in some circumstances, it is not appropriate to let an object belong to a single cluster. For example, documents can be partitioned into a number of clusters, where each cluster represents the documents in a certain area. Since a document may be related to several areas, it is natural for a document to belong to more than one cluster. Moreover, overlapping clustering differs from fuzzy clustering, since the relationship between object and cluster is crisp rather than represented by a fuzzy membership degree.

In a typical clustering algorithm such as K-Means and K-Medoids, the goal is twofold: one is to maximize the dissimilarity of objects in different clusters and the other is to maximize the similarity of objects in the same cluster. In the new overlapping cluster algorithm, these two goals are changed as follows.

Firstly, the dissimilarity of objects in different clusters will no longer be considered. Instead, the dissimilarity among cluster centers is considered. In practice, a cluster center can represent a typical pattern of objects in that particular group. For example, in a cluster of documents the cluster center represents the typical document in that group. Similarly, in a cluster of customers, the cluster center is the typical customer that best represents the purchasing behavior of this market segment. Therefore, if the centers of clusters are kept far away from each other, then the representative patterns of these clusters will be more distinguishable. In other words, since each center represents the core concept of a group, keeping these centers distant makes it possible to identify a set of core concepts that are clearly separated and have clear semantics.
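A rough way to see what "keeping centers far apart" means is to compare the pairwise distances of two candidate sets of centers. The excerpt does not state the exact separation measure used by the OPC algorithm, so the use of the minimum pairwise distance below is an assumption made for illustration only; the function name and data are hypothetical.

```python
import math
from itertools import combinations

def pairwise_center_distances(centers):
    """Euclidean distances between every unordered pair of cluster centers."""
    return [math.dist(a, b) for a, b in combinations(centers, 2)]

# Two hypothetical sets of three centers: one well separated, one crowded.
spread = [(0.0, 0.0), (6.0, 0.0), (0.0, 8.0)]
crowded = [(0.0, 0.0), (0.6, 0.0), (0.0, 0.8)]

# Assumption: separation is summarized by the smallest pairwise distance.
print(min(pairwise_center_distances(spread)))  # 6.0
print(min(pairwise_center_distances(spread)) > min(pairwise_center_distances(crowded)))  # True
```

Under this measure the first set would be preferred, since its representative patterns are easier to tell apart.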

Secondly, the similarity of objects in the same cluster will no longer be considered. Instead, a threshold is set such that all objects whose distance from a cluster center is no more than the threshold belong to this cluster. In this way, an object can simultaneously belong to multiple clusters if the distances from this object to all these centers are no more than the given threshold. But if an object is far away from all cluster centers, it is not included in any cluster, which results in a non-exhaustive clustering. Furthermore, the algorithm attempts to maximize the number of objects contained in a cluster, because the more general the concept of a class is, the more objects the class can cover.
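The threshold rule can be sketched as follows for a fixed set of centers. This is not the OPC algorithm itself, which also has to search for good centers; it only illustrates how a single threshold yields crisp, overlapping, and non-exhaustive memberships, and how the average number of objects per cluster can be computed. All names and data are hypothetical.

```python
import math

def threshold_membership(objects, centers, threshold):
    """For each object, list the indices of all centers within `threshold` of it."""
    return [
        [j for j, c in enumerate(centers) if math.dist(obj, c) <= threshold]
        for obj in objects
    ]

def average_cluster_size(memberships, k):
    """Average number of objects per cluster; an object in two clusters counts twice."""
    return sum(len(m) for m in memberships) / k

objects = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (20.0, 0.0)]
centers = [(1.0, 0.0), (3.0, 0.0)]

memberships = threshold_membership(objects, centers, threshold=1.5)
print(memberships)                                      # [[0], [0, 1], [1], []]
print(average_cluster_size(memberships, len(centers)))  # 2.0
```

The second object lies within the threshold of both centers and so belongs to both clusters, while the last object is too far from every center and belongs to none.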

Finally, this research proposes a new algorithm to solve the overlapping cluster problem. Section 2 gives a review of existing partitioning methods. Section 3 presents the overlapping partitioning cluster (OPC) algorithm. In Section 4, both synthetic and real data sets are used to evaluate the proposed algorithm, and the results obtained from the proposed clustering algorithm are compared with those obtained from the traditional partitioning methods. Section 5 presents the conclusion and future research areas.

Section snippets

Partitioning cluster methods

As pointed out by [8], the methods in partitioning clustering can be roughly classified into two major approaches: the K-Means approach and the K-Medoids approach. Both use the distance among the objects to evaluate the similarity of the objects. The basic idea is to assign each object to the nearest cluster so as to minimize the intra-cluster distance of objects. The user must choose k, the number of clusters.

The K-Means algorithm performs the following steps. First, it randomly chooses k

The overlapping partitioning cluster algorithm

The basic idea of partitioning clustering is to group similar objects into a cluster. A set of numerical attributes is associated with each object. For example, age, income, monthly spending, average consumption amount and other attributes can describe a customer. Since the Euclidean distance is a well-known way to compute distance, the paper assumes that the distance between objects is computed using the Euclidean formula. However, since there are other variants of distance definition

Evaluation

The evaluation of the OPC algorithm contains two parts. The first part compares OPC with the traditional partitioning methods, K-Means and K-Medoids, using synthetic data sets. The second part evaluates OPC by using real data sets, the Abalone data set and the Telecom data set.

In the experiment, the OPC, K-Means and K-Medoids algorithms were implemented in C and tested on a Celeron(R) CPU Windows XP system with 1024 megabytes of main memory. Since the objective of the K-Means and K

Conclusion

This paper has introduced a new overlapping clustering algorithm, which partitions n objects into k non-exhaustive clusters that may overlap with each other. The new algorithm differs from the traditional clustering problem in three respects. First, these clusters are allowed to overlap with each other. Second, an object is permitted to belong to no cluster. Third, the goals of the new algorithm, including maximizing the distance among cluster centers and maximizing the average number of

Acknowledgements

The work was supported in part by the MOE Program for Promoting Academic Excellence of Universities under the Grant Number 91-H-FA07-1-4. We express our gratitude to three anonymous referees for their many helpful and pinpointed suggestions.

