An overlapping cluster algorithm to provide non-exhaustive clustering
Introduction
Clustering is an important technique for information retrieval, data mining, pattern recognition, and image segmentation [8]. Because of its widespread use, many variants of clustering methods have been proposed. A recent study [4] classified clustering methods into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Each category has its own constraints and features that suit it to particular scenarios. This paper focuses on partitioning methods; hence the basic assumptions and constraints of this line of work are discussed.
Partitioning methods cluster n objects into k clusters, where k is specified by the user. Each object can be described by multiple attributes. Many distance definitions have been proposed to measure the similarity between two objects, of which the Euclidean distance is perhaps the most popular. The farther apart two objects are, the more dissimilar they are.
The following are the basic assumptions of the traditional partitioning cluster methods:
- One object can only belong to one cluster.
- Each object is assigned to the nearest cluster.
The well-known K-Means and K-Medoids algorithms were developed based on the assumptions above. Since one object can only belong to one cluster, the K-Means and K-Medoids methods assign an object to the nearest cluster. These algorithms are designed to find clusters of objects so that the sum of all distances from objects to their cluster centers should be as small as possible. The K-Means and K-Medoids algorithms have been successfully used in numerous applications [5], [11], [12], [16].
In this paper, the fundamental assumption of the traditional partitioning clustering algorithm is extended so that each object can now belong to multiple clusters rather than a single cluster. In other words, these clusters may overlap with one another. The reason behind this extension is that, in some circumstances, it is not appropriate to let an object belong to a single cluster. For example, documents can be partitioned into a number of clusters, where each cluster represents the documents in a certain area. Since a document may be related to several areas, it is natural for a document to belong to more than one cluster. Moreover, overlapping clustering differs from fuzzy clustering, since the relationship between an object and a cluster is crisp rather than represented by a fuzzy membership degree.
In a typical clustering algorithm such as K-Means and K-Medoids, the goal is twofold: one is to maximize the dissimilarity of objects in different clusters and the other is to maximize the similarity of objects in the same cluster. In the new overlapping cluster algorithm, these two goals are changed as follows.
Firstly, the dissimilarity of objects in different clusters will no longer be considered. Instead, the dissimilarity among cluster centers is considered. In practice, a cluster center can represent a typical pattern of the objects in that particular group. For example, in a cluster of documents the cluster center represents the typical document in that group. Similarly, in a cluster of customers, the cluster center is the typical customer that best represents the purchasing behavior of this market segment. Therefore, if the cluster centers are kept far away from each other, the representative patterns of these clusters will be more distinguishable. In other words, since each center represents the core concept of a group, keeping these centers distant identifies a set of core concepts that are clearly separated and carry clear semantics.
Secondly, the similarity of objects in the same cluster will no longer be considered. Instead, a threshold is set such that all objects whose distances from a cluster center are no more than the threshold belong to this cluster. In this way, an object can simultaneously belong to multiple clusters if its distances to all of those centers are no more than the given threshold. But if an object is far away from all cluster centers, it is not included in any cluster, and this results in a non-exhaustive clustering. Furthermore, it is attempted to maximize the number of objects contained in a cluster, because the more general the concept of a class is, the more objects the class can cover.
Finally, this research will propose a new algorithm to solve the overlapping cluster problem. Section 2 will give a review of existing partitioning methods. Section 3 presents the overlapping partitioning cluster (OPC) algorithm. In Section 4, both synthetic data sets and real data sets are used to evaluate the proposed algorithm. Comparisons are made between the results obtained from the suggested clustering algorithm with those obtained from the traditional partition methods. Section 5 presents the conclusion and the future research areas.
Section snippets
Partitioning cluster methods
As pointed out by [8], partitioning clustering methods can be roughly classified into two major approaches: the K-Means approach and the K-Medoids approach. Both use the distance among objects to evaluate their similarity. The basic idea is to assign each object to the nearest cluster so as to minimize the intra-cluster distance of objects. The user must specify k, the number of clusters.
The K-Means algorithm performs the following steps. First, it randomly chooses k
The overlapping partitioning cluster algorithm
The basic idea of partitioning clustering is to group similar objects into a cluster. A set of numerical attributes is associated with each object; for example, age, income, monthly spending, average consumption amount, and other attributes can describe a customer. Since the Euclidean distance is a well-known distance measure, this paper assumes that the distance between objects is computed by the Euclidean formula. However, since there are other variants of distance definition
Evaluation
The evaluation of the OPC algorithm contains two parts. The first part compares OPC with the traditional partitioning methods, K-Means and K-Medoids, using synthetic data sets. The second part evaluates OPC by using real data sets, the Abalone data set and the Telecom data set.
In the experiment, the OPC, K-Means and K-Medoids algorithms were implemented in C and tested on a Celeron(R) CPU Windows-XP system with 1024 megabytes of main memory. Since the objective of the K-Means and K
Conclusion
This paper has introduced a new overlapping clustering algorithm, which partitions n objects into k non-exhaustive clusters that may overlap with each other. The new algorithm differs from the traditional clustering problem in three respects. First, the clusters are allowed to overlap with each other. Second, an object is permitted to belong to more than one cluster. Third, the goals of the new algorithm, including maximizing the distance among cluster centers and maximizing the average number of
Acknowledgements
The work was supported in part by the MOE Program for Promoting Academic Excellence of Universities under Grant Number 91-H-FA07-1-4. We express our gratitude to three anonymous referees for their many helpful and pointed suggestions.
References (20)
- Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognition Letters (1996)
- Hybrid mining approach in the design of credit scoring models, Expert Systems with Applications (2005)
- Optimizing storage utilization in R-tree dynamic index structure for spatial databases, The Journal of Systems and Software (2001)
- The EM algorithm for graphical association models with missing data, Computational Statistics and Data Analysis (1995)
- Selecting the right objective measure for association analysis, Information Systems (2004)
- Unsupervised fuzzy clustering with multi-center clusters, Fuzzy Sets and Systems (2002)
- On a class of fuzzy c-numbers clustering procedures for fuzzy data, Fuzzy Sets and Systems (1996)
- M. Ester, H.P. Kriegel, X. Xu, Knowledge discovery in large spatial data bases: Focusing techniques for efficient class...
- Techniques of cluster algorithms in data mining, Data Mining and Knowledge Discovery Journal (2002)
- Data Mining: Concepts and Techniques (2001)