An overlapping cluster algorithm to provide non-exhaustive clustering
Introduction
Clustering is an important technique for information retrieval, data mining, pattern recognition, and image segmentation [8]. Because of its widespread use, many variants of clustering methods have been proposed. A recent study [4] classified clustering methods into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Each category has its own constraints and features that suit it to particular scenarios. This paper focuses on partitioning methods; hence the basic assumptions and constraints of this line of work are discussed.
Partitioning methods cluster n objects into k clusters, where k is specified by the user. Each object can be described by multiple attributes. Many distance definitions have been proposed to measure the similarity between two objects, of which the Euclidean distance is perhaps the most popular. The farther apart two objects are, the more dissimilar they are.
The following are the basic assumptions of the traditional partitioning cluster methods:
- One object can only belong to one cluster.
- Each object is assigned to the nearest cluster.
The well-known K-Means and K-Medoids algorithms were developed based on the assumptions above. Since one object can only belong to one cluster, the K-Means and K-Medoids methods assign an object to the nearest cluster. These algorithms are designed to find clusters of objects so that the sum of all distances from objects to their cluster centers should be as small as possible. The K-Means and K-Medoids algorithms have been successfully used in numerous applications [5], [11], [12], [16].
In this paper, the fundamental assumption of the traditional partitioning clustering algorithm is extended so that each object can now belong to multiple clusters rather than a single cluster. In other words, these clusters may overlap with one another. The reason behind this extension is that, in some circumstances, it is not appropriate to let an object belong to a single cluster. For example, documents can be partitioned into a number of clusters, where each cluster represents the documents in a certain area. Since a document may be related to several areas, it is natural for a document to belong to more than one cluster. Moreover, overlapping clustering differs from fuzzy clustering, since the relationship between an object and a cluster is crisp rather than represented by a fuzzy membership degree.
In a typical clustering algorithm such as K-Means and K-Medoids, the goal is twofold: one is to maximize the dissimilarity of objects in different clusters and the other is to maximize the similarity of objects in the same cluster. In the new overlapping cluster algorithm, these two goals are changed as follows.
Firstly, the dissimilarity of objects in different clusters will no longer be considered. Instead, the dissimilarity among cluster centers is considered. In practice, a cluster center can represent a typical pattern of the objects in that particular group. For example, in a cluster of documents the cluster center represents the typical document in that group. Similarly, in a cluster of customers, the cluster center is the typical customer that best represents the purchasing behavior of this market segment. Therefore, if the cluster centers are kept far away from each other, the representative patterns of these clusters will be more distinguishable. In other words, since each center represents the core concept of a group, keeping these centers distant identifies a set of core concepts that are clearly separated and carry clear semantics.
Secondly, the similarity of objects in the same cluster will no longer be considered. Instead, a threshold is set such that all objects whose distances from a cluster center are no more than the threshold belong to this cluster. In this way, an object can simultaneously belong to multiple clusters if its distances to all of those centers are no more than the given threshold. But if an object is far away from all cluster centers, it is not included in any cluster, and this results in a non-exhaustive clustering. Furthermore, it is attempted to maximize the number of objects contained in a cluster, because the more general the concept of a class is, the more objects the class can cover.
Finally, this research will propose a new algorithm to solve the overlapping cluster problem. Section 2 will give a review of existing partitioning methods. Section 3 presents the overlapping partitioning cluster (OPC) algorithm. In Section 4, both synthetic data sets and real data sets are used to evaluate the proposed algorithm. Comparisons are made between the results obtained from the suggested clustering algorithm with those obtained from the traditional partition methods. Section 5 presents the conclusion and the future research areas.
Section snippets
Partitioning cluster methods
As pointed out by [8], partitioning clustering methods can be roughly classified into two major approaches: the K-Means approach and the K-Medoids approach. Both use the distance among objects to evaluate their similarity. The basic idea is to assign each object to the nearest cluster so as to minimize the intra-cluster distance of objects. The user must specify k, the number of clusters.
The K-Means algorithm performs the following steps. First, it randomly chooses k
The overlapping partitioning cluster algorithm
The basic idea of partitioning clustering is to group similar objects into a cluster. A set of numerical attributes is associated with each object; for example, age, income, monthly spending, average consumption amount, and other attributes can describe a customer. Since the Euclidean distance is a well-known distance measure, this paper assumes that the distance between objects is computed by the Euclidean formula. However, since there are other variants of distance definition
Evaluation
The evaluation of the OPC algorithm contains two parts. The first part compares OPC with the traditional partitioning methods, K-Means and K-Medoids, using synthetic data sets. The second part evaluates OPC by using real data sets, the Abalone data set and the Telecom data set.
In the experiment, the OPC, K-Means and K-Medoids algorithms were implemented in C and tested on a Celeron(R) CPU Windows-XP system with 1024 megabytes of main memory. Since the objective of the K-Means and K
Conclusion
This paper has introduced a new overlapping clustering algorithm, which partitions n objects into k non-exhaustive clusters that may overlap with each other. The new algorithm differs from the traditional clustering problem in three respects. First, the clusters are allowed to overlap with each other. Second, an object is permitted to belong to more than one cluster. Third, the goals of the new algorithm, including maximizing the distance among cluster centers and maximizing the average number of
Acknowledgements
The work was supported in part by the MOE Program for Promoting Academic Excellence of Universities under Grant Number 91-H-FA07-1-4. We express our gratitude to three anonymous referees for their many helpful and pointed suggestions.
References (20)
- Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognition Letters (1996)
- Hybrid mining approach in the design of credit scoring models, Expert Systems with Applications (2005)
- Optimizing storage utilization in R-tree dynamic index structure for spatial databases, The Journal of Systems and Software (2001)
- The EM algorithm for graphical association models with missing data, Computational Statistics and Data Analysis (1995)
- Selecting the right objective measure for association analysis, Information Systems (2004)
- Unsupervised fuzzy clustering with multi-center clusters, Fuzzy Sets and Systems (2002)
- On a class of fuzzy c-numbers clustering procedures for fuzzy data, Fuzzy Sets and Systems (1996)
- M. Ester, H.P. Kriegel, X. Xu, Knowledge discovery in large spatial data bases: Focusing techniques for efficient class...
- Techniques of cluster algorithms in data mining, Data Mining and Knowledge Discovery Journal (2002)
- Data Mining: Concepts and Techniques (2001)