Clustering of multi-view relational data based on particle swarm optimization
Introduction
Clustering algorithms separate a set of objects into groups, or clusters, so that objects within a given cluster have a high degree of similarity to one another but are dissimilar to objects in other clusters. These methods are widely applied in many fields, including data mining, statistics, biology, machine learning, document retrieval and pattern recognition (Armano & Farmani, 2016; Han, Kamber, & Pei, 2012; Jain, Murty, & Flynn, 1999).
In several real-world applications, data have multiple representations or sources, which usually carry complementary information and can be used to improve the accuracy of the clustering task (Wang, Dou, Liu, Lv, & Li, 2016; Zhang, Wang, Zong, & Yu, 2016; Zhang, Zhao, Zong, Liu, & Yu, 2014). Each view may represent a different aspect of the data (Jiang, Qiu, & Wang, 2016a). Therefore, it is essential to integrate these views to generate a more robust and accurate clustering, instead of considering only a single data view (Li, Nie, Huang, & Huang, 2015; Zhang, Wang, Huang, & Zheng, 2017).
According to Xu, Han, Nie, and Li (2017) and Long, Yu, and Zhang (2008), the efficient clustering of multi-view data is challenging. Xu, Wang, and Lai (2016) pointed out that combining multiple views of a data set to improve clustering performance is a significant research challenge. Multi-view data clustering is usually modeled as an optimization problem.
The number of possible partitions grows exponentially as the size of the data set and the number of clusters increase (Jain & Dubes, 1988). Instead of exhaustively enumerating all possible partitions to find the globally optimal one, heuristic methods can be used (Han et al., 2012). In particular, meta-heuristics are methods that coordinate the exploration of the search space while trying to avoid local optima, and they can be applied to generic optimization problems (Gendreau & Potvin, 2010). Since data clustering can be modeled as an optimization problem, nature-inspired meta-heuristics such as Particle Swarm Optimization (PSO) can be applied to evolve a set of candidate solutions and thus determine a near-optimal partitioning of the data set.
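To give a sense of the growth mentioned above, the number of ways to split n labeled objects into exactly k non-empty clusters is the Stirling number of the second kind, S(n, k). The sketch below is only an illustration of that count and is not taken from the paper; the function name `stirling2` is our own.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of partitions of n labeled objects into k non-empty clusters."""
    if n == k:
        return 1  # includes the empty case S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


# Illustrative sizes only: the count explodes long before n reaches
# the size of a typical data set.
for n_objects in (4, 10, 20):
    print(n_objects, stirling2(n_objects, 4))
```

Already for 20 objects and 4 clusters the count exceeds 4.5 × 10^10, which is why exhaustive enumeration is ruled out in practice.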
PSO has become one of the most popular population-based meta-heuristics due to its simplicity and versatility (Filho, Pimentel, Souza, & Oliveira, 2015; Rana, Jasola, & Kumar, 2011). These facts have motivated researchers to propose PSO-based algorithms for hard and soft clustering of vector data. PSO-based clustering methods have shown improved results when compared to traditional partitioning algorithms such as K-means, K-medoids and Fuzzy C-means.
The two most common representations of objects on which clustering can be based are vector data and relational data. In vector data, each object is described by a vector of quantitative or qualitative values. In this case, the data are represented by an n × l data matrix, assuming n objects described by l attributes. Alternatively, when each pair of objects is represented by a relationship (expressed, for example, by a dissimilarity), the set of relationships is called relational data. Usually, the set of objects is represented by a single dissimilarity matrix containing the relational descriptions of all pairs of objects. Although clustering of relational data has received less attention, according to Frigui, Hwang, and Chung-Hoon Rhee (2007), relational data clustering is more general in the sense that it applies to situations in which the objects to be clustered cannot be described by vector data.
According to Frigui et al. (2007), relational clustering is also more practical when the distance used to compare the objects is computationally complex, when the distance measure does not have a closed form, or when groups of similar objects cannot be effectively represented by a single prototype (e.g., a center). Confidentiality is potentially another advantage of relational data, since the attributes of the objects can be kept secret or may not even be available. For example, clients of a bank can be represented by a relational matrix, which hides all the attribute information except for the dissimilarities among clients (Horta, de Andrade, & Campello, 2011).
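The relation between the two representations can be illustrated with a small sketch. It assumes quantitative attributes and the Euclidean distance, which is only one possible choice of dissimilarity. The function name and the toy data are hypothetical.

```python
import math


def dissimilarity_matrix(X):
    """Build an n x n relational description from an n x l data matrix.

    Assumes numeric attributes and Euclidean distance (illustrative choice).
    """
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            D[i][j] = D[j][i] = d  # dissimilarities are symmetric
    return D


# Toy vector data: 3 objects described by 2 attributes.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(X)
```

Once `D` has been computed, a relational algorithm needs only `D`. The attribute values in `X` no longer have to be disclosed, which is the confidentiality argument made above.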
When objects are described by several dissimilarity matrices, these matrices form a multi-view relational data set. For example, the relationship among multimedia objects may be described by multiple (dis)similarity matrices. In image database categorization, there may be several matrices, such as one that encodes color information, another for texture information and another for structure information (Frigui et al., 2007). Thus, multi-view relational techniques that can deal with such complex data and provide accurate clustering are needed. Since each matrix represents a different aspect of the data set, the dissimilarity matrices have different influences on the clustering process. Therefore, the computation of relevance weights for each matrix is essential (Frigui et al., 2007).
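As a minimal sketch of how relevance weights can modulate the influence of each view, the code below forms a weighted sum of p dissimilarity matrices. The weights are supplied by hand here; how they are actually estimated by the methods discussed in this paper is not shown in this excerpt. The function name, the matrices and the weight values are all hypothetical.

```python
def combine_views(views, weights):
    """Weighted sum of p dissimilarity matrices, all of size n x n."""
    if len(views) != len(weights):
        raise ValueError("one weight per dissimilarity matrix is required")
    n = len(views[0])
    combined = [[0.0] * n for _ in range(n)]
    for D, lam in zip(views, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += lam * D[i][j]
    return combined


# Two toy views of the same 3 objects (e.g., color and texture).
color = [[0.0, 1.0, 4.0], [1.0, 0.0, 3.0], [4.0, 3.0, 0.0]]
texture = [[0.0, 2.0, 2.0], [2.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

# Hand-picked weights: the color view counts four times as much as texture.
weights = [2.0, 0.5]
combined = combine_views([color, texture], weights)
```

A larger weight makes the corresponding matrix dominate the combined dissimilarity, so the partition is driven mostly by that view.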
This study aimed to develop a hybrid approach and to investigate its use for improving accuracy when clustering multi-view relational data. The main contribution of this paper is therefore the introduction of two hybrid methods that combine PSO and hard clustering algorithms based on multiple dissimilarity matrices with relevance weights. These methods take advantage of the global convergence ability of PSO and the local exploitation of hard clustering algorithms in the position update step, aiming to improve the balance between exploitation and exploration. The choice of a proper clustering validity index as the objective function is vital to the success of a clustering algorithm. Therefore, the second contribution of this work is the investigation and adaptation of several clustering validity indices. These indices are modified to consider the distances between each pair of objects contained in p dissimilarity matrices and the relevance weights for each matrix.
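To illustrate what a fitness function over p dissimilarity matrices with relevance weights might look like, the sketch below computes a weighted intra-cluster homogeneity criterion. It assumes that each cluster is represented by a single medoid object, which is a simplifying assumption of ours. The exact definitions adopted in the paper are not shown in this excerpt, and all names and values below are hypothetical.

```python
def weighted_homogeneity(views, weights, labels, medoids):
    """Sum of weighted dissimilarities between each object and its medoid.

    views   -- list of p dissimilarity matrices (n x n)
    weights -- one relevance weight per matrix
    labels  -- labels[i] is the cluster index of object i
    medoids -- medoids[k] is the index of the object representing cluster k
    Lower values indicate more homogeneous clusters.
    """
    total = 0.0
    for i, k in enumerate(labels):
        g = medoids[k]
        total += sum(lam * D[i][g] for D, lam in zip(views, weights))
    return total


# Two toy views of 4 objects; objects {0, 1} and {2, 3} are close in both.
view_a = [[0, 1, 5, 6], [1, 0, 4, 5], [5, 4, 0, 1], [6, 5, 1, 0]]
view_b = [[0, 2, 8, 8], [2, 0, 8, 8], [8, 8, 0, 2], [8, 8, 2, 0]]
views, weights, medoids = [view_a, view_b], [1.0, 0.5], [0, 2]

good = weighted_homogeneity(views, weights, [0, 0, 1, 1], medoids)
bad = weighted_homogeneity(views, weights, [0, 1, 0, 1], medoids)
```

In a PSO setting, a criterion of this kind would be evaluated for the partition encoded by each particle, and the partition with the lowest value would be preferred.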
Two sets of experiments on eleven real-world data sets, including image and document data sets, were conducted to evaluate the proposed methods. In the first study, the objective was to investigate which fitness functions are the most suitable for clustering multi-view relational data. The fitness functions considered in this study are adaptations of several clustering validity indices, some of which are based only on intra-cluster homogeneity, whereas others are based on both the compactness and the separation of clusters. In the second study, the proposed algorithms were compared to seven algorithms suitable for relational data and two other algorithms appropriate for vector data.
Regarding fitness criteria, our experimental results suggest that the Silhouette index, the Xu index, and the intra-cluster homogeneity can be promising alternatives for clustering multi-view relational data with reasonable accuracy. The results show that our approach significantly outperformed the other relational algorithms on two evaluation indexes in the majority of cases. This reinforces the importance of applying techniques such as PSO-based clustering algorithms in the field of expert systems and machine learning. Such application enhances classification accuracy and cluster compactness (Alswaitti, Albughdadi, & Isa, 2018). The results obtained by the proposed approach are promising and encourage the use of these methods in other real-world applications.
The remainder of this paper is structured as follows. Section 2 presents a review of some related works. Section 3 introduces previous knowledge needed for understanding the central concepts. The proposed methods are presented in Section 4. Eleven multi-view data sets were used in experiments to compare methods according to two external indexes, and the empirical results are shown in Section 5. Section 6 presents the concluding remarks.
Related works
Concerning multi-view data clustering, there are several approaches in the literature that focus on vector data, whereas only a few methods focus on relational data. This section discusses some relevant papers related to multi-view data clustering and their respective contributions.
Bickel and Scheffer (2004) addressed the problem of multi-view clustering for vector data, so that the available attributes could be divided into two independent subsets assuming that each subset is sufficient for
Background
In this section, a brief introduction to the particle swarm optimization will be provided. In addition, the algorithms mentioned in the previous section on which this work is based will be further explained.
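For orientation, the following is a minimal sketch of the canonical global-best PSO with an inertia weight, applied to a continuous toy objective. It is the generic textbook form and not the clustering-specific variant proposed in this paper. The parameter values (`w`, `c1`, `c2`), the search range and the `sphere` objective are illustrative assumptions.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares, optimum at the origin."""
    return sum(v * v for v in x)


def pso_minimize(fitness, dim, n_particles=20, n_iter=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position seen by the swarm
    history = [gbest_fit]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + cognitive (own best) + social (swarm best) terms
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history


best, best_fit, history = pso_minimize(sphere, dim=3)
```

The swarm-best fitness recorded in `history` never increases, because the global best is replaced only when a particle finds a strictly better position.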
Clustering of multi-view relational data based on PSO
This section introduces two hard clustering algorithms based on Particle Swarm Optimization for multi-view relational data, namely PSORWL and PSORWG, which partition objects by simultaneously taking into account their relational descriptions provided by several dissimilarity matrices. Few relational clustering algorithms in the literature are able to operate simultaneously on several dissimilarity matrices; examples include CARDR, CARDF (Frigui et al., 2007), MRDCA-RWL and MRDCA-RWG (De Carvalho et al.,
Empirical results
In this section, experiments on eleven real-world multi-view data sets are conducted to evaluate the performance of the proposed PSO-based algorithms. First, the data sets as well as the evaluation measures are briefly described in Section 5.1. Then, the effect of different fitness functions on the quality of the partitions found is analyzed in Section 5.2. The performance of the proposed approach compared to baseline methods is evaluated in Section 5.3. Finally, a discussion of
Conclusion
This study proposed a new approach for clustering of multi-view relational data based on the PSO clustering algorithm combined with modified versions of MRDCA-RWL and MRDCA-RWG approaches. These proposed methods take advantage of the global convergence ability of PSO and the local exploitation of the hard clustering algorithms in the position update step, and consequently, better quality partitions of the multi-view relational data are found.
Different experiments were conducted to evaluate the
Author contribution statement
As a doctoral candidate, Rene Pereira de Gusmão participated actively in the conception of the models, in the implementation of the models, in the experimental evaluation of the models, in the writing of the paper,
As supervisor, Francisco de A. T. Carvalho participated actively in the conception of the models and in the writing of the paper.
Acknowledgement
The authors are grateful to the anonymous referees for their careful revision, valuable suggestions, and comments, which improved this paper. The authors thank Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (303187/2013-1) and Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco - FACEPE (PBPG-0396-1.03/14) (Brazilian agencies) for their partial financial support.
References (59)
- et al. Research on particle swarm optimization based clustering: A systematic review of literature and techniques. Swarm and Evolutionary Computation (2014)
- et al. Density-based particle swarm optimization algorithm for data clustering. Expert Systems with Applications (2018)
- et al. Multiobjective clustering analysis using particle swarm optimization. Expert Systems with Applications (2016)
- et al. On measuring the distance between histograms. Pattern Recognition (2002)
- Multi-view clustering via spectral partitioning and local refinement. Information Processing & Management (2016)
- et al. Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition (2012)
- et al. Clustering and aggregation of relational data with applications to image database categorization. Pattern Recognition (2007)
- et al. NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition (1994)
- et al. Evolutionary fuzzy clustering of relational data. Theoretical Computer Science (2011)
- et al. Evolutionary multi-objective optimization for multi-view clustering. 2016 IEEE Congress on Evolutionary Computation (CEC) (2016)