Clustering of multi-view relational data based on particle swarm optimization

https://doi.org/10.1016/j.eswa.2018.12.053

Highlights

  • The paper provides multi-view clustering algorithms for relational data.

  • The algorithms are based on Particle Swarm Optimization.

  • They are able to select the relevant views for the clustering task.

  • The algorithms consider eleven different fitness functions.

  • Experiments with real multi-view data sets show their usefulness.

Abstract

Clustering of multi-view data has received increasing attention since it explores multiple views of data sets aiming at improving clustering accuracy. Particle Swarm Optimization (PSO) is a well-known population-based meta-heuristic successfully used in cluster analysis. This paper introduces two hybrid clustering methods for multi-view relational data. These hybrid methods combine PSO and hard clustering algorithms based on multiple dissimilarity matrices. These methods take advantage of the global convergence ability of PSO and the local exploitation of hard clustering algorithms in the position update step, aiming to improve the balance between exploitation and exploration processes. Moreover, the paper provides adapted versions of 11 fitness functions suitable for vector data aiming at dealing with multi-view relational data. Two performance criteria were used to evaluate the clustering quality using the two proposed methods over eleven real-world data sets including image and document data sets. Among new findings, it was observed that the top three fitness functions are Silhouette index, Xu index and Intra-cluster homogeneity. The performance of the proposed algorithms was compared with previous single and multi-view relational clustering algorithms. The results show that the proposed methods significantly outperformed the other algorithms in the majority of cases. The results reinforce the importance of the application of techniques such as PSO-based clustering algorithms in the field of expert systems and machine learning. Such application enhances classification accuracy and cluster compactness. Besides, the proposed algorithms can be useful tools in content-based image retrieval systems, providing good categorizations and automatically learning relevance weights for each cluster of images and sets of views.

Introduction

Clustering algorithms separate a set of objects into groups or clusters, so that objects within a given cluster have a high degree of similarity but are dissimilar to objects in other clusters. These methods are widely applied in many areas, including data mining, statistics, biology, machine learning, document retrieval and pattern recognition (Armano & Farmani, 2016; Han, Kamber, & Pei, 2012; Jain, Murty, & Flynn, 1999).

In several real-world applications, data have multiple representations or sources (views), which usually carry complementary information that can be used to improve the accuracy of the clustering task (Wang, Dou, Liu, Lv, & Li, 2016; Zhang, Wang, Zong, & Yu, 2016; Zhang, Zhao, Zong, Liu, & Yu, 2014). Each view may represent different data aspects (Jiang, Qiu, & Wang, 2016a). Therefore, it is essential to integrate these views to generate more robust and accurate clustering, instead of considering only a single data view (Li, Nie, Huang, & Huang, 2015; Zhang, Wang, Huang, & Zheng, 2017).

According to Xu, Han, Nie, and Li (2017) and Long, Yu, and Zhang (2008), the efficient clustering of multi-view data is challenging. Xu, Wang, and Lai (2016) pointed out that combining multiple views of a data set to improve performance in data clustering is a significant research challenge. Multi-view data clustering is usually modeled as an optimization problem. Therefore, nature-inspired meta-heuristics can be interesting tools for evolving a set of candidate solutions and thus determining a near-optimal partitioning of the data set.

The number of possible partitions grows exponentially as the size of the data set and the number of clusters increase (Jain & Dubes, 1988). Instead of exhaustively enumerating all possible partitions to find the globally optimal one, heuristic methods can be used (Han et al., 2012). In particular, meta-heuristics are methods that coordinate solutions in the search space, trying to avoid local optima, to solve generic optimization problems (Gendreau & Potvin, 2010). Since data clustering can be modeled as an optimization problem, nature-inspired meta-heuristics (such as Particle Swarm Optimization, PSO) can be applied for evolving a set of candidate solutions and thus determining a near-optimal partitioning of the data set.
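To give a sense of this growth, the number of ways to split n labeled objects into exactly k non-empty clusters is the Stirling number of the second kind, S(n, k). The Python sketch below computes it with the standard recurrence. The function name `stirling2` is chosen here for illustration and does not come from the paper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of partitions of n labeled objects into k non-empty clusters."""
    if n == k:
        return 1  # also covers S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # The n-th object either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```

For instance, `stirling2(10, 3)` already gives 9330 candidate partitions for only ten objects and three clusters, which is why exhaustive enumeration is impractical for realistic data sets.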

PSO has become one of the most popular population-based meta-heuristics due to its simplicity and versatility (Filho, Pimentel, Souza, & Oliveira, 2015; Rana, Jasola, & Kumar, 2011). These facts have motivated researchers to propose PSO-based algorithms for hard and soft clustering of vector data. PSO-based clustering methods have shown improved results when compared to traditional partitioning algorithms such as K-means, K-medoids and Fuzzy C-means.

The two most common representations of objects on which clustering can be based are vector data and relational data. In vector data, each object is described by a vector of quantitative or qualitative values. In this case, data are represented by an n × l data matrix, assuming n objects described by l attributes. Alternatively, when each pair of objects is represented by a relationship (expressed, for example, by a dissimilarity), the set of relationships is called relational data. Usually, the set of objects is represented by a single dissimilarity matrix containing the relational descriptions of all pairs of objects. Although clustering of relational data has received less attention, according to Frigui, Hwang, and Chung-Hoon Rhee (2007), relational data clustering is more general in the sense that it applies to situations in which the objects to be clustered cannot be described by vector data.
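The relationship between the two representations can be illustrated with a short Python sketch that derives an n × n dissimilarity matrix from an n × l data matrix. The Euclidean distance and the function name `dissimilarity_matrix` are assumptions made for this example; any dissimilarity measure could take their place.

```python
import math
from typing import List, Sequence


def dissimilarity_matrix(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build an n x n Euclidean dissimilarity matrix from n x l vector data."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])  # Euclidean distance (Python 3.8+)
            D[i][j] = D[j][i] = d      # symmetric, with a zero diagonal
    return D
```

Once the matrix is built, a relational algorithm needs only `D`; the original attribute vectors in `X` can be discarded or kept private.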

According to Frigui et al. (2007), relational clustering is also more practical in situations in which the complexity of the distance used to compare the objects is high, when the distance measure does not have a closed form, or when groups of similar objects cannot be effectively represented by a single prototype (e.g., a center). Confidentiality is potentially another advantage of relational data, since the attributes of objects can be kept secret or may not even be available. For example, clients of a bank can be represented by a relational matrix, which hides all the attribute information except for dissimilarities among clients (Horta, de Andrade, & Campello, 2011).

When objects are described by several dissimilarity matrices, these matrices form a multi-view relational data set. For example, the relationship among multimedia objects may be described by multiple (dis)similarity matrices. In image database categorization, there may be several matrices: one that encodes color information, another for texture information and another for structure information (Frigui et al., 2007). Thus, multi-view relational techniques that deal with complex data and provide accurate clustering are important. Since each matrix represents a different aspect of the data set, dissimilarity matrices have different influences on the clustering process. Therefore, the computation of relevance weights for each matrix is essential (Frigui et al., 2007).
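As a rough illustration of how relevance weights let views influence the result differently, the Python sketch below merges p dissimilarity matrices into one by a weighted sum. This is only an assumed, simplified form: the weights are fixed inputs here, whereas the paper describes them as learned, and its exact weighting scheme is not given in this excerpt. The name `combine_views` is hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def combine_views(views: Sequence[Matrix], weights: Sequence[float]) -> List[List[float]]:
    """Weighted sum of p dissimilarity matrices, one matrix per view."""
    if len(views) != len(weights):
        raise ValueError("one weight per view is required")
    n = len(views[0])
    combined = [[0.0] * n for _ in range(n)]
    for D, w in zip(views, weights):
        for i in range(n):
            for j in range(n):
                # A larger weight makes this view count more in the result.
                combined[i][j] += w * D[i][j]
    return combined
```

With a color matrix and a texture matrix, for example, `combine_views([color, texture], [0.5, 0.25])` would make color differences count twice as much as texture differences.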

This study aimed to develop a hybrid approach and investigate its use to improve accuracy when clustering multi-view relational data. The main contribution of this paper is to introduce two hybrid methods that combine PSO and hard clustering algorithms based on multiple dissimilarity matrices with relevance weights. These methods take advantage of the global convergence ability of PSO and the local exploitation of hard clustering algorithms in the position update step, aiming to improve the balance between exploitation and exploration processes. The choice of a proper clustering validity index as objective function is vital to the success of a clustering algorithm. Therefore, the second contribution of this work is the investigation and adaptation of several clustering validity indices. These indices are modified to consider the dissimilarities between each pair of objects contained in p dissimilarity matrices and the relevance weight of each matrix.
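To make the idea of an adapted index concrete, the following Python sketch shows one way an intra-cluster homogeneity criterion could be written for p dissimilarity matrices with relevance weights. It is an assumption-laden illustration, not the paper's definition: it supposes that each cluster is represented by a single medoid object, that one global weight is used per view, and that the criterion is the sum of weighted dissimilarities between each object and the medoid of its cluster. The name `homogeneity` is hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def homogeneity(views: Sequence[Matrix], weights: Sequence[float],
                labels: Sequence[int], medoids: Sequence[int]) -> float:
    """Sum of weighted dissimilarities of objects to their cluster medoid.

    labels[i]  -> cluster index of object i
    medoids[c] -> index of the object representing cluster c (assumed form)
    Lower values indicate more compact clusters.
    """
    total = 0.0
    for i, c in enumerate(labels):
        g = medoids[c]
        # Aggregate the dissimilarity to the medoid over all views.
        total += sum(w * D[i][g] for D, w in zip(views, weights))
    return total
```

A PSO-based method could use such a value as the fitness to minimize, with each particle encoding a candidate set of medoids and weights.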

Two sets of experiments on eleven real-world data sets, including image and document data sets, were conducted to evaluate the proposed methods. In the first study, the objective was to investigate which fitness functions are the most suitable for clustering multi-view relational data. The fitness functions considered in this study are adaptations of several clustering validity indices, some of which are based only on intra-cluster homogeneity whereas the others are based on both compactness and separation of clusters. In the second study, the proposed algorithms were compared to seven algorithms suitable for relational data and two other algorithms suitable for vector data.
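The Silhouette index is a familiar example of an index that balances compactness and separation, and it can be computed directly from a dissimilarity matrix. The Python sketch below follows the standard single-matrix definition; how the paper extends it to several weighted matrices is not shown in this excerpt, so applying it to a weighted combination of views would be an assumption. The name `silhouette` is chosen for illustration.

```python
from typing import Dict, List, Sequence


def silhouette(D: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette width computed from a dissimilarity matrix."""
    clusters: Dict[int, List[int]] = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the same cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of another cluster
        b = min(sum(D[i][j] for j in members) / len(members)
                for k, members in clusters.items() if k != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

Values close to 1 indicate compact, well-separated clusters, so a PSO-based method would maximize this index rather than minimize it.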

Regarding fitness criteria, our experimental results suggest that the Silhouette index, the Xu index, and the Intra-cluster homogeneity can be promising alternatives for clustering multi-view relational data with reasonable accuracy. The results show that our approach significantly outperformed the other relational algorithms on two evaluation indexes in the majority of cases. This reinforces the importance of applying techniques such as PSO-based clustering algorithms in the field of expert systems and machine learning. Such application enhances classification accuracy and cluster compactness (Alswaitti, Albughdadi, & Isa, 2018). The results obtained by the proposed approach are promising and encourage the use of these methods in other real-world applications.

The remainder of this paper is structured as follows. Section 2 presents a review of some related works. Section 3 introduces previous knowledge needed for understanding the central concepts. The proposed methods are presented in Section 4. Eleven multi-view data sets were used in experiments to compare methods according to two external indexes, and the empirical results are shown in Section 5. Section 6 presents the concluding remarks.

Related works

Concerning multi-view data clustering, there are several approaches in the literature focused on vector data, whereas only a few methods are focused on relational data. This section discusses some relevant papers related to multi-view data clustering and their respective contributions.

Bickel and Scheffer (2004) addressed the problem of multi-view clustering for vector data, so that the available attributes could be divided into two independent subsets assuming that each subset is sufficient for

Background

In this section, a brief introduction to particle swarm optimization is provided. In addition, the algorithms mentioned in the previous section on which this work is based are further explained.
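For readers unfamiliar with PSO, the Python sketch below shows the canonical velocity and position update with an inertia weight for a single particle in a continuous search space. It is a generic textbook form, not the variant proposed in the paper, whose particle encoding and position update step are not shown in this excerpt. The function name `pso_step` and the default values of `w`, `c1` and `c2` are illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def pso_step(x: Sequence[float], v: Sequence[float],
             pbest: Sequence[float], gbest: Sequence[float],
             w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
             rng: Optional[random.Random] = None) -> Tuple[List[float], List[float]]:
    """One canonical PSO update for a single particle.

    x, v  -> current position and velocity
    pbest -> best position found so far by this particle
    gbest -> best position found so far by the whole swarm
    """
    rng = rng or random.Random()
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # inertia + cognitive pull (own best) + social pull (swarm best)
        vel = w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd)
        new_v.append(vel)
        new_x.append(xd + vel)
    return new_x, new_v
```

The inertia term favors exploration, while the two attraction terms pull the particle toward known good solutions; a hybrid method can additionally refine each new position with a local clustering step.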

Clustering of multi-view relational data based on PSO

This section introduces two hard clustering algorithms based on Particle Swarm Optimization for multi-view relational data, namely PSORWL and PSORWG, which partition objects by simultaneously taking into account their relational descriptions provided by several dissimilarity matrices. There are few relational clustering algorithms in the literature able to operate simultaneously on several dissimilarity matrices, such as CARDR, CARDF (Frigui et al., 2007), MRDCA-RWL and MRDCA-RWG (De Carvalho et al.,

Empirical results

In this section, experiments on eleven real-world multi-view data sets are conducted to evaluate the performance of the proposed PSO-based algorithms. First, the data sets as well as the evaluation measures are briefly described in Section 5.1. Then, the effect of different fitness functions on the quality of the partitions found is analyzed in Section 5.2. The performance of the proposed approach compared to baseline methods is evaluated in Section 5.3. Finally, a discussion of

Conclusion

This study proposed a new approach for clustering of multi-view relational data based on the PSO clustering algorithm combined with modified versions of MRDCA-RWL and MRDCA-RWG approaches. These proposed methods take advantage of the global convergence ability of PSO and the local exploitation of the hard clustering algorithms in the position update step, and consequently, better quality partitions of the multi-view relational data are found.

Different experiments were conducted to evaluate the

Author contribution statement

As a doctoral candidate, Rene Pereira de Gusmão participated actively in the conception of the models, in the implementation of the models, in the experimental evaluation of the models, in the writing of the paper,

As supervisor, Francisco de A. T. Carvalho participated actively in the conception of the models and in the writing of the paper.

Acknowledgement

The authors are grateful to the anonymous referees for their careful revision, valuable suggestions, and comments, which improved this paper. The authors thank Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (303187/2013-1) and Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco - FACEPE (PBPG-0396-1.03/14) (Brazilian agencies) for their partial financial support.

References (59)

  • S. Rana et al. A review on particle swarm optimization algorithms and their applications to data clustering. Artificial Intelligence Review (2011)

  • C.J.V. Rijsbergen. Information retrieval (1979)

  • Y. Shi et al. Parameter selection in particle swarm optimization

  • Y.-M. Xu et al. Weighted multi-view clustering with feature selection. Pattern Recognition (2016)

  • G.-Y. Zhang et al. Multi-view collaborative locally adaptive clustering with Minkowski metric. Expert Systems with Applications (2017)

  • X. Zhang et al. Multi-view clustering via graph regularized symmetric nonnegative matrix factorization. 2016 IEEE international conference on cloud computing and big data analysis (ICCCBDA) (2016)

  • X. Zhang et al. Multi-view clustering via multi-manifold regularized nonnegative matrix factorization. 2014 IEEE international conference on data mining (2014)

  • S. Bickel et al. Multi-view clustering. Data mining, 2004. ICDM ’04. Fourth IEEE international conference on (2004)

  • X. Cai et al. Multi-view k-means clustering on big data. Proceedings of the twenty-third international joint conference on artificial intelligence (2013)

  • M. Ceci et al. Semi-supervised multi-view learning for gene network reconstruction. PLOS ONE (2015)

  • X. Chen et al. Tw-k-means: Automated two-level variable weighting clustering algorithm for multiview data. IEEE Transactions on Knowledge and Data Engineering (2013)

  • Y. Cheung et al. Unsupervised feature selection with feature clustering. 2012 IEEE/WIC/ACM international conferences on web intelligence and intelligent agent technology (2012)

  • C.-H. Chou et al. A new cluster validity measure and its application to image compression. Pattern Analysis and Applications (2004)

  • S. Das et al. Metaheuristic clustering (2009)

  • D.L. Davies et al. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence (1979)

  • D.V. der Merwe et al. Data clustering using particle swarm optimization. Evolutionary computation, 2003. CEC’03. The 2003 congress on (2003)

  • E. Dimitriadou et al. An examination of indexes for determining the number of clusters in binary data sets. Psychometrika (2002)

  • J.C. Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics (1973)

  • T.M.S. Filho et al. Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization. Expert Systems with Applications (2015)