Neurocomputing

Volume 247, 19 July 2017, Pages 102-114

Multitask fuzzy Bregman co-clustering approach for clustering data with multisource features

https://doi.org/10.1016/j.neucom.2017.03.062

Abstract

In typical real-world clustering problems, the set of features extracted from the data suffers from two problems that prevent accurate clustering. First, the features extracted from the samples provide poor information for clustering purposes. Second, the feature vector usually has a high-dimensional, multi-source nature, which results in a complex cluster structure in the feature space. In this paper, we propose a combination of multi-task clustering and fuzzy co-clustering techniques to overcome these two problems. In addition, the Bregman divergence is used as the dissimilarity measure in the proposed algorithm, in order to create a general framework that allows any Bregman distance function consistent with the data distribution and the structure of the clusters. The experimental results indicate that the proposed algorithm can overcome the two mentioned problems and manage the complexity and weakness of the features, which results in appropriate clustering performance.

Introduction

Clustering is one of the basic problems in the field of machine learning, and it is used in various areas such as bioinformatics (protein structure analysis [1], genetic classification [2], [3]), trade and marketing (classification and analysis of customer behavior [4], classification of companies, and production chain management [5]), computer science (document classification [6], image segmentation [7]), social sciences (analysis of behavioral patterns of society [8], social media analysis [9]), medical applications (medical image analysis [10], [11]), and so on. The purpose of clustering is to assign the data samples to different groups such that the samples within a group are as similar as possible, while the samples in different groups have minimum similarity.

With the growth of data in many real-world applications and the advancement of data processing systems, there is a growing demand for powerful and efficient data processing and data mining algorithms. On the other hand, the increasing dimensionality and complexity of the data to be processed confront data processing and data mining algorithms, and above all data clustering algorithms as the basic tools, with many challenges. The main sources of these challenges are the types of features used to describe the data and the dimensionality of the feature vectors. The raw data itself, due to its high dimensionality and the correlation of its components, is not used directly in clustering and data mining algorithms. Instead, feature extraction algorithms extract a set of features from the data, which are combined into a feature vector, or descriptor, that represents the data. These descriptors are then used in data mining algorithms. The weakness of the features and the complexity of the distribution of the feature vectors in the feature space are the two main challenges in the field of data clustering. The main purpose of this paper is to provide a solution that reduces the impact of these factors on the performance of clustering algorithms.

The structural complexity of the distribution of the feature vectors is a common challenge in many applications. In several types of real-world problems, the feature space is formed by combining various features obtained from different feature sources. These features describe the data from different aspects and usually produce different cluster structures. The inconsistency of the cluster structures produced by different features makes it difficult to combine them into a unified clustering. Clustering countries by their economies is a good example. The features available for each country include the unemployment rate, export and import volumes, and the gross production rate. Considering each of these attributes separately yields a different cluster structure, but putting these features together to form a new feature space makes the clustering process very difficult.

The challenge of weak features occurs in cases where the features describing the data contain unreliable information for the clustering task at hand. In these cases, the clustering algorithm cannot extract the correct and appropriate cluster structures from the data. For example, if the only feature used in an image segmentation task is the location of the pixels, the resulting segmentation will be completely wrong.

In the last decade, different algorithms and methods have been proposed to solve each of these two problems. Collaborative clustering algorithms [12], [13], [14] were the most common solution for clustering data with multi-source features and high structural complexity. In these algorithms, the data are first clustered separately based on the feature blocks of each source. Then, a new feature space is defined using the results obtained from the clustering of each source, and clustering this new feature space yields the overall clustering of the data. The passive behavior of these algorithms is their main weakness: no distinction is made between the different sources. In fact, the structural differences between clusters in different sources, such as differences in the ranges of their attributes or the structural clustering power of each source, are ignored. Because of these limitations, researchers have sought methods that work directly with the feature blocks of the different sources and control the impact of each source's features on the final clustering in order to achieve an appropriate result. Co-clustering [15], [16], [17], [18] is one of the solutions proposed for this kind of clustering problem. In this approach, by assigning weights to each feature or each set of features, not only are the data assigned to clusters, but the assignments of the features to the clusters are also computed. In this way, each feature or set of features plays a more influential role in a particular cluster. This balances the cluster structures of the different feature sources and produces an appropriate clustering result based on the whole feature space.

Regarding the weakness of the features, transfer learning is a useful approach for clustering. In this approach, a small amount of the data is partitioned in a supervised manner, and the rest of the data is clustered based on the information obtained from this partitioning [19]. The need for supervision in part of the clustering process is the main drawback of this technique. Multi-task learning is one of the methods introduced after transfer learning, and it can handle the weakness of the features significantly better [20], [21], [22], [23], [24]. This method assumes that although the existing features are weak for the clustering task under consideration, they may be appropriate for other tasks. As a result, if the data are clustered simultaneously based on the task at hand and the other tasks, instead of on a single task, the information received from the other clustering tasks can provide strong guidance for the clustering task of interest.

In recent years, based on these approaches, various algorithms have been proposed that focus on solving one of the two mentioned problems. Compared to the other algorithms, multi-task clustering methods, especially those based on the Bregman divergence [22], [23], co-clustering methods, and multi-source fuzzy clustering algorithms [25], [26] achieve better and more reliable results. Thus, a combination of these techniques (i.e., multi-task and fuzzy co-clustering with the Bregman divergence as the dissimilarity measure) may result in an algorithm that overcomes both of the mentioned challenges simultaneously. By assigning different values to its parameters, such an algorithm can be used as a co-clustering algorithm, a multi-task clustering algorithm, or a combination of both.

In this paper, such an algorithm is introduced, which combines the ideas of multi-task clustering and fuzzy co-clustering to address the weakness of the data features and the complexity of the cluster structures simultaneously. The proposed algorithm uses the multi-task clustering framework presented by Zhang and Zhang [22]. This framework performs multi-task clustering by minimizing a cost function consisting of a local part and a global part. Minimizing the local part of the cost function aims to cluster each task without considering the other clustering tasks; in the proposed algorithm, the fuzzy co-clustering algorithm with Bregman divergence [27] is used for this part. The global part of the cost function reflects the fact that if two clusters in two different tasks contain the same data with the same membership values, the centers of those clusters should also be similar; accordingly, this part attempts to pull the centers of similar clusters in different tasks closer to each other. Overall, the goal of the proposed algorithm is to find three unknown variables for each task: the centers of the clusters, the data membership values, and the impact factors of the feature sources. The proposed algorithm estimates these three unknowns in an iterative process by minimizing the cost function.
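The alternating scheme described above can be illustrated with a minimal single-task sketch. The code below is a simplified illustration, not the authors' algorithm: it uses the squared Euclidean distance (the simplest Bregman divergence), omits the feature-source weights and the cross-task global term, and the function name and fuzzifier `m` are assumptions made for this example.

```python
import numpy as np

def fuzzy_bregman_clustering(X, k, m=2.0, n_iter=50, seed=0):
    """Sketch of alternating fuzzy clustering with the squared Euclidean
    distance. Source weighting and the cross-task term are omitted."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial fuzzy memberships; each row sums to 1
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # center update: Bregman centroids are membership-weighted means
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        # membership update from squared distances (standard fuzzy rule)
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        D = np.maximum(D, 1e-12)           # guard against division by zero
        U = 1.0 / (D ** (1.0 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U, C
```

On well-separated data, taking the arg-max of each row of `U` recovers the hard cluster assignment, while the soft values express the degree of membership.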

In a nutshell, the main contributions of this paper can be summarized as follows. The multi-task clustering and fuzzy co-clustering techniques are combined in order to handle the weakness of the data features and reduce the structural complexity of the clusters of multi-source features. In addition, the Bregman divergence is used as the dissimilarity measure to cope with the non-linearity of the data. The proposed algorithm offers multiple parameters, which let it operate in various modes: from fuzzy to crisp, from multi-source co-clustering to single-source clustering, and from multi-task to single-task clustering.

In order to evaluate the proposed algorithm, it is used to cluster several well-known datasets. The results of the proposed algorithm are compared with those of various multi-task clustering and fuzzy co-clustering methods. The experimental results indicate that, using the Euclidean distance as the simplest Bregman divergence, the proposed algorithm performs worse than the multi-task kernel clustering method but achieves better results than the other multi-task clustering algorithms. By changing the Bregman divergence and choosing more complex spaces, the proposed algorithm performs comparably to the multi-task kernel clustering method. However, we should bear in mind that choosing an appropriate kernel in kernel-based algorithms, or an appropriate Bregman divergence here, imposes an extra cost on the system.

The rest of this paper is organized as follows. The Bregman divergence is briefly introduced in Section 2. In Section 3, the formulation and the procedure of the proposed algorithm are discussed. Section 4 is dedicated to the experimental results. Finally, conclusions are given in Section 5.

Bregman divergence

Data clustering is usually defined in terms of minimizing distances from the centers of the clusters or from the data points representing the clusters. Thus, the definition of the distance is a fundamental issue in this area. It has been shown that a large family of distance functions can be rewritten in a standard form called the Bregman divergence.

If ϕ( · ) is a differentiable convex function, the Bregman divergence of this function is defined as follows [28]: dϕ(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), x − y⟩, where ∇ϕ(y) is the gradient of ϕ evaluated at y and ⟨ · , · ⟩ denotes the inner product.
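As a concrete check of this definition, the sketch below evaluates the Bregman divergence for two standard choices of ϕ; the function names are illustrative, not from the paper. With ϕ(x) = ‖x‖², the formula reduces to the squared Euclidean distance, and with the negative entropy ϕ(x) = Σᵢ xᵢ log xᵢ it yields the generalized KL divergence.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 gives the squared Euclidean distance
sq = lambda v: np.dot(v, v)
sq_grad = lambda v: 2 * v

# phi(x) = sum_i x_i log x_i (negative entropy) gives the
# generalized KL divergence on positive vectors
negent = lambda v: np.sum(v * np.log(v))
negent_grad = lambda v: np.log(v) + 1
```

Expanding the definition for ϕ(x) = ‖x‖² confirms the reduction: x·x − y·y − 2y·(x − y) = ‖x − y‖², which is why squared Euclidean distance is often called the simplest Bregman divergence.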

The proposed algorithm

In this section, details of the proposed algorithm are explained. First, the formulation of the proposed method is given. Then, the parameters and the optimization process used in this algorithm are discussed.

Experimental results

In this section, the performance of the proposed algorithm is evaluated using a number of well-known datasets, and the results are compared with other clustering methods. The experiments are classified into three parts. In the first part, the proposed algorithm is used in document classification application and the results are compared with three multi-task clustering algorithms. Furthermore, the effect of different values of the free parameters on the performance of the proposed algorithm is

Conclusion

In this paper, a multi-task clustering algorithm is proposed to improve the clustering accuracy, based on local information and information received from the relation of the clusters in different tasks. The co-clustering idea used in the proposed algorithm can handle the complexity of the data distribution and results in more accurate clustering. The experimental results show that two factors in the proposed algorithm have significant effects on the performance of the proposed algorithm; that

Alireza Sokhandan is Ph.D. Student in Artificial Intelligence at University of Isfahan, Isfahan, Iran. He received his B.Sc. in Information Technology Engineering in 2010 and M.Sc. in Mechatronics Engineering in 2012 from University of Tabriz, Tabriz, Iran. His research interests include image processing and computer vision, machine learning, and evolutionary algorithms.

References (42)

  • M. Hanmandlu et al.

    Color segmentation by fuzzy co-clustering of chrominance color features

    Neurocomputing

    (November 2013)
  • H. Izakian et al.

    Agreement-based fuzzy C-means for clustering data with blocks of features

    Neurocomputing

    (March 2014)
  • L. Bregman

    The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming

    USSR Comput. Math. Math. Phys.

    (1967)
  • J. Zhou et al.

    An improved method to detect correct protein folds using partial clustering

    BMC Bioinf.

    (January 2013)
  • W. Li et al.

    Ultrafast clustering algorithms for metagenomic sequence analysis

    Briefings Bioinf.

    (May 2012)
  • G. Ho et al.

    Customer grouping for better resources allocation using GA based clustering technique

    Expert Syst. Appl.

    (August 2011)
  • A. Tagarelli et al.

    A segment-based approach to clustering multi-topic documents

    Knowl. Inf. Syst.

    (March 2013)
  • M. Gong et al.

    Fuzzy C-means clustering with local information and kernel metric for image segmentation

IEEE Trans. Image Process.

(February 2013)
  • M.I. Lopez et al.

    Classification via clustering for predicting final marks based on student participation in forums

  • V. Loia et al.

    Semantic web content analysis: a study in proximity-based collaborative clustering

    IEEE Trans. Fuzzy Syst.

    (December 2007)
  • L. Coletta et al.

    Collaborative fuzzy clustering algorithms: some refinements and design guidelines

    IEEE Trans. Fuzzy Syst.

    (June 2012)

    Peyman Adibi was born in Isfahan, Iran, in 1975. He received the B.S. degree in Computer Engineering from Isfahan University of Technology, Isfahan, Iran, in 1998, and the M.S. and Ph.D. degrees in Computer Engineering from Amirkabir University of Technology, Tehran, Iran, in 2001 and 2009, respectively. Since 2010, he has been with the Artificial Intelligence Department, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran, where he is currently an Assistant Professor. His current research interests include machine learning, soft-computing, and computer vision.

    Mohammad Reza Sajadi was born in Tehran in 1987. He received his B.Sc. in Mechanical Engineering from Shahid Rajaee University, Tehran, Iran, in 2008, and his Master's degree in Mechatronics Engineering from the School of Engineering Emerging Technologies, University of Tabriz, Tabriz, Iran, in 2012. He has practical experience in the manufacturing of an electric car, a surgical robot, a wrist rehabilitation robot, and a 7-DoF redundant manipulator. His surgical robot was a selected design in the Fifth National Festival of HAREKAT.
