Elsevier

Information Sciences

Volume 585, March 2022, Pages 209-231

PR-FCM: A polynomial regression-based fuzzy C-means algorithm for attribute-associated data

https://doi.org/10.1016/j.ins.2021.11.056

Highlights

  • A novel fuzzy c-means algorithm is proposed for attribute-associated data clustering.

  • The parameters of the proposed algorithm are thoroughly investigated through synthetic datasets.

  • The proposed algorithm outperforms the compared algorithms on synthetic, real-world, and tunnel boring machine datasets.

Abstract

Partitioning data into internally homogeneous parts is an important problem when mining in situ engineering data. In this paper, a polynomial regression-based fuzzy c-means (PR-FCM) clustering algorithm that utilizes the functional relationships among the attributes of the input dataset is proposed. In this algorithm, a polynomial regression equation is taken as the center of each cluster instead of the cluster prototype used in conventional FCM, and the difference between a sample and a cluster is defined as the distance between the actual value of one attribute and the corresponding value predicted by that cluster's polynomial regression equation. An alternating optimization method is designed to optimize the new clustering objective function. A series of experiments on synthetic and real-world datasets is conducted to evaluate the PR-FCM algorithm, which proves more effective than the original FCM algorithm on these data. The PR-FCM algorithm is then applied to tunnel boring machine (TBM) operation data from a TBM project in China, and the experimental results show that it can effectively cluster such data.

Introduction

With the development of information technology, massive operational data have been measured and recorded for complex engineering systems, promoting the development of data-driven techniques for handling in situ engineering data in recent years. Substantial literature has demonstrated that the information mined from operation data can be used to improve the design, control, and execution of engineering systems [1], [2], [3], [4]. However, the operating state of an engineering system usually changes because it experiences different working conditions, which means that the corresponding operation data patterns change greatly as well. Thus, it is necessary to partition these operation data such that the characteristics in the same part of the dataset are more similar than those in other parts to facilitate the design and analysis of engineering systems, as discussed in Refs. [5], [6]. Data clustering is a strong tool for solving this problem.

Data clustering is an important branch of unsupervised learning; the purpose is to divide the input data according to some criteria so that the data in the same cluster are as similar as possible [7], [8]. One of the most widely used clustering methods is the fuzzy c-means (FCM) algorithm [9], [10]. The FCM algorithm assigns each datum a degree of membership in every cluster, and these memberships are inversely related to the relative distances from the datum to the cluster prototypes, which act as the cluster centers in FCM. The closer a datum is to the center of a cluster, the higher its degree of membership in that cluster. Because many aspects of real life possess no clear boundaries, the concept of membership makes FCM a natural fit for them. Therefore, this approach is widely used in various areas, such as pattern recognition [11], image segmentation [12], and fault detection [13]. For human emotion recognition, Liliana et al. [14] proposed an algorithm to detect facial expressions based on the active appearance model and semisupervised FCM. Liu et al. [15] used the FCM algorithm to detect defect edges in infrared images, providing better performance than that of classic edge detection operators. In [16], Zhao et al. proposed an enhanced gravitation search-based FCM algorithm to identify abnormal power system data. By adding two features representing pixels to the FCM algorithm, Kalti and Mohamed [17] improved the accuracy of image segmentation. Ramos et al. [18] used density-oriented FCM and kernel FCM algorithms to design data-driven fault diagnosis systems. Barraza et al. [19] applied the fireworks algorithm to the FCM algorithm to find the optimal number of clusters required for achieving a better clustering effect. In [20], context-aware spatial constraints and local membership matrix information were incorporated into the classic FCM algorithm, forming a robust FCM algorithm for the segmentation of brain tissues in magnetic resonance imaging. Zhao et al. [21] clustered acoustic emission signals by a combination of the FCM algorithm and principal component analysis.
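
The membership rule described above (memberships inversely related to the relative distances to the cluster prototypes) can be illustrated with a minimal sketch of the standard FCM membership update. The function name `fcm_memberships`, the Euclidean distance, and the small floor `eps` are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np

def fcm_memberships(X, V, m=2.0, eps=1e-12):
    """Standard FCM membership update (illustrative sketch).

    X: (n, s) array, one datum per row.
    V: (c, s) array, one cluster prototype per row.
    Returns U of shape (c, n); each column sums to 1.
    """
    # Squared Euclidean distance from every prototype to every datum.
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Floor the distances so a datum sitting on a prototype does not divide by zero.
    d2 = np.maximum(d2, eps)
    # u_ik is proportional to d_ik^(-2/(m-1)), i.e. (d_ik^2)^(-1/(m-1)).
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

# Usage: three points on a line, two prototypes at the ends.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
V = np.array([[0.0, 0.0], [10.0, 0.0]])
U = fcm_memberships(X, V)
```

With `m = 2`, the middle point is at squared distances 1 and 81 from the two prototypes, so its membership in the first cluster is 81/82: the closer prototype dominates, but the membership stays fuzzy.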

The abovementioned works have provided insights into the availability and potential benefits of FCM for engineering design and analysis. However, the engineering data belonging to different clusters usually overlap greatly, which means that the traditional FCM algorithm cannot provide accurate clustering results in some engineering practices since it partitions data according to their spatial distances. It is necessary to utilize other data patterns, especially nonlinear data patterns, to improve the clustering accuracy of the algorithm. The kernel method has been introduced to fuzzy clustering algorithms to map the input data into a higher-dimensional feature space [22]. The kernel-based fuzzy clustering (FKCM) algorithm is able to recognize the nonlinearity of data, and it has been successfully applied in image segmentation, incomplete data clustering, and noisy data clustering tasks in recent years [23], [24]. In addition, shell-based fuzzy clustering is another algorithm that can address clustering problems with nonlinear data. Unlike the kernel-based clustering method, shell-based clustering takes shells as prototypes and divides the input data into different shells in space [25], [26]. To represent clusters with contour formats [27], shell-based clustering has been applied as an alternative to the FCM algorithm in recent years [28]. In addition, other extensions of the FCM algorithm, such as the relational FCM algorithm [29] and the FCM algorithm that incorporates spatial information [30], are available for representing complex data structures. Some related works regarding extensions of the FCM algorithm are briefly listed in Table 1.

For in situ engineering data, however, it is commonly known that the relationships among the data attributes vary considerably under different working conditions or different operation states. Thus, some in situ engineering data, such as economic data, meteorological data, and equipment detection data, can be seen as functional data. To address this issue, the regression-based FCM algorithm has been proposed. Hathaway et al. [39] proposed a fuzzy c-regression model (FCRM) for functional data based on the ideas of expectation maximization (EM) and regression. Wedel and Steenkamp [40] proposed algorithms for fuzzy clusterwise regression by improving the target function within the framework of preference analysis. Yamakawa [41] added principal component analysis to the FCRM, solving the problem of the FCRM not performing well on high-dimensional datasets. Conversely, Sato-Ilk [42] introduced the kernel method to the FCRM to map it to a high dimension; this approach was aimed at data with complex spatial distributions. Zhao [43] improved the measurement process between samples and clusters by balancing the fitting deviation and spatial distance, endowing the developed clustering method with a richer physical meaning.

Previous works on regression-based FCM algorithms illustrate that these approaches can, to some extent, partition data according to their functional relationships. In practice, however, engineering equipment often suffers from poor working environments, causing the working status of the equipment to vary. In this context, in situ engineering data tend to present complex nonlinear functional relationships, which are hard to partition with the previous linear regression-based FCM algorithms because linear regression has limited capability to approximate data with complex nonlinear functional structures. To address this issue, this paper presents a polynomial regression-based FCM (PR-FCM) clustering algorithm for clustering complex in situ engineering data. The proposed algorithm clusters the input data based on the nonlinear functional relationships among the attributes of the data rather than their spatial distributions. In the proposed algorithm, we use PR to describe these nonlinear functional relationships and utilize the PR of each cluster to replace the traditional clustering prototypes and construct a new fuzzy clustering-based objective function. Then, the corresponding optimization method for the proposed clustering objective function is designed. The PR-FCM algorithm can extract the information contained in the nonlinear functional relationships among the data attributes. Therefore, compared to existing fuzzy clustering algorithms, the PR-FCM algorithm performs better for data with nonlinear functional relationships. The main contributions of the work are summarized as follows:

  • 1) PR is introduced into the FCM framework to represent complex functional structures, and a corresponding iterative optimization method is designed to optimize the parameters of the algorithm. The proposed PR-FCM algorithm partitions data with nonlinear functional relationships; this task is difficult to complete with traditional clustering algorithms.

  • 2) The computational complexity of the proposed algorithm is analyzed theoretically. In addition, the factors that affect the performance of the PR-FCM algorithm, including the polynomial order and the sample size, are studied systematically. These investigations provide guidance for setting up the algorithm and further improve its effectiveness.

  • 3) The PR-FCM algorithm is compared with several benchmark algorithms on not only synthetic datasets but also real-world and tunnel boring machine (TBM) datasets. The results indicate that the PR-FCM algorithm performs better than the other approaches and suggest that it is applicable to engineering data mining.
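
To make the idea in contribution 1) concrete, the following is a minimal sketch of an alternating optimization in which each cluster is represented by a polynomial regression and the distance is the regression residual. It is not the authors' implementation: the update equations are not shown in this excerpt, so the sketch assumes the usual fuzzy c-regression scheme (membership-weighted least squares for the coefficients, then the standard FCM membership update on the residuals). It also assumes a single independent variable; the name `pr_fcm_sketch` and all of its parameters are hypothetical.

```python
import numpy as np

def pr_fcm_sketch(x, y, c=2, order=2, m=2.0, tol=1e-2, max_iter=100, seed=0):
    """Illustrative polynomial-regression fuzzy clustering (assumed scheme).

    x, y: 1-D arrays of length n (independent and dependent attribute).
    Returns U of shape (c, n) and coefficients beta of shape (c, order + 1).
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Polynomial design matrix with columns 1, x, x^2, ..., x^order.
    Phi = np.vander(x, order + 1, increasing=True)
    # Random initial fuzzy partition; each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    beta = np.zeros((c, order + 1))
    for _ in range(max_iter):
        # Step 1: fit one polynomial per cluster by weighted least squares,
        # with weights u_ik^m (applied as square roots to both sides).
        for i in range(c):
            w = np.sqrt(U[i] ** m)
            beta[i] = np.linalg.lstsq(Phi * w[:, None], y * w, rcond=None)[0]
        # Step 2: squared residual of every datum under every cluster's polynomial.
        d2 = np.maximum((y[None, :] - beta @ Phi.T) ** 2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Stop when the partition matrix changes by less than the threshold.
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, beta

# Usage: two overlapping parabolas that spatial distance alone separates poorly.
x = np.linspace(-1.0, 1.0, 40)
xx = np.concatenate([x, x])
yy = np.concatenate([x ** 2, 3.0 - x ** 2])
U, beta = pr_fcm_sketch(xx, yy, c=2, order=2)
```

The defaults mirror the settings reported for the real-world experiments (fuzzification parameter 2, polynomial order 2, threshold $10^{-2}$, at most 100 iterations). As with any alternating scheme started from a random partition, the result can depend on the initialization.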

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the details of FCM, the PR algorithm, and the proposed algorithm. In Sections 4 and 5, several synthetic and real-world datasets are used to test the PR-FCM algorithm and compare its performance with that of several benchmark algorithms. In Section 6, the proposed algorithm is applied to a real TBM operation dataset to demonstrate its effectiveness and advantages in engineering data clustering. Some conclusions are given in Section 7.

Section snippets

Fuzzy c-means

FCM is an algorithm used to cluster a dataset $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{s \times n}$ into $c$ fuzzy clusters according to the following objective function:

$$J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} d_{ik}^{2} \tag{1}$$

and the condition

$$\sum_{i=1}^{c} u_{ik} = 1 \quad (k = 1, 2, \ldots, n); \qquad \forall i, k: u_{ik} \in [0, 1]$$

where $u_{ik}$ is the membership of the k-th datum in the dataset relative to the i-th cluster, $U = [u_{ik}]_{c \times n}$ is the fuzzy partition matrix, $m$ is the weighted index, and $c$ is the number of clusters. $d_{ik}$ is the spatial distance between the k-th datum and the prototype of the i-th cluster and is defined

PR-FCM algorithm

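In this paper, a PR-FCM algorithm is proposed. To cluster input data based on the functional relationships among their attributes, one attribute of the data, $x_j = [x_{j,1}, x_{j,2}, \ldots, x_{j,n}]$, is defined as the dependent variable, and the other attributes are independent variables. Removing $x_j$, a new matrix of independent variables can be composed as

$$X_{new} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{j-1,1} & x_{j+1,1} & \cdots & x_{s,1} \\ 1 & x_{1,2} & \cdots & x_{j-1,2} & x_{j+1,2} & \cdots & x_{s,2} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{1,n} & \cdots & x_{j-1,n} & x_{j+1,n} & \cdots & x_{s,n} \end{bmatrix}$$

$d_{ik}$ in Eq. (1) is defined as

$$d_{ik} = x_{j,k} - \mathrm{PR}_{i,k}$$

where $\mathrm{PR}_i$ is an estimation of the k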
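
The construction of the independent-variable matrix described in this section can be sketched in a few lines. This is only an illustration: the helper name `split_dependent` is hypothetical, the data are assumed to be stored with attributes as rows (matching the $s \times n$ layout used for $X$), and the index `j` is zero-based here whereas the text counts attributes from 1.

```python
import numpy as np

def split_dependent(X, j):
    """Separate the dependent attribute and build the matrix of independent variables.

    X: (s, n) array with one attribute per row and one datum per column.
    j: zero-based index of the attribute chosen as the dependent variable.
    Returns (X_new, y) with X_new of shape (n, s) and y of shape (n,).
    """
    y = X[j]                            # the dependent attribute x_j
    others = np.delete(X, j, axis=0)    # remaining s - 1 attributes
    ones = np.ones(X.shape[1])          # leading column of ones (intercept term)
    X_new = np.column_stack([ones, others.T])
    return X_new, y

# Usage: three attributes, three data; the second attribute is the dependent one.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
X_new, y = split_dependent(X, 1)
```

Each row of `X_new` then corresponds to one datum, beginning with 1 and followed by the values of all attributes except the dependent one.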

Experiments on synthetic datasets

The PR-FCM algorithm is first validated on synthetic datasets. Each dataset is named according to its size, attributes, clusters, and the relationships among its attributes. For instance, N400A2C2F1 means that the dataset has 400 data objects and two attributes and that it can be divided into two clusters. F1 denotes the functional relationship among the attributes. In this experiment, one attribute is set as the dependent variable, and the other attributes are independent
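
The naming convention for the synthetic datasets can be decoded mechanically. The sketch below assumes the pattern is always `N<objects>A<attributes>C<clusters>F<relationship index>`, as suggested by the N400A2C2F1 example; the names `DatasetSpec` and `parse_dataset_name` are hypothetical and not from the paper.

```python
import re
from typing import NamedTuple

class DatasetSpec(NamedTuple):
    n_objects: int        # number of data objects (N)
    n_attributes: int     # number of attributes (A)
    n_clusters: int       # number of clusters (C)
    relationship_id: int  # index of the functional relationship (F)

# Assumed pattern, inferred from the single example given in the text.
_NAME_PATTERN = re.compile(r"N(\d+)A(\d+)C(\d+)F(\d+)")

def parse_dataset_name(name: str) -> DatasetSpec:
    """Decode a synthetic dataset name such as 'N400A2C2F1'."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    return DatasetSpec(*(int(group) for group in match.groups()))

spec = parse_dataset_name("N400A2C2F1")
```

Here `spec` holds 400 objects, 2 attributes, 2 clusters, and relationship index 1.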

Experiments on real-world datasets

In this section, four real-world datasets are used to further assess the validity of the PR-FCM algorithm. In these experiments, the parameters of the proposed algorithm are as follows: the fuzzification parameter m is 2, the polynomial order of PR-FCM is 2, the threshold value ε is $10^{-2}$, and the maximum number of iterations is 100.

Engineering application with the TBM operation dataset

In this section, the proposed algorithm is applied to an in situ TBM dataset obtained from a tunnel project in a city in China. The tunnel is 2000 m long and 6.4 m in diameter. From the ground surface to the tunnel floor, various geological layers, including clay, sand, and rock layers, are unevenly distributed. To excavate the tunnel, an earth pressure balance shield TBM is used; this TBM consists of a cutterhead, a chamber, a screw conveyor, a tail skin, and other auxiliary subsystems. The

Conclusion

In this paper, a new clustering algorithm referred to as PR-FCM is proposed based on PR and the FCM algorithm. The proposed algorithm is constructed under the FCM framework, but the utilized distance metric is based on the error between the real value of one attribute and that estimated by the PR model of each cluster. An alternating optimization method is designed to obtain the optimal data partitions. Considering that the in situ engineering data of different clusters usually overlap but

CRediT authorship contribution statement

Yong Pang: Methodology, Software, Formal analysis, Validation, Visualization, Investigation, Writing – original draft. Maolin Shi: Conceptualization, Methodology, Data curation. Liyong Zhang: Conceptualization, Writing – review & editing. Xueguan Song: Supervision, Writing – review & editing, Resources, Funding acquisition. Wei Sun: Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China [grant number 2018YFB1702502] and the National Natural Science Foundation of China [grant number 52075068].

References (49)

  • L. Junyan et al., Defects’ geometric feature recognition based on infrared image edge detection, Infrared Phys. Technol. (2014)
  • A. Rodríguez Ramos et al., A novel fault diagnosis scheme applying fuzzy clustering algorithms, Appl. Soft Comput. (2017)
  • A. Kouhi et al., Robust FCM clustering algorithm with combined spatial constraint and membership matrix local information for brain MRI segmentation, Expert Syst. Appl. (2020)
  • G. Zhao et al., Clustering of AE signals collected during torsional tests of 3D braiding composite shafts using PCA and FCM, Compos. B Eng. (2019)
  • L.u. Wang et al., Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Inf. Sci. (2020)
  • A. Gavioli et al., Identification of management zones in precision agriculture: an evaluation of alternative cluster analysis methods, Biosyst. Eng. (2019)
  • R.J. Hathaway et al., Relational duals of the c-means clustering algorithms, Pattern Recogn. (1989)
  • M.A. Khalilia, Improvements to the relational fuzzy c-means clustering algorithm, Pattern Recogn. (2014)
  • Y. Peng, Fuzzy graph clustering, Inf. Sci. (2021)
  • M. Shi, A fuzzy c-means algorithm based on the relationship among attributes of data and its application in tunnel boring machine, Knowl.-Based Syst. (2020)
  • M. Wedel et al., A fuzzy clusterwise regression approach to benefit segmentation, Int. J. Res. Mark. (1989)
  • R.J. Campello, A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment, Pattern Recogn. Lett. (2007)
  • J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms (2013)
  • C. Liu et al., LDS-FCM: A linear dynamical system based fuzzy C-means method for tactile recognition, IEEE Trans. Fuzzy Syst. (2019)

1 Yong Pang and Maolin Shi contributed equally to this paper.