PR-FCM: A polynomial regression-based fuzzy C-means algorithm for attribute-associated data
Introduction
With the development of information technology, massive operational data have been measured and recorded for complex engineering systems, promoting the development of data-driven techniques for handling in situ engineering data in recent years. Substantial literature has demonstrated that the information mined from operation data can be used to improve the design, control, and execution of engineering systems [1], [2], [3], [4]. However, the operating state of an engineering system usually changes because it experiences different working conditions, which means that the corresponding operation data patterns change greatly as well. Thus, it is necessary to partition these operation data such that the characteristics in the same part of the dataset are more similar than those in other parts to facilitate the design and analysis of engineering systems, as discussed in Refs. [5], [6]. Data clustering is a strong tool for solving this problem.
Data clustering is an important branch of unsupervised learning; its purpose is to divide the input data according to some criteria so that the data in the same cluster are as similar as possible [7], [8]. One of the most widely used clustering methods is the fuzzy c-means (FCM) algorithm [9], [10]. The FCM algorithm assigns each datum a degree of membership in every cluster, and these degrees are inversely related to the relative distances from the datum to the cluster prototypes, which act as the cluster centers in FCM. The closer a datum is to the center of a cluster, the higher its degree of membership in that cluster. Because many aspects of real life possess no clear boundaries, the concept of membership makes FCM well suited to describing them. Therefore, this approach is widely used in various areas, such as pattern recognition [11], image segmentation [12], and fault detection [13]. For human emotion recognition, Liliana et al. [14] proposed an algorithm to detect facial expressions based on the active appearance model and semisupervised FCM. Liu et al. [15] used the FCM algorithm to detect defect edges in infrared images, providing better performance than that of classic edge detection operators. In [16], Zhao et al. proposed an enhanced gravitation search-based FCM algorithm to identify abnormal power system data. By adding two features representing pixels to the FCM algorithm, Kalti and Mohamed [17] improved the accuracy of image segmentation. Ramos et al. [18] used density-oriented FCM and kernel FCM algorithms to design data-driven fault diagnosis systems. Barraza et al. [19] applied the fireworks algorithm to the FCM algorithm to find the optimal number of clusters required for achieving a better clustering effect. In [20], the context-aware spatial constraints and local membership matrix information were incorporated into the classic FCM algorithm, forming a robust FCM algorithm for the segmentation of brain tissues in magnetic resonance imaging. Zhao et al. [21] clustered acoustic emission signals by a combination of the FCM algorithm and principal component analysis.
The abovementioned works have provided insights into the availability and potential benefits of FCM for engineering design and analysis. However, the engineering data belonging to different clusters usually overlap greatly, which means that the traditional FCM algorithm cannot provide accurate clustering results in some engineering practices since it partitions data according to their spatial distances. It is necessary to utilize other data patterns, especially nonlinear data patterns, to improve the clustering accuracy of the algorithm. The kernel method has been introduced to fuzzy clustering algorithms to map the input data into a higher-dimensional feature space [22]. The kernel-based fuzzy clustering (FKCM) algorithm is able to recognize the nonlinearity of data, and it has been successfully applied in image segmentation, incomplete data clustering, and noisy data clustering tasks in recent years [23], [24]. In addition, shell-based fuzzy clustering is another algorithm that can address clustering problems with nonlinear data. Unlike the kernel-based clustering method, shell-based clustering takes shells as prototypes and divides the input data into different shells in space [25], [26]. To represent clusters with contour formats [27], shell-based clustering has been applied as an alternative to the FCM algorithm in recent years [28]. In addition, other extensions of the FCM algorithm, such as the relational FCM algorithm [29] and the FCM algorithm that incorporates spatial information [30], are available for representing complex data structures. Some related works regarding extensions of the FCM algorithm are briefly listed in Table 1.
For in situ engineering data, however, it is commonly known that the relationships among the data attributes vary considerably under different working conditions or operation states. Thus, some in situ engineering data, such as economic data, meteorological data, and equipment detection data, can be seen as functional data. To address this issue, regression-based FCM algorithms have been proposed. Hathaway et al. [39] proposed a fuzzy c-regression model (FCRM) for functional data based on the ideas of expectation maximization (EM) and regression. Wedel and Steenkamp [40] proposed algorithms for fuzzy clusterwise regression by improving the target function within the framework of preference analysis. Yamakawa [41] added principal component analysis to the FCRM to address its poor performance on high-dimensional datasets. Sato-Ilk [42] instead introduced the kernel method to the FCRM to map the data into a high-dimensional space; this approach was aimed at data with complex spatial distributions. Zhao [43] improved the measure between samples and clusters by balancing the fitting deviation and the spatial distance, giving the developed clustering method a richer physical meaning.
Previous works on regression-based FCM algorithms illustrate that these approaches can, to some extent, partition data according to their functional relationships. In practice, however, engineering equipment often operates in poor working environments, causing its working status to vary. In this context, in situ engineering data tend to present complex nonlinear functional relationships, which are hard to partition with the previous linear regression-based FCM algorithms because linear regression has limited capability to approximate data with complex nonlinear functional structures. To address this issue, this paper presents a polynomial regression-based FCM (PR-FCM) clustering algorithm for clustering complex in situ engineering data. The proposed algorithm clusters the input data based on the nonlinear functional relationships among the attributes of the data rather than their spatial distributions. In the proposed algorithm, we use PR to describe these nonlinear functional relationships, replace the traditional clustering prototypes with the PR model of each cluster, and construct a new fuzzy clustering objective function. Then, the corresponding optimization method for the proposed objective function is designed. The PR-FCM algorithm can extract the information contained in the nonlinear functional relationships among the data attributes. Therefore, compared to existing fuzzy clustering algorithms, the PR-FCM algorithm performs better for data with nonlinear functional relationships. The main contributions of the work are summarized as follows:
1) PR is introduced into the FCM framework to represent complex functional structures, and a corresponding iterative optimization method is designed to optimize the parameters of the algorithm. The proposed PR-FCM algorithm partitions data with nonlinear functional relationships; this task is difficult to complete with traditional clustering algorithms.
2) The computational complexity of the proposed algorithm is analyzed theoretically. In addition, the factors that affect the performance of the PR-FCM algorithm, including the polynomial order and the sample size, are studied systematically. These investigations provide guidance for setting up the algorithm and further improve its effectiveness.
3) The PR-FCM algorithm is compared with several benchmark algorithms on synthetic datasets, real-world datasets, and a tunnel boring machine (TBM) dataset. The results indicate that the PR-FCM algorithm performs better than the other approaches and imply its applicability in engineering data mining.
The remainder of the paper is organized as follows. Sections 2 and 3 introduce the details of FCM, the PR algorithm, and the proposed algorithm. In Sections 4 and 5, several synthetic datasets and real-world datasets are used to test and compare the performance of the PR-FCM algorithm with that of several benchmark algorithms. In Section 6, the proposed algorithm is applied to a real TBM operation dataset to demonstrate its effectiveness and advantages in engineering data clustering. Some conclusions are given in Section 7.
Section snippets
Fuzzy c-means
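FCM is an algorithm used to cluster a dataset of $n$ data into $c$ fuzzy clusters according to the following objective function:
$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} d_{ik}^{2} \tag{1}$$
and the condition
$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, \dots, n,$$
where $u_{ik}$ is the membership of the $k$-th datum in the dataset relative to the $i$-th cluster, $U = [u_{ik}]$ is the fuzzy partition matrix, $m$ is the weighted index, and $c$ is the number of clusters. $d_{ik}$ is the spatial distance between the $k$-th datum and the prototype of the $i$-th cluster and is defined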
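The alternating scheme commonly used to minimize the FCM objective can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `memberships` and `fcm`, the Euclidean distance, the evenly spaced initial prototypes, and the toy `points` are all assumptions made for the example. The default stopping threshold and iteration cap mirror the values reported for the experiments.

```python
import math


def memberships(dists, m=2.0):
    """FCM memberships of one datum from its distances to the c prototypes."""
    # Floor the distances so a datum lying exactly on a prototype does not divide by zero.
    d = [max(v, 1e-12) for v in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in d) for di in d]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fcm(data, c, m=2.0, eps=1e-2, max_iter=100):
    """Alternate membership and prototype updates until memberships stabilize."""
    n, dim = len(data), len(data[0])
    # Assumption: prototypes start at c data points spread evenly through the list.
    protos = [list(data[i * (n - 1) // max(c - 1, 1)]) for i in range(c)]
    u = [[0.0] * c for _ in range(n)]  # u[k][i]: membership of datum k in cluster i
    for _ in range(max_iter):
        new_u = [memberships([euclidean(x, v) for v in protos], m) for x in data]
        for i in range(c):
            # Each prototype is the mean of the data weighted by membership ** m.
            w = [row[i] ** m for row in new_u]
            protos[i] = [sum(wk * x[j] for wk, x in zip(w, data)) / sum(w)
                         for j in range(dim)]
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:
            break
    return u, protos


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
u, protos = fcm(points, c=2)
```

Because the memberships depend only on spatial distances to the prototypes, this scheme separates the two groups above easily but has no way to use functional relationships among attributes.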
PR-FCM algorithm
In this paper, a PR-FCM algorithm is proposed. To cluster input data based on the functional relationships among their attributes, one attribute of the data is defined as the dependent variable, and the other attributes are independent variables. Removing the dependent variable, a new matrix of independent variables can be composed.
The distance in Eq. (1) is defined as the error between the real value of the dependent variable and its estimation by the PR model of the corresponding cluster.
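The idea of replacing cluster prototypes with per-cluster polynomial models can be sketched as below. This is an illustration under stated assumptions, not the authors' algorithm: it handles a single independent variable, takes the absolute error between the observed and estimated dependent variable as the distance, fits each polynomial by weighted least squares with weights $u^m$, and starts from a random fuzzy partition. All function names (`solve`, `weighted_polyfit`, `polyval`, `memberships`, `pr_fcm`) and the synthetic parabola data are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= f * aug[col][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][j] * x[j] for j in range(r + 1, n))) / aug[r][r]
    return x


def weighted_polyfit(xs, ys, ws, order):
    """Weighted least-squares polynomial fit; coefficients are lowest power first."""
    p = order + 1
    a = [[sum(w * x ** (i + j) for x, w in zip(xs, ws)) for j in range(p)]
         for i in range(p)]
    b = [sum(w * y * x ** i for x, y, w in zip(xs, ys, ws)) for i in range(p)]
    return solve(a, b)


def polyval(coefs, x):
    """Evaluate a polynomial whose coefficients are lowest power first."""
    return sum(cf * x ** i for i, cf in enumerate(coefs))


def memberships(dists, m):
    """Standard FCM membership update applied to regression errors."""
    d = [max(v, 1e-12) for v in dists]  # floor avoids division by zero
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in d) for di in d]


def pr_fcm(xs, ys, c, order=2, m=2.0, eps=1e-2, max_iter=100, seed=0):
    """Alternate between per-cluster polynomial fits and membership updates."""
    rng = random.Random(seed)
    u = []
    for _ in range(len(xs)):  # random initial fuzzy partition, each row sums to 1
        row = [rng.random() + 1e-3 for _ in range(c)]
        u.append([v / sum(row) for v in row])
    for _ in range(max_iter):
        # Assumed model update: one weighted polynomial fit per cluster.
        models = [weighted_polyfit(xs, ys, [row[i] ** m for row in u], order)
                  for i in range(c)]
        # Assumed distance: absolute error of each cluster's estimate.
        new_u = [memberships([abs(y - polyval(mod, x)) for mod in models], m)
                 for x, y in zip(xs, ys)]
        change = max(abs(p - q) for rp, rq in zip(new_u, u) for p, q in zip(rp, rq))
        u = new_u
        if change < eps:
            break
    return u, models


# Two parabolas sampled on the same x range, so the clusters overlap spatially.
grid = [-2.0 + 0.1 * k for k in range(41)]
xs = grid + grid
ys = [x * x for x in grid] + [4.0 - x * x for x in grid]
u, models = pr_fcm(xs, ys, c=2)
```

Like other alternating schemes, this sketch can stop at a local optimum that depends on the initial partition, so in practice several seeds would typically be tried.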
Experiments on synthetic datasets
The PR-FCM algorithm is first validated on synthetic datasets. Each dataset is named according to its size, attributes, clusters, and the relationships among its attributes. For instance, N400A2C2F1 denotes a dataset with 400 object data (N400) and two attributes (A2) that can eventually be divided into two clusters (C2); F1 represents the functional relationships among the attributes. In this experiment, one attribute is set as the dependent variable, and the other attributes are independent
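The naming scheme can be decoded mechanically. The sketch below assumes that every name follows exactly the pattern of the example N400A2C2F1; the `DatasetSpec` type, its field names, and `parse_dataset_name` are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    n_objects: int        # N: number of object data
    n_attributes: int     # A: number of attributes
    n_clusters: int       # C: number of clusters
    relationship_id: int  # F: index of the functional relationship


# Assumed pattern: N<objects>A<attributes>C<clusters>F<relationship>.
_NAME_PATTERN = re.compile(r"N(\d+)A(\d+)C(\d+)F(\d+)")


def parse_dataset_name(name: str) -> DatasetSpec:
    """Decode a synthetic dataset name such as 'N400A2C2F1'."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    return DatasetSpec(*(int(group) for group in match.groups()))


spec = parse_dataset_name("N400A2C2F1")
```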
Experiments on real-world datasets
In this section, four real-world datasets are used to further assess the validity of the PR-FCM algorithm. In these experiments, the parameters of the proposed algorithm are as follows: the fuzzification parameter m is 2, the polynomial order of PR-FCM is 2, the threshold value ε is $10^{-2}$, and the maximum number of iterations is 100.
Engineering application with the TBM operation dataset
In this section, the proposed algorithm is applied to an in situ TBM dataset obtained from a tunnel project in a city in China. The tunnel is 2000 m long and 6.4 m in diameter. From the ground surface to the tunnel floor, various geological layers, including clay, sand, and rock layers, are unevenly distributed. To excavate the tunnel, an earth pressure balance shield TBM is used; this TBM consists of a cutterhead, a chamber, a screw conveyor, a tail skin, and other auxiliary subsystems. The
Conclusion
In this paper, a new clustering algorithm referred to as PR-FCM is proposed based on PR and the FCM algorithm. The proposed algorithm is constructed under the FCM framework, but the utilized distance metric is based on the error between the real value of one attribute and that estimated by the PR model of each cluster. An alternating optimization method is designed to obtain the optimal data partitions. Considering that the in situ engineering data of different clusters usually overlap but
CRediT authorship contribution statement
Yong Pang: Methodology, Software, Formal analysis, Validation, Visualization, Investigation, Writing – original draft. Maolin Shi: Conceptualization, Methodology, Data curation. Liyong Zhang: Conceptualization, Writing – review & editing. Xueguan Song: Supervision, Writing – review & editing, Resources, Funding acquisition. Wei Sun: Project administration, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Key R&D Program of China [grant number 2018YFB1702502] and the National Natural Science Foundation of China [grant number 52075068].
References (49)
- et al. Adaptive fault detection and diagnosis using an evolving fuzzy classifier. Inf. Sci. (2013)
- et al. Dynamic load prediction of tunnel boring machine (TBM) based on heterogeneous in-situ data. Autom. Constr. (2018)
- et al. Recurrent neural networks for real-time prediction of TBM operating parameters. Autom. Constr. (2019)
- et al. Design of fuzzy system-fuzzy neural network-backstepping control for complex robot system. Inf. Sci. (2021)
- et al. Fuzzy forecasting based on linear combinations of independent variables, subtractive clustering algorithm and artificial bee colony algorithm. Inf. Sci. (2019)
- et al. Time-series clustering–a decade review. Inform. Syst. (2015)
- et al. A review of clustering techniques and developments. Neurocomputing (2017)
- et al. Fuzzy granular gravitational clustering algorithm for multivariate data. Inf. Sci. (2014)
- et al. FCM-RDpA: TSK fuzzy regression model construction using fuzzy C-means clustering, regularization, Droprule, and Powerball Adabelief. Inf. Sci. (2021)
- et al. Extended power-based aggregation of distance functions and application in image segmentation. Inf. Sci. (2019)
- Defects’ geometric feature recognition based on infrared image edge detection. Infrared Phys. Technol.
- A novel fault diagnosis scheme applying fuzzy clustering algorithms. Appl. Soft Comput.
- Robust FCM clustering algorithm with combined spatial constraint and membership matrix local information for brain MRI segmentation. Expert Syst. Appl.
- Clustering of AE signals collected during torsional tests of 3D braiding composite shafts using PCA and FCM. Compos. B Eng.
- Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels. Inf. Sci.
- Identification of management zones in precision agriculture: an evaluation of alternative cluster analysis methods. Biosyst. Eng.
- Relational duals of the c-means clustering algorithms. Pattern Recogn.
- Improvements to the relational fuzzy c-means clustering algorithm. Pattern Recogn.
- Fuzzy graph clustering. Inf. Sci.
- A fuzzy c-means algorithm based on the relationship among attributes of data and its application in tunnel boring machine. Knowl.-Based Syst.
- A fuzzy clusterwise regression approach to benefit segmentation. Int. J. Res. Mark.
- A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recogn. Lett.
- Pattern recognition with fuzzy objective function algorithms.
- LDS-FCM: A linear dynamical system based fuzzy C-means method for tactile recognition. IEEE Trans. Fuzzy Syst.
1 Yong Pang and Maolin Shi contributed equally to this paper.