PR-FCM: A polynomial regression-based fuzzy C-means algorithm for attribute-associated data

doi:10.1016/j.ins.2021.11.056

Information Sciences

Volume 585, March 2022, Pages 209-231

https://doi.org/10.1016/j.ins.2021.11.056

Highlights

• A novel fuzzy c-means algorithm is proposed for attribute-associated data clustering.
• The parameters of the proposed algorithm are fully investigated through synthetic datasets.
• The proposed algorithm performs better than the compared algorithms on synthetic, real-world, and tunnel boring machine datasets.

Abstract

Partitioning data into internally homogeneous parts is an important problem when mining in situ engineering data. In this paper, a polynomial regression-based fuzzy c-means (PR-FCM) clustering algorithm that utilizes the functional relationships among the attributes of the input dataset is proposed. In this algorithm, a polynomial regression equation is taken as the center of each cluster instead of the cluster prototype used in conventional FCM, and the difference between a sample and a cluster prototype is defined as the distance between the actual value of one attribute and the corresponding predicted value provided by its own polynomial regression equation. An alternating optimization method is designed to optimize the new clustering objective function of the proposed algorithm. A series of experiments on synthetic and real-world datasets are conducted to evaluate the performance of the PR-FCM algorithm, which exhibits higher effectiveness and possesses more advantages than the original FCM algorithm. The PR-FCM algorithm is applied to tunnel boring machine (TBM) operation data from a TBM project in China. The experimental results show that the proposed algorithm can effectively cluster TBM operation data.

Introduction

With the development of information technology, massive operational data have been measured and recorded for complex engineering systems, promoting the development of data-driven techniques for handling in situ engineering data in recent years. Substantial literature has demonstrated that the information mined from operation data can be used to improve the design, control, and execution of engineering systems [1], [2], [3], [4]. However, the operating state of an engineering system usually changes because it experiences different working conditions, which means that the corresponding operation data patterns change greatly as well. Thus, it is necessary to partition these operation data such that the characteristics in the same part of the dataset are more similar than those in other parts to facilitate the design and analysis of engineering systems, as discussed in Refs. [5], [6]. Data clustering is a strong tool for solving this problem.

Data clustering is an important branch of unsupervised learning; the purpose is to divide the input data according to some criteria so that the data in the same cluster are as similar as possible [7], [8]. One of the most widely used clustering methods is the fuzzy c-means (FCM) algorithm [9], [10]. The FCM algorithm assigns membership degrees to each datum, and these degrees are inversely related to the relative distances from the datum to the cluster prototypes that act as the cluster centers in FCM. The closer a datum is to the center of a cluster, the higher its degree of membership in this cluster. Because many real-world phenomena have no clear boundaries, the membership concept of FCM suits them well. Therefore, this approach is widely used in various areas, such as pattern recognition [11], image segmentation [12], and fault detection [13]. For human emotion recognition, Liliana et al. [14] proposed an algorithm to detect facial expressions based on the active appearance model and semisupervised FCM. Liu et al. [15] used the FCM algorithm to detect defect edges in infrared images, providing better performance than that of classic edge detection operators. In [16], Zhao et al. proposed an enhanced gravitation search-based FCM algorithm to identify abnormal power system data. By adding two features representing pixels to the FCM algorithm, Kalti and Mohamed [17] improved the accuracy of image segmentation. Ramos et al. [18] used density-oriented FCM and kernel FCM algorithms to design data-driven fault diagnosis systems. Barraza et al. [19] applied the fireworks algorithm to the FCM algorithm to find the optimal number of clusters required for achieving a better clustering effect. In [20], the context-aware spatial constraints and local membership matrix information were incorporated into the classic FCM algorithm, forming a robust FCM algorithm for the segmentation of brain tissues in magnetic resonance imaging. Zhao et al. [21] clustered acoustic emission signals by a combination of the FCM algorithm and principal component analysis.

The abovementioned works have provided insights into the availability and potential benefits of FCM for engineering design and analysis. However, the engineering data belonging to different clusters usually overlap greatly, which means that the traditional FCM algorithm cannot provide accurate clustering results in some engineering practices since it partitions data according to their spatial distances. It is necessary to utilize other data patterns, especially nonlinear data patterns, to improve the clustering accuracy of the algorithm. The kernel method has been introduced to fuzzy clustering algorithms to map the input data into a higher-dimensional feature space [22]. The kernel-based fuzzy clustering (FKCM) algorithm is able to recognize the nonlinearity of data, and it has been successfully applied in image segmentation, incomplete data clustering, and noisy data clustering tasks in recent years [23], [24]. In addition, shell-based fuzzy clustering is another algorithm that can address clustering problems with nonlinear data. Unlike the kernel-based clustering method, shell-based clustering takes shells as prototypes and divides the input data into different shells in space [25], [26]. To represent clusters with contour formats [27], shell-based clustering has been applied as an alternative to the FCM algorithm in recent years [28]. In addition, other extensions of the FCM algorithm, such as the relational FCM algorithm [29] and the FCM algorithm that incorporates spatial information [30], are available for representing complex data structures. Some related works regarding extensions of the FCM algorithm are briefly listed in Table 1.

For in situ engineering data, however, it is commonly known that the relationships among the data attributes vary considerably under different working conditions or different operation states. Thus, some in situ engineering data, such as economic data, meteorological data, and equipment detection data, can be seen as functional data. To address this issue, the regression-based FCM algorithm has been proposed. Hathaway et al. [39] proposed a fuzzy c-regression model (FCRM) for functional data based on the ideas of expectation maximization (EM) and regression. Wedel and Steenkamp [40] proposed algorithms for fuzzy clusterwise regression by improving the target function within the framework of preference analysis. Yamakawa [41] added principal component analysis to the FCRM, solving the problem of the FCRM not performing well on high-dimensional datasets. Conversely, Sato-Ilk [42] introduced the kernel method to the FCRM to map it to a high dimension; this approach was aimed at data with complex spatial distributions. Zhao [43] improved the measurement process between samples and clusters by balancing the fitting deviation and spatial distance, endowing the developed clustering method with a richer physical meaning.

Previous works on regression-based FCM algorithms illustrate that these approaches can, to some extent, partition data according to their functional relationships. In practice, however, engineering equipment often operates in harsh working environments, causing the working status of the equipment to vary. In this context, in situ engineering data tend to present complex nonlinear functional relationships, which are hard to partition with the previous linear regression-based FCM algorithms because of the limited approximation capability of the linear regression technique for data with complex nonlinear functional structures. To address this issue, this paper presents a polynomial regression-based FCM (PR-FCM) clustering algorithm to solve the problem of clustering complex in situ engineering data. The proposed algorithm clusters the input data based on the nonlinear functional relationships among the attributes of the data rather than their spatial distributions. In the proposed algorithm, we use PR to describe these nonlinear functional relationships and utilize the PR of each cluster to replace the traditional clustering prototypes and construct a new fuzzy clustering-based objective function. Then, the corresponding optimization method for the proposed clustering objective function is designed. The PR-FCM algorithm can extract the information contained in the nonlinear functional relationships among the data attributes. Therefore, compared with existing fuzzy clustering algorithms, the PR-FCM algorithm performs better for data with nonlinear functional relationships. The main contributions of the work are summarized as follows:

1) PR is introduced into the FCM framework to represent complex functional structures, and a corresponding iterative optimization method is designed to optimize the parameters of the algorithm. The proposed PR-FCM algorithm partitions data with nonlinear functional relationships; this task is difficult to complete with the traditional clustering algorithm.
2) The computational complexity of the proposal is analyzed theoretically. In addition, the factors that affect the performance of the PR-FCM algorithm are studied systematically, including the polynomial order and the sample size. These investigations provide meaningful guidance and great convenience for the setup of the algorithm and further improve its effectiveness.
3) The PR-FCM algorithm is compared with several benchmark algorithms on not only synthetic datasets but also real-world and tunnel boring machine (TBM) datasets. The results indicate that the PR-FCM algorithm performs better than other approaches and imply the applicability of the PR-FCM algorithm in engineering data mining.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the details of FCM, the PR algorithm, and the proposed algorithm. In Sections 4 and 5, several synthetic and real-world datasets are used to test and compare the performance of the PR-FCM algorithm with that of several benchmark algorithms. In Section 6, the proposed algorithm is applied to a real TBM operation dataset to demonstrate its effectiveness and advantages in engineering data clustering. Some conclusions are given in Section 7.

Section snippets

Fuzzy c-means

FCM is an algorithm used to cluster a dataset $X = \{x_{1}, x_{2}, \dots, x_{n}\} \subset R^{s \times n}$ into $c$ fuzzy clusters according to the following objective function:

$J(U, V) = \sum_{i = 1}^{c} \sum_{k = 1}^{n} u_{ik}^{m} d_{ik}^{2} \quad (1)$

and the condition

$\sum_{i = 1}^{c} u_{ik} = 1 \quad (k = 1, 2, \dots, n; \ \forall i, k : u_{ik} \in [0, 1])$

where $u_{ik}$ is the membership of the $k$-th datum in the dataset relative to the $i$-th cluster, $U = {[u_{ik}]}_{c \times n}$ is the fuzzy partition matrix, $m$ is the weighted index, and $c$ is the number of clusters. $d_{ik}$ is the spatial distance between the $k$-th datum and the prototype of the $i$-th cluster and is defined
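For reference, one alternating-optimization step of conventional FCM can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that $d_{ik}$ is the Euclidean distance between datum $k$ and prototype $i$ (the definition is cut off in this preview), it stores the data as one row per datum, and the name `fcm_step` and its arguments are hypothetical. The two update rules are the standard FCM ones for the objective $J(U, V)$ under the sum-to-one constraint.

```python
import numpy as np

def fcm_step(X, V, m=2.0, tiny=1e-12):
    """One alternating-optimization step of standard FCM (illustrative sketch).

    X: (n, s) data, one row per datum. V: (c, s) current cluster prototypes.
    Returns memberships U (c, n), updated prototypes (c, s), and the objective J.
    """
    # Squared Euclidean distances d_ik^2 (assumed metric) between prototype i and datum k.
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, tiny)  # avoid division by zero when a datum sits on a prototype

    # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)

    # Prototype update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m.
    W = U ** m
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)

    # Objective evaluated with the new memberships and the incoming prototypes.
    J = float((W * d2).sum())
    return U, V_new, J
```

Calling `fcm_step` repeatedly, feeding the returned prototypes back in, gives the usual FCM iteration; each column of `U` sums to one, which is the constraint stated above.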

PR-FCM algorithm

In this paper, a PR-FCM algorithm is proposed. To cluster input data based on the functional relationships among their attributes, one attribute of the data $x_{j} = [x_{j, 1}, x_{j, 2}, \dots, x_{j, n}]$ is defined as the dependent variable, and the other attributes are independent variables. Removing $x_{j}$, a new matrix of independent variables can be composed as

$X_{new} = \begin{bmatrix} 1 & x_{1, 1} & \dots & x_{(j - 1), 1} & x_{(j + 1), 1} & \dots & x_{s, 1} \\ 1 & x_{1, 2} & \dots & x_{(j - 1), 2} & x_{(j + 1), 2} & \dots & x_{s, 2} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{1, n} & \dots & x_{(j - 1), n} & x_{(j + 1), n} & \dots & x_{s, n} \end{bmatrix}$

$d_{ik}$ in Eq. (1) is defined as $d_{ik} = x_{j, k} - {PR}_{i, k}$ where ${PR}_{i}$ is an estimation of the $k$
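The idea of replacing the prototype distance with a regression residual can be illustrated with the sketch below. It is written under several assumptions that are not confirmed by this preview: each cluster's polynomial is fitted by weighted least squares with weights $u_{ik}^{m}$ (the minimizer of the objective for fixed memberships), the polynomial uses per-attribute powers without cross terms, the memberships follow the standard FCM update with the residual as $d_{ik}$, and iteration stops when the largest membership change falls below a threshold. The names `poly_design`, `pr_fcm`, and `U0` are hypothetical.

```python
import numpy as np

def poly_design(X_ind, order):
    """Columns [1, x, x^2, ..., x^order] for each independent attribute (no cross terms; an assumption)."""
    X_ind = np.asarray(X_ind, dtype=float)
    if X_ind.ndim == 1:
        X_ind = X_ind[:, None]
    cols = [np.ones(len(X_ind))]
    for p in range(1, order + 1):
        cols.extend(X_ind[:, a] ** p for a in range(X_ind.shape[1]))
    return np.column_stack(cols)

def pr_fcm(X_ind, y, c, order=2, m=2.0, eps=1e-2, max_iter=100, U0=None, seed=0):
    """Sketch of fuzzy clustering with one polynomial regression model per cluster.

    X_ind: independent attributes (n,) or (n, s-1). y: dependent attribute (n,).
    Returns memberships U (c, n) and polynomial coefficients B (c, p).
    """
    y = np.asarray(y, dtype=float)
    Phi = poly_design(X_ind, order)                  # (n, p) design matrix
    n = len(y)
    rng = np.random.default_rng(seed)
    U = rng.random((c, n)) if U0 is None else np.array(U0, dtype=float)
    U /= U.sum(axis=0, keepdims=True)                # each column sums to 1

    for _ in range(max_iter):
        W = U ** m
        # Weighted least squares per cluster:
        # beta_i minimizes sum_k u_ik^m (y_k - Phi_k beta_i)^2.
        B = np.empty((c, Phi.shape[1]))
        for i in range(c):
            sw = np.sqrt(W[i])
            B[i], *_ = np.linalg.lstsq(sw[:, None] * Phi, sw * y, rcond=None)

        # d_ik = y_k - PR_{i,k}: residual of datum k under the polynomial of cluster i.
        d2 = np.maximum((y[None, :] - B @ Phi.T) ** 2, 1e-12)

        # Standard FCM membership update, with the residual in place of the distance.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)

        converged = np.abs(U_new - U).max() < eps    # assumed stopping rule
        U = U_new
        if converged:
            break
    return U, B
```

Like other alternating schemes, this sketch only reaches a local optimum, so the result depends on the initial memberships; passing an informed `U0` or trying several seeds is a reasonable precaution.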

Experiments on synthetic datasets

The PR-FCM algorithm is first validated on synthetic datasets. Each dataset is named according to its numbers of objects, attributes, and clusters and the relationships among its attributes. For instance, N400A2C2F1 means that the dataset has 400 data objects and two attributes and that it can eventually be divided into two clusters. F1 represents the functional relationships among the attributes. In this experiment, one attribute is set as the dependent variable, and the other attributes are independent
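The naming scheme can be decoded mechanically, as in the small sketch below. The field order (N for objects, A for attributes, C for clusters, F for the functional-relationship identifier) is inferred from the single example N400A2C2F1, and the names `SyntheticSpec` and `parse_dataset_name` are hypothetical helpers, not part of the paper.

```python
import re
from typing import NamedTuple

class SyntheticSpec(NamedTuple):
    n_objects: int      # N: number of data objects
    n_attributes: int   # A: number of attributes
    n_clusters: int     # C: number of clusters
    function_id: int    # F: index of the functional relationship

# Pattern inferred from the example name "N400A2C2F1".
_NAME_PATTERN = re.compile(r"N(\d+)A(\d+)C(\d+)F(\d+)")

def parse_dataset_name(name: str) -> SyntheticSpec:
    """Split a synthetic dataset name into its four numeric fields."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    return SyntheticSpec(*(int(group) for group in match.groups()))

spec = parse_dataset_name("N400A2C2F1")
print(spec.n_objects, spec.n_attributes, spec.n_clusters, spec.function_id)
```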

Experiments on real-world datasets

In this section, four real-world datasets are used to further assess the validity of the PR-FCM algorithm. In these experiments, the parameters of the proposed algorithm are as follows: the fuzzification parameter $m$ is 2, the polynomial order of PR-FCM is 2, the threshold value $\varepsilon$ is $10^{-2}$, and the maximum number of iterations is 100.

Engineering application with the TBM operation dataset

In this section, the proposed algorithm is applied to an in situ TBM dataset obtained from a tunnel project in a city in China. The tunnel is 2000 m long and 6.4 m in diameter. From the ground surface to the tunnel floor, various geological layers, including clay, sand, and rock layers, are unevenly distributed. To excavate the tunnel, an earth pressure balance shield TBM is used; this TBM consists of a cutterhead, a chamber, a screw conveyor, a tail skin, and other auxiliary subsystems. The

Conclusion

In this paper, a new clustering algorithm referred to as PR-FCM is proposed based on PR and the FCM algorithm. The proposed algorithm is constructed under the FCM framework, but the utilized distance metric is based on the error between the real value of one attribute and that estimated by the PR model of each cluster. An alternating optimization method is designed to obtain the optimal data partitions. Considering that the in situ engineering data of different clusters usually overlap but

CRediT authorship contribution statement

Yong Pang: Methodology, Software, Formal analysis, Validation, Visualization, Investigation, Writing – original draft. Maolin Shi: Conceptualization, Methodology, Data curation. Liyong Zhang: Conceptualization, Writing – review & editing. Xueguan Song: Supervision, Writing – review & editing, Resources, Funding acquisition. Wei Sun: Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China [grant number 2018YFB1702502] and the National Natural Science Foundation of China [grant number 52075068].

References (49)

A. Lemos et al., Adaptive fault detection and diagnosis using an evolving fuzzy classifier, Inf. Sci. (2013)
W. Sun et al., Dynamic load prediction of tunnel boring machine (TBM) based on heterogeneous in-situ data, Autom. Constr. (2018)
X. Gao et al., Recurrent neural networks for real-time prediction of TBM operating parameters, Autom. Constr. (2019)
K. Zheng et al., Design of fuzzy system-fuzzy neural network-backstepping control for complex robot system, Inf. Sci. (2021)
S. Zeng et al., Fuzzy forecasting based on linear combinations of independent variables, subtractive clustering algorithm and artificial bee colony algorithm, Inf. Sci. (2019)
S. Aghabozorgi et al., Time-series clustering–a decade review, Inform. Syst. (2015)
A. Saxena et al., A review of clustering techniques and developments, Neurocomputing (2017)
M.A. Sanchez et al., Fuzzy granular gravitational clustering algorithm for multivariate data, Inf. Sci. (2014)
Z. Shi et al., FCM-RDpA: TSK fuzzy regression model construction using fuzzy C-means clustering, regularization, Droprule, and Powerball Adabelief, Inf. Sci. (2021)
M. Delić et al., Extended power-based aggregation of distance functions and application in image segmentation, Inf. Sci. (2019)
L. Junyan et al., Defects’ geometric feature recognition based on infrared image edge detection, Infrared Phys. Technol. (2014)
A. Rodríguez Ramos et al., A novel fault diagnosis scheme applying fuzzy clustering algorithms, Appl. Soft Comput. (2017)
A. Kouhi et al., Robust FCM clustering algorithm with combined spatial constraint and membership matrix local information for brain MRI segmentation, Expert Syst. Appl. (2020)
G. Zhao et al., Clustering of AE signals collected during torsional tests of 3D braiding composite shafts using PCA and FCM, Compos. B Eng. (2019)
L. Wang et al., Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Inf. Sci. (2020)
A. Gavioli et al., Identification of management zones in precision agriculture: an evaluation of alternative cluster analysis methods, Biosyst. Eng. (2019)
R.J. Hathaway et al., Relational duals of the c-means clustering algorithms, Pattern Recogn. (1989)
M.A. Khalilia, Improvements to the relational fuzzy c-means clustering algorithm, Pattern Recogn. (2014)
Y. Peng, Fuzzy graph clustering, Inf. Sci. (2021)
M. Shi, A fuzzy c-means algorithm based on the relationship among attributes of data and its application in tunnel boring machine, Knowl.-Based Syst. (2020)
M. Wedel et al., A fuzzy clusterwise regression approach to benefit segmentation, Int. J. Res. Mark. (1989)
R.J. Campello, A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment, Pattern Recogn. Lett. (2007)
J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms (2013)
C. Liu et al., LDS-FCM: A linear dynamical system based fuzzy C-means method for tactile recognition, IEEE Trans. Fuzzy Syst. (2019)


¹: Yong Pang and Maolin Shi contributed equally to this paper.
