1 Introduction

Fresh water available in rivers or dams contains micro-organisms and impurities that may be harmful to human health. Once captured, it is conveyed to the treatment plant, where all undesirable substances are removed so that the water meets the potability requirements defined by legislation [5, 12].

According to [11], the water treatment process has several physical-chemical parameters that must be monitored and controlled in order to guarantee the quality of the final product. One of the more complex subsystems is coagulation, which consists of destabilizing the dirt particles so that they are retained in the later stages of the water treatment plant [4].

The main parameters of the coagulation subsystem are the raw water quality and the parameters of the intermediate subsystems, in this order, which make it possible to measure the efficacy of the coagulation process. The reference dosage of coagulant in water is determined by bench-testing the treatment process or by measuring the electrical charge whenever the raw water quality scenario changes [2, 17].

Although the scenarios represent drought and rainy seasons, it is not easy to classify them, especially when the chemical dosage reference values may need to be adjusted during the transition between the periods that most affect the raw water quality [2].

One way to find possible scenarios in a water treatment plant is to use computational intelligence techniques, specifically data clustering techniques [6]. According to [13], the Self-Organizing Map (SOM), an artificial neural network architecture for clustering, showed good results in the detection of contamination problems in wellsprings and rivers, allowing the development of corrective measures to avoid health problems.

In [5], the SOM network was also used to show the possibility of detecting internal problems in the water treatment plant caused by equipment faults or wellspring conditions, making it possible to anticipate measures so that the quality of the treated water is kept within the established limits.

Several segments of the sanitation area have applied the SOM network to solve problems, with favorable results, such as [7] in the validation and reconstruction of information in databases containing incoherent values and [3] in the identification of water quality and process behavior.

Other techniques have been used in water treatment plants to study process behavior by clustering information. In [14], k-means was used as part of a study of a coagulant dosage prediction model, while [1] used the expectation-maximization technique to determine the appropriate number of clusters and their characteristics from information on lakes in Alberta, Canada.

In [16], an evolutionary algorithm and expectation-maximization were applied to improve the detection of pipe bursts and other events in water distribution systems. The results obtained showed that these strategies could improve event detection performance.

In [19], the authors proposed a polynomial function fitted to historic flow measurements based on a weighted least-squares method, in conjunction with an expectation-maximization algorithm, for automatic burst detection in U.K. water distribution networks. This approach can automatically select useful data from the historic flow measurements, which may contain normal and abnormal operating conditions in the distribution network, e.g., a water burst.

In [8] the EM technique was used in conjunction with a Bayesian model with the purpose of predicting leakage in water distribution networks, showing the possibility of minimizing the risks of water loss.

Thus, the objective of this work is to propose a clustering approach and to determine the relevance of each physical-chemical parameter, using the expectation-maximization technique and the ReliefF algorithm, respectively, on coagulation process data from a water treatment plant located in the metropolitan region of São Paulo, Brazil.

2 Theoretical Background

2.1 Expectation-Maximization (EM) Algorithm

The expectation-maximization algorithm can be considered a data mining technique. It aims to find the maximum-likelihood parameters of a model from a base of information considered incomplete, enabling applications such as pattern recognition with neural networks [10].

The basic idea of the algorithm is to represent a problem by two pieces of information, called x and y. The term Y can be considered the random vector of the observed data y, with probability density function denoted by \(g\left( y|\varphi \right) \), where \(\varphi \) = [\(\varphi _1\), ..., \(\varphi _d\)]\(^T\) represents the unknown parameter vector. The term X corresponds to the random vector of the complete data x, with probability density function \(g_{c}\left( x|\varphi \right) \). Thus, \(\varphi \) is the parameter that maximizes Eq. 1 [9].

$$\begin{aligned} \log L_{c}(\varphi )=\log g_{c}\left( x|\varphi \right) \end{aligned}$$
(1)

According to [9], there are two steps for each algorithm iteration, as follows:

  • Expectation or E-Step: aims to find the clusters’ probabilities:

    $$\begin{aligned} Q\left( \varphi |\varphi ^{\left( i\right) }\right) =E_{\varphi ^{\left( i\right) }}\left[ \log L_{c}\left( \varphi \right) |y\right] \end{aligned}$$
    (2)
  • Maximization or M-Step: corresponds to choosing \(\varphi ^{(i+1)}=M(\varphi ^{(i)})\) so that Eqs. 3 and 4 hold:

    $$\begin{aligned} Q\left( \varphi ^{\left( i+1\right) }|\varphi ^{\left( i\right) }\right) \ge Q\left( \varphi |\varphi ^{\left( i\right) }\right) \end{aligned}$$
    (3)
    $$\begin{aligned} M\left( \varphi ^{\left( i\right) }\right) ={\mathop {\hbox {arg max}}\limits _\varphi Q}\left( \varphi |\varphi ^{\left( i\right) }\right) \end{aligned}$$
    (4)

where i is the iteration index.

2.2 ReliefF Algorithm

The purpose of the ReliefF algorithm is to rank attributes according to how well they distinguish between instances that are near each other. The basic idea of its operation is to maintain an updated vector with the attributes’ quality estimates [15].

The algorithm starts by resetting the quality vector W[x]. It then randomly selects an instance R and searches for its nearest neighbors of the same class, called hits H, and of the other classes, called misses M. It then verifies how relevant each attribute is for separating these neighbors and, finally, updates the quality vector according to Eq. 5:

$$\begin{aligned} W\left[ x\right] =W\left[ x\right] -{\displaystyle {\textstyle \sum _{j=1}^{k}\frac{dist\left( x,R,H_{j}\right) }{m.k}}+\sum _{C\ne class(R)}\frac{\frac{P\left( C\right) }{1-P\left( class\left( R\right) \right) }\sum _{j=1}^{k}dist\left( x,R,M_{j}\right) }{m.k}} \end{aligned}$$
(5)

where C is the class identification, P(C) the class probability, \(1-P(class(R))\) the sum of the probabilities of the classes different from R’s class, and m the number of iterations.

The dist function calculates the difference between the values of attribute x in instance R and a neighbor (H or M), with a range from 0 to 1. After termination, the vector of weights W[x] covers all the attributes, and the attribute with the highest weight is considered the most relevant.
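The update rule of Eq. 5 can be sketched as follows for numeric attributes, normalizing each attribute by its range so that dist lies in [0, 1]. The toy data at the end are illustrative, not measurements from the studied plant.

```python
import numpy as np

def relieff(X, y, k=3, m=100, seed=0):
    """Minimal ReliefF sketch: update attribute weights W[x] as in Eq. 5."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)          # normalizes dist to [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))         # class probabilities P(C)
    W = np.zeros(d)                                # quality vector, reset to zero
    for _ in range(m):
        r = rng.integers(n)                        # random instance R
        diff = np.abs(X - X[r]) / span             # per-attribute dist to R
        d_total = diff.sum(axis=1)
        d_total[r] = np.inf                        # exclude R itself
        for c in classes:
            idx = np.where(y == c)[0]
            near = idx[np.argsort(d_total[idx])[:k]]  # k nearest in class c
            contrib = diff[near].sum(axis=0) / (m * k)
            if c == y[r]:                          # hits H: penalize separation
                W -= contrib
            else:                                  # misses M: reward separation
                W += prior[c] / (1 - prior[y[r]]) * contrib
    return W

# toy data: attribute 0 separates the classes, attribute 1 is pure noise
rng = np.random.default_rng(2)
X = np.vstack([np.column_stack([rng.normal(0, 0.3, 100), rng.uniform(0, 1, 100)]),
               np.column_stack([rng.normal(1, 0.3, 100), rng.uniform(0, 1, 100)])])
y = np.array([0] * 100 + [1] * 100)
W = relieff(X, y)
```

An attribute whose values differ between nearby instances of different classes (a miss) gains weight, while one that differs between nearby instances of the same class (a hit) loses weight, so the discriminative attribute ends with the highest W[x].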

3 Materials and Methods

3.1 Characteristics of the Studied Process

The Alto Cotia water treatment plant (WTP) has a nominal production capacity of 1.25 \(m^3/s\) and is located in the metropolitan region of São Paulo. This WTP has two coagulant dosage application points due to an expansion to increase treatment capacity, known as the dosing system by gravity and the dosing system by pumping. Data were collected from SABESP’s laboratory management system for the period from 2010 to 2012, totaling 6686 records. Table 1 shows the maximum and minimum values of each parameter.

Table 1. Maximum and minimum values of each parameter

The physical-chemical parameters that refer to the raw water quality are raw water turbidity and raw water color. The effectiveness of the coagulation process is measured by the physical-chemical parameters: clarified water turbidity, residual aluminum of clarified water, filtered water turbidity and residual aluminum of treated water. The coagulant dosage references are: coagulant dosage by gravity system and coagulant dosage by pumping system. These variables are considered outputs, as they are the values used in the water treatment process.

3.2 Computational Experiments

The computational experiments were carried out with WEKA software, version 3.8.10 x64 [18]. All the algorithms used in the work are available in this tool.

The experiments were divided into the following steps:

  • (i) Pre-processing of data: elimination of inconsistent and null values from the database collected from SABESP’s laboratory management system;
  • (ii) Use of the ReliefF algorithm on the pre-processed database: the attribute selection algorithm was applied to verify the importance of the collected data before any clustering, taking the coagulant dosage by gravity system and the coagulant dosage by pumping system as reference variables;
  • (iii) Data clustering: use of the EM algorithm, with the automatic cluster quantification function of the WEKA tool enabled. In this stage the data mining technique was applied;
  • (iv) Clusters’ selection: clusters were selected based on information representing optimized values of the coagulation process of the water treatment plant;
  • (v) Use of the ReliefF algorithm on the selected database: the attribute selection algorithm was applied again to the information contained in the selected clusters, once more taking the coagulant dosage by gravity system and by pumping system as reference variables;
  • (vi) Attribute selections’ comparison: the importance of the attributes was compared between the clustered database and the pre-processed database.
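Steps (i) and (iv) can be sketched in Python as follows. The actual experiments used WEKA; the small array below is hypothetical, and only the 6686/5118 record counts come from the results reported later in this work.

```python
import numpy as np

# (i) pre-processing: drop records containing null or inconsistent values
raw = np.array([[10.0, 5.0],
                [np.nan, 3.0],   # null value -> discarded
                [12.0, 4.5],
                [-1.0, 2.0]])    # inconsistent (negative) reading -> discarded
clean = raw[~np.isnan(raw).any(axis=1)]
clean = clean[(clean >= 0).all(axis=1)]

# (iv) cluster selection: keeping only the consistent clusters reduces the
# database; with the counts reported in Sect. 4 this gives a 23.45 % reduction
total, kept = 6686, 5118
reduction = 100 * (total - kept) / total
```

The same filtering logic generalizes to any number of physical-chemical parameters per record; only the validity rules per column change.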

4 Results and Discussions

The pre-processed database, with 6686 records, was processed by the WEKA clustering algorithm in 6965.12 seconds. WEKA generated 20 clusters for the submitted database. The number of records in each cluster and their percentages are presented in Table 2.

Table 2. Number of records per cluster

It can be seen in Table 2 that some clusters contain record counts that differ considerably from the mean, demonstrating the existence of several scenarios spread across the clusters, which may be related to the raw water quality or to the efficacy of the coagulation process as measured by the turbidity of the clarified and filtered water.

Table 3 shows the means and standard deviations of each attribute grouped by the clusters obtained with the EM algorithm. The data presented in Table 3 show that the clustering algorithm considered the raw water parameters and the dosages relevant for defining the clusters. Thus, it can be observed that clusters 0, 2, 6, 12, 13, 14 and 16, highlighted in red, present high dosage values in both systems (by gravity and by pumping) for very similar values of raw water turbidity and color, which represents inconsistent information.

Therefore, the data selected as optimized values of the coagulant dosage process are those in clusters 1, 3, 4, 5, 7, 8, 9, 10, 11, 15, 17, 18 and 19. Of the total of 6686 records, 5118 are used, representing a reduction of 23.45 % in the pre-processed database.

The relevance of the attributes, or parameters, was evaluated with the ReliefF algorithm on both the pre-processed and the mined data, to verify whether the significance of the attributes changed after the selection of clusters with optimized process values. Table 4 shows the most relevant parameters in relation to the dosing variables of the gravity and pumping systems.

Table 3. Mean values and standard deviations of each attribute and each cluster
Table 4. Relevance of physico-chemical parameters

The importance of the data in the coagulant dosing process without data mining shows, according to Table 4, that the quality of the raw water (highlighted in green) is fundamental for the control of the dosage. In addition, the dependence between the dosages of the two systems is noticed in both scenarios, i.e., in the pre-processing and clustering stages.

After applying the data mining methods, the parameters that monitor the efficacy of the coagulation process appear in the sequence of the treatment process, i.e., first the clarified water and subsequently the filtered and treated water (highlighted in bold). This shows that the process must be monitored and controlled according to the treatment steps, following the parameters of the clarified, filtered and treated water. Thus, contrary to what the pre-processed data suggest, controlling the clarified water will prevent problems from being transferred to the filtration process and consequently damaging the quality of the treated water.

5 Conclusion

The proposed approach showed that it is possible to select the optimized values of the process by means of clustering and to identify that the optimization done by the water treatment technicians is not uniform, evidencing that it relies on empirical evaluations.

Moreover, the obtained results indicate that it is possible to develop models using computational intelligence resources to obtain a more adequate training database, preventing undesirable data applied to the learning step of intelligent algorithms, e.g. artificial neural networks, from leading to wrong model outputs.

In further investigations, the information from the selected clusters can be submitted to metaheuristic classifiers to improve the data selection, which may be more representative than the choices made by expert evaluations of the water treatment process.