1 Introduction

Fresh water available in rivers or dams contains micro-organisms and impurities that may be harmful to human health. Once captured, it is conveyed to the treatment plant, where all undesirable substances are removed so that the water meets the potability requirements defined by legislation [5, 12].

According to [11], the water treatment process has several physical-chemical parameters that must be monitored and controlled in order to guarantee the quality of the final product. One of the more complex subsystems is coagulation, which consists of destabilizing the dirt particles so that they are retained in the later stages of the water treatment plant [4].

The main parameters of the coagulation subsystem are the raw water quality and the parameters of the intermediate subsystems, in this order, which make it possible to measure the efficacy of the coagulation process. The reference dosage of coagulant in water is determined by bench-testing the treatment process or by measuring the electrical charge whenever the raw water quality scenario changes [2, 17].

Although the scenarios represent drought and rainy seasons, it is not easy to classify them, especially when the chemical dosage reference values may need to be adjusted during the transition between the periods that most affect the raw water quality [2].

One way to find possible scenarios in a water treatment plant is to use computational intelligence techniques, specifically data clustering techniques [6]. According to [13], the Self-Organizing Map (SOM), an artificial neural network architecture for clustering, showed good results in the detection of contamination problems in wellsprings and rivers, allowing the development of corrective measures to avoid health problems.

In [5], the SOM network was also used to show the possibility of detecting internal problems in the water treatment plant caused by equipment faults or wellspring conditions, making it possible to anticipate measures so that the quality of the treated water is kept within the established limits.

Several segments of the sanitation area have applied the SOM network to solve problems, with favorable results, such as [7] in the validation and reconstruction of information in databases containing incoherent values and [3] in the identification of water quality and process behavior.

Other techniques have been used in water treatment plants to study process behavior by clustering information. In [14], k-means was used as part of a study of a coagulant dosage prediction model, while [1] used the expectation-maximization technique to determine the appropriate number of clusters and their characteristics from information on lakes in Alberta, Canada.

In [16], an evolutionary algorithm and expectation-maximization were applied to improve the detection of pipe bursts and other events in water distribution systems. The results obtained showed that these strategies could improve event detection performance.

In [19], the authors proposed a polynomial function fitted to historic flow measurements based on a weighted least-squares method, in conjunction with an expectation-maximization algorithm, for automatic burst detection in U.K. water distribution networks. This approach can automatically select useful data from the historic flow measurements, which may contain normal and abnormal operating conditions in the distribution network, e.g., a water burst.

In [8] the EM technique was used in conjunction with a Bayesian model with the purpose of predicting leakage in water distribution networks, showing the possibility of minimizing the risks of water loss.

Thus, the objective of this work is to propose a clustering approach and to determine the relevance of each physical-chemical parameter, using the expectation-maximization technique and the ReliefF algorithm, respectively, on coagulation process data from a water treatment plant located in the metropolitan region of São Paulo, Brazil.

2 Theoretical Background

2.1 Expectation-Maximization (EM) Algorithm

The expectation-maximization algorithm can be considered a data mining technique. It aims to find the maximum-likelihood parameters of a model from a base of information considered incomplete, enabling applications such as pattern recognition with neural networks [10].

The basic idea of the algorithm is to represent a problem by two pieces of information, called x and y. The term Y can be considered the random vector of the observed data y, with probability density function denoted by \(g\left( y|\varphi \right) \), where \(\varphi \) = [\(\varphi _1\), ..., \(\varphi _d\)]\(^T\) represents the unknown parameter vector. The term X corresponds to the random vector of the complete data x, with probability density function \(g_{c}\left( x|\varphi \right) \). Thus, \(\varphi \) is the parameter that maximizes Eq. 1 [9].

$$\begin{aligned} \log L_{c}(\varphi )=\log g_{c}\left( x|\varphi \right) \end{aligned}$$
(1)

According to [9], there are two steps for each algorithm iteration, as follows:

  • Expectation or E-Step: aims to find the clusters’ probabilities:

    $$\begin{aligned} Q\left( \varphi |\varphi ^{\left( i\right) }\right) =E_{\varphi ^{\left( i\right) }}\left[ \log L_{c}\left( \varphi \right) |y\right] \end{aligned}$$
    (2)
  • Maximization or M-Step: corresponds to choosing \(\varphi ^{(i+1)}=M(\varphi ^{(i)})\) so that Eqs. 3 and 4 hold:

    $$\begin{aligned} Q\left( \varphi ^{\left( i+1\right) }|\varphi ^{\left( i\right) }\right) \ge Q\left( \varphi |\varphi ^{\left( i\right) }\right) \end{aligned}$$
    (3)
    $$\begin{aligned} M\left( \varphi ^{\left( i\right) }\right) ={\mathop {\hbox {arg max}}\limits _\varphi Q}\left( \varphi |\varphi ^{\left( i\right) }\right) \end{aligned}$$
    (4)

where i is the iteration index.

2.2 ReliefF Algorithm

The purpose of the ReliefF algorithm is to rank attributes according to how well they distinguish between instances that are near each other. The basic idea of its operation is to maintain an updated vector with the attributes’ quality estimates [15].

The algorithm starts by resetting the quality vector W[x]. It then randomly selects an instance R and searches for its nearest neighbors of the same class, called hits H, and of the other classes, called misses M. It then verifies how relevant each attribute is for separating these neighbors and, finally, updates the quality vector according to Eq. 5:

$$\begin{aligned} W\left[ x\right] =W\left[ x\right] -{\displaystyle {\textstyle \sum _{j=1}^{k}\frac{dist\left( x,R,H_{j}\right) }{m.k}}+\sum _{C\ne class(R)}\frac{\frac{P\left( C\right) }{1-P\left( class\left( R\right) \right) }\sum _{j=1}^{k}dist\left( x,R,M_{j}\right) }{m.k}} \end{aligned}$$
(5)

where C is the class identification, P(C) the class probability, \(1-P(class(R))\) the sum of the probabilities of the classes different from R’s class, and m the number of iterations.

The dist function calculates the difference between the values of attribute x in instance R and a neighbor (H or M), with a range from 0 to 1. After termination, the vector of weights W[x] covers all the attributes, and the attribute with the highest weight is considered the most relevant.
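The update rule of Eq. 5 can be sketched as follows for numeric attributes, normalizing each attribute by its range so that dist lies in [0, 1]. The toy data at the end are illustrative, not measurements from the studied plant.

```python
import numpy as np

def relieff(X, y, k=3, m=100, seed=0):
    """Minimal ReliefF sketch: update attribute weights W[x] as in Eq. 5."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)          # normalizes dist to [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))         # class probabilities P(C)
    W = np.zeros(d)                                # quality vector, reset to zero
    for _ in range(m):
        r = rng.integers(n)                        # random instance R
        diff = np.abs(X - X[r]) / span             # per-attribute dist to R
        d_total = diff.sum(axis=1)
        d_total[r] = np.inf                        # exclude R itself
        for c in classes:
            idx = np.where(y == c)[0]
            near = idx[np.argsort(d_total[idx])[:k]]  # k nearest in class c
            contrib = diff[near].sum(axis=0) / (m * k)
            if c == y[r]:                          # hits H: penalize separation
                W -= contrib
            else:                                  # misses M: reward separation
                W += prior[c] / (1 - prior[y[r]]) * contrib
    return W

# toy data: attribute 0 separates the classes, attribute 1 is pure noise
rng = np.random.default_rng(2)
X = np.vstack([np.column_stack([rng.normal(0, 0.3, 100), rng.uniform(0, 1, 100)]),
               np.column_stack([rng.normal(1, 0.3, 100), rng.uniform(0, 1, 100)])])
y = np.array([0] * 100 + [1] * 100)
W = relieff(X, y)
```

An attribute whose values differ between nearby instances of different classes (a miss) gains weight, while one that differs between nearby instances of the same class (a hit) loses weight, so the discriminative attribute ends with the highest W[x].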

3 Materials and Methods

3.1 Characteristics of the Studied Process

The Alto Cotia water treatment plant (WTP) has a nominal production capacity of 1.25 \(m^3/s\) and is located in the metropolitan region of São Paulo. This WTP has two coagulant dosage application points due to an expansion to increase treatment capacity, known as the dosing system by gravity and the dosing system by pumping. Data were collected from SABESP’s laboratory management system for the period from 2010 to 2012, totaling 6686 records. Table 1 shows the maximum and minimum values of each parameter.

Table 1. Maximum and minimum values of each parameter

The physical-chemical parameters that refer to the raw water quality are raw water turbidity and raw water color. The effectiveness of the coagulation process is measured by the physical-chemical parameters: clarified water turbidity, residual aluminum of clarified water, filtered water turbidity and residual aluminum of treated water. The coagulant dosage references are: coagulant dosage by gravity system and coagulant dosage by pumping system. These variables are considered outputs, as they are the values used in the water treatment process.

3.2 Computational Experiments

The computational experiments were carried out with WEKA software, version 3.8.10 x64 [18]. All the algorithms used in the work are available in this tool.

The experiments were divided into the following steps:

  • (i) Pre-processing of data: elimination of inconsistent and null values from the database collected from SABESP’s laboratory management system;
  • (ii) Use of the ReliefF algorithm on the pre-processed database: the attribute selection algorithm was applied to verify the importance of the collected data before any clustering, taking the coagulant dosage by gravity system and the coagulant dosage by pumping system as reference variables;
  • (iii) Data clustering: use of the EM algorithm, with the automatic cluster quantification function of the WEKA tool enabled. In this stage the data mining technique was applied;
  • (iv) Clusters’ selection: clusters were selected based on information representing optimized values of the coagulation process of the water treatment plant;
  • (v) Use of the ReliefF algorithm on the selected database: the attribute selection algorithm was applied again to the information contained in the selected clusters, once more taking the coagulant dosage by gravity system and by pumping system as reference variables;
  • (vi) Attribute selections’ comparison: the importance of the attributes was compared between the clustered database and the pre-processed database.
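Steps (i) and (iv) can be sketched in Python as follows. The actual experiments used WEKA; the small array below is hypothetical, and only the 6686/5118 record counts come from the results reported later in this work.

```python
import numpy as np

# (i) pre-processing: drop records containing null or inconsistent values
raw = np.array([[10.0, 5.0],
                [np.nan, 3.0],   # null value -> discarded
                [12.0, 4.5],
                [-1.0, 2.0]])    # inconsistent (negative) reading -> discarded
clean = raw[~np.isnan(raw).any(axis=1)]
clean = clean[(clean >= 0).all(axis=1)]

# (iv) cluster selection: keeping only the consistent clusters reduces the
# database; with the counts reported in Sect. 4 this gives a 23.45 % reduction
total, kept = 6686, 5118
reduction = 100 * (total - kept) / total
```

The same filtering logic generalizes to any number of physical-chemical parameters per record; only the validity rules per column change.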

4 Results and Discussions

The pre-processed database, with 6686 records, was processed by the WEKA clustering algorithm in 6965.12 seconds. WEKA generated 20 clusters for the submitted database. The number of records in each cluster and their percentages are presented in Table 2.

Table 2. Number of records per cluster

It can be seen in Table 2 that some clusters contain record counts that differ considerably from the mean, demonstrating the existence of several scenarios spread across the clusters, which may be related to the raw water quality or to the efficacy of the coagulation process as measured by the turbidity of the clarified and filtered water.

Table 3 shows the means and standard deviations of each attribute grouped by the clusters obtained with the EM algorithm. The data presented in Table 3 show that the clustering algorithm considered the raw water parameters and the dosages relevant for defining the clusters. Thus, it can be observed that clusters 0, 2, 6, 12, 13, 14 and 16, highlighted in red, present high dosage values in both systems (by gravity and by pumping) for very similar values of raw water turbidity and color, which represents inconsistent information.

Therefore, the data selected as optimized values of the coagulant dosage process are those in clusters 1, 3, 4, 5, 7, 8, 9, 10, 11, 15, 17, 18 and 19. Of the total of 6686 records, 5118 are used, representing a reduction of 23.45 % in the pre-processed database.

The relevance of the attributes, or parameters, was evaluated with the ReliefF algorithm on both the pre-processed and the mined data, to verify whether the significance of the attributes changed after the selection of clusters with optimized process values. Table 4 shows the most relevant parameters in relation to the dosing variables of the gravity and pumping systems.

Table 3. Mean values and standard deviations of each attribute and each cluster
Table 4. Relevance of physico-chemical parameters

The importance of the data in the coagulant dosing process without data mining shows, according to Table 4, that the quality of the raw water (highlighted in green) is fundamental for the control of the dosage. In addition, the dependence between the dosages of the two systems is noticed in both scenarios, i.e., in the pre-processing and clustering stages.

After applying the data mining methods, the parameters that monitor the efficacy of the coagulation process appear in the sequence of the treatment process, i.e., first the clarified water and subsequently the filtered and treated water (highlighted in bold). This shows that the process must be monitored and controlled according to the treatment steps, following the parameters of the clarified, filtered and treated water. Thus, contrary to what the pre-processed data suggest, controlling the clarified water will prevent problems from being transferred to the filtration process and consequently damaging the quality of the treated water.

5 Conclusion

The proposed approach showed that it is possible to select the optimized values of the process by means of clustering and to identify that the optimization done by the water treatment technicians is not uniform, evidencing that it relies on empirical evaluations.

Moreover, the obtained results indicate that it is possible to develop models using computational intelligence resources to obtain a more adequate training database, preventing undesirable data applied to the learning step of intelligent algorithms, e.g. artificial neural networks, from leading to wrong model outputs.

In further investigations, the information from the selected clusters can be submitted to metaheuristic classifiers to improve the data selection, which may be more representative than the choices made by expert evaluations of the water treatment process.