Advanced turbidity prediction for operational water supply planning
Introduction
Turbidity can be defined as the “optical quality [of water] that causes light to be scattered and absorbed rather than transmitted in straight lines through a sample” [1, p.200]. It can also be understood to be “the cloudiness of water caused by suspended particles such as clay and silts, chemical precipitates such as manganese and iron, and organic particles such as plant debris and organisms” [2, p.3].
While turbidity itself does not present a hazard to human health, it can be an indication of poor water quality. Furthermore, high levels of turbidity present during the treatment of raw water can limit the effectiveness of filtration and chlorination processes designed to remove dangerous bacteria and parasites such as Cryptosporidium [1]. It is therefore recommended by the World Health Organisation (WHO) that turbidity should not exceed a level of 1 Nephelometric Turbidity Unit (NTU) before chlorination [3].
Turbidity (NTU) levels can change slowly over time due to changes in water catchments as part of an underlying trend, but they can also peak rapidly over shorter periods, sometimes appearing random. Peaks in turbidity are linked to environmental events such as heavy rainfall but can also be a result of operational activities like pumping. Inherent solution features at a site, such as fissures within the aquifer, can also lead to turbidity events [2].
Peaks in turbidity (NTU) present a significant challenge to the operation of a drinking water company. Turbidity is a naturally occurring phenomenon and somewhat inevitable; however, for a drinking water supplier there are many operational interventions that affect its ability to continue to supply potable water. Depending on the treatment works, there may be varying degrees of treatment activity used to reduce turbidity. The works most resilient to turbidity will likely include a system of filters and settling tanks that remove sediment before chlorination, but even these sites with more complex processes have a limited capacity before treatment must be suspended for cleaning and maintenance. In response to short-term outages, a water supplier may rely on storage reservoirs, alternative treatment works or, more likely, a combination of both. The challenge, however, is that turbidity peaks can occur rapidly, so these mitigating activities must be actionable immediately: storage reservoirs will need sufficient supplies, and alternative sources will need to be able to meet the additional requirement caused by outages. Failure to do so will leave the company unable to meet its demand, or allow water that is not fit for consumption to enter supply; either outcome would be damaging for a water supplier and its customers.
We propose a decision support system that provides drinking water suppliers with seven days' notice of a turbidity event, allowing time for remedial actions to be prepared in advance of short-term outages caused by turbidity peaking events.
Our first objective is to explore the causes of daily turbidity (NTU) peaking events by identifying candidate predictor variables, which we test across each of the six sites. We use this to confirm relevant variables from the literature, but also to explore further how operational features such as pumping activity impact turbidity levels at a treatment works. We apply a static correlation analysis and a dynamic cross-correlation analysis that considers time lags for some variables.
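As an illustration of the dynamic analysis, a lagged cross-correlation can be computed by shifting a candidate predictor against turbidity. The sketch below uses synthetic data; names such as `rainfall` are placeholders, not the study's variables:

```python
import numpy as np

def lagged_cross_correlation(x, y, max_lag):
    """Pearson correlation between x lagged by 0..max_lag steps and y.
    A peak at lag k suggests x leads y by roughly k days."""
    n = len(x)
    return {lag: float(np.corrcoef(x[: n - lag], y[lag:])[0, 1])
            for lag in range(max_lag + 1)}

# Synthetic example: "turbidity" responds to "rainfall" two days later.
rng = np.random.default_rng(0)
rainfall = rng.gamma(2.0, 1.0, size=500)
turbidity = np.roll(rainfall, 2) + rng.normal(0.0, 0.1, size=500)

corrs = lagged_cross_correlation(rainfall, turbidity, max_lag=5)
best_lag = max(corrs, key=corrs.get)   # lag with the strongest correlation
```

A peak in the correlation profile at lag k suggests the predictor leads turbidity by roughly k days, which is one way lagged candidate variables can be short-listed for the models.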
Our second objective is to assess the effectiveness of models from the field of machine learning for the prediction of >1 NTU events. We use a Generalised Linear Model (GLM) and a non-linear Random Forest (RF) model to predict turbidity across the six treatment works. Using both a linear and a non-linear model allows us to assess the impact of any non-linearity that may exist in the data; furthermore, they represent different aspects of the trade-off between complexity and predictive capability [4]. We use the AUROC score to assess the performance of the models; models with a score greater than 0.70 are considered satisfactory. The causation analysis is complemented using the Variable Importance outputs from the GLM and the RF. We also review the cut-off probability points for event classification.
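As a hedged sketch of this model comparison (synthetic data; in scikit-learn terms, a GLM for a binary >1 NTU outcome corresponds to a logistic regression):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a site's data: predictors -> binary >1 NTU event.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                 # e.g. lagged rainfall, pumping
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 1000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

glm = LogisticRegression().fit(X_tr, y_tr)     # linear benchmark
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

auc_glm = roc_auc_score(y_te, glm.predict_proba(X_te)[:, 1])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

Comparing the two AUROC scores against the 0.70 satisfaction threshold mirrors the evaluation logic; any gap between the RF and the GLM hints at non-linearity in the data.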
To address the research problem, we first review how machine learning has been applied in the literature to predict a range of other water quality parameters, and identify causal factors in turbidity peaking. We then illustrate the behaviour of turbidity peaking across the six sites and use the static and dynamic correlation analyses to determine candidate variables for the models. We consider the results in three parts: 1) the performance of the GLM and RF models is reviewed using the AUROC metric, 2) the Variable Importance outputs of the models are examined to understand the multivariate nature of turbidity prediction, complementing the earlier correlation analysis, and 3) we use a cost-based approach to define the cut-off probability points for event classification at each of the sites. We conclude by reflecting on the general findings for turbidity causation compared to those of the literature and assess the viability of an operational turbidity prediction model as a decision support system.
Section snippets
Background
Techniques from the field of Machine Learning have been applied to solve a wide range of event prediction problems [[5], [6], [7], [8]]. In this section, we seek to understand how statistical and machine learning tools and techniques have been applied to understand and solve a variety of challenges surrounding water quality. Some of the research is focused upon causal analysis, while other research attempts to predict and model systems to be tested under different conditions; this, in turn, can
Data description
Turbidity, the dependent variable, is obtained for each of the six sites from the telemetry system of the water company. The turbidity (NTU) level is recorded at least every 15 min using apparatus located at the water treatment works. For this paper, only the daily maximum NTU level is required, as it always reveals whether turbidity has exceeded 1 NTU on a given day. The record for most sites extends back to at least 1 November 2007, up to the point of extraction on 15 September 2017. Therefore,
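The reduction of 15-minute telemetry to a daily maximum, and the derivation of the >1 NTU event flag, can be sketched with pandas on mock data (all values and names below are illustrative):

```python
import numpy as np
import pandas as pd

# Mock 15-minute telemetry covering two days (values are illustrative).
idx = pd.date_range("2017-09-01", periods=192, freq="15min")
ntu = pd.Series(0.8 * np.abs(np.sin(np.linspace(0, 8, 192))),
                index=idx, name="ntu")
ntu.iloc[100] = 1.4                      # inject a brief peak on the second day

daily_max = ntu.resample("D").max()      # daily maximum NTU level
event = (daily_max > 1.0).astype(int)    # >1 NTU event flag per day
```

Taking the daily maximum preserves even a single 15-minute exceedance, which is why it suffices to detect whether the 1 NTU threshold was breached on any given day.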
Modelling
For the prediction of >1 NTU turbidity events, we apply a Random Forest (RF) and a Generalised Linear Model (GLM).
Several parameters require tuning in an RF model; the number of variables available for each split, the number of trees to include, and a cost rule by which predictions are made in the training of each tree.
As recommended by Breiman [20], we use the square root of the total number of variables, along with this figure halved and doubled.
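Illustratively, the three mtry candidates (the square root of the number of predictors, halved and doubled) can be generated and searched as a small grid; this is a sketch on synthetic data, not the authors' implementation:

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

n_features = 16
base = max(1, round(math.sqrt(n_features)))              # Breiman's default mtry
mtry_grid = sorted({max(1, base // 2), base, min(n_features, base * 2)})

# Hypothetical data standing in for the site predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, n_features))
y = (X[:, 0] > 0).astype(int)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": mtry_grid},              # sklearn's name for mtry
    scoring="roc_auc",
    cv=3,
).fit(X, y)
```

With 16 predictors this yields candidates of 2, 4 and 8 variables per split, and the grid search selects among them by cross-validated AUROC.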
The number of trees used in the model is referred to as
Model performance
We present the AUROC performance of the RF and GLM models at each of the six sites in Table 4. We use a randomly selected 25%/75% test/train split with 10-fold cross-validation and 10 repeats. For the cross-validation results, we present the mean AUROC across the 100 samples and the standard deviation.
At five of six sites (Site-A, Site-B, Site-C, Site-D, Site-E), we obtained an AUROC score of over 0.80 in the holdout sample suggesting that these models have a ‘good’ discriminative ability.
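The repeated cross-validation scheme, yielding 100 AUROC samples per model, can be sketched as follows (synthetic data, illustrative only; the real evaluation used the sites' telemetry):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical features standing in for the site predictors.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(0, 0.3, 300) > 0).astype(int)

# 10-fold cross-validation repeated 10 times -> 100 AUROC samples.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, scoring="roc_auc", cv=cv,
)
mean_auc, sd_auc = scores.mean(), scores.std()  # summary statistics per model
```

Reporting the mean and standard deviation over the 100 folds gives a sense of both central performance and its stability across resamples.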
Implementation of the decision support system
So far we have considered the performance of the models across the six sites in terms of AUROC performance. We now consider the probability, from 0.00 to 1.00, at which the decision support system positively classifies an event and the water company takes mitigating steps. Mitigating actions might include a decision to ensure that the reservoirs are full before an event, or committing personnel to test the equipment at an alternative site while pumping at the site in question is temporarily
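A cost-based cut-off of this kind can be sketched by scanning candidate probabilities and minimising expected misclassification cost; the costs and example values below are placeholders, not the company's figures:

```python
import numpy as np

def best_cutoff(y_true, p_event, cost_fp, cost_fn):
    """Return the probability cut-off minimising total misclassification cost.
    cost_fp: cost of a needless intervention; cost_fn: cost of a missed event."""
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = [cost_fp * np.sum((p_event >= t) & (y_true == 0)) +
             cost_fn * np.sum((p_event < t) & (y_true == 1))
             for t in thresholds]
    return float(thresholds[int(np.argmin(costs))])

# Toy example: a missed event costs 10x a false alarm, pushing the cut-off down.
y = np.array([0, 0, 0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])
cutoff = best_cutoff(y, p, cost_fp=1.0, cost_fn=10.0)
```

Because missed turbidity events are far costlier than precautionary reservoir filling, such an asymmetry drives the operational cut-off well below 0.50.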
Causation
The first aim of this paper was to identify the potential variables causing daily turbidity (NTU) peaking events at groundwater sources for a water company operating on the South Coast of England. Several approaches have been used to understand the cause of turbidity (NTU) at six water sources on the South Coast of England: a static correlation analysis, a dynamic correlation analysis and an assessment of variable importance. We sought to confirm the findings of the hydrological literature
Acknowledgments
We would like to acknowledge the anonymous company who provided the data.
This work was supported by the Economic and Social Research Council [grant number ES/P000673/1]; and The Alan Turing Institute under the EPSRC [grant number EP/N510129/1].
Matthew Stevenson is a PhD student in the Southampton Business School at the University of Southampton. He completed an MSc in Business Analytics & Management Science at the University of Southampton in 2017. His main research interests are predictive analytics, deep learning and natural language processing.
References (27)
- et al., Development and application of consumer credit scoring models using profit-based classification measures, European Journal of Operational Research (2014)
- et al., A decision support system for predictive police patrolling, Decision Support Systems (2015)
- et al., Early detection of network element outages based on customer trouble calls, Decision Support Systems (2015)
- et al., A data mining based system for credit-card fraud detection in e-tail, Decision Support Systems (2017)
- et al., A novel approach for automated credit card transaction fraud detection using network-based extensions, Decision Support Systems (2015)
- et al., Modeling the relationship between land use and surface water quality, Journal of Environmental Management (2002)
- et al., Investigating transport properties and turbidity dynamics of a karst aquifer using correlation, spectral, and wavelet analyses, Journal of Hydrology (2006)
- et al., Water Supply (1994)
- Water Quality and Health - Review of Turbidity: Information for Regulators and Water Suppliers (2017)
- Guidance on the Implementation of the Water Supply (Water Quality) Regulations 2000 (As Amended) in England (March 2012)
- Occurrence of Giardia and Cryptosporidium spp. in surface water supplies, Applied and Environmental Microbiology
- The importance of lake-specific characteristics for water quality across the continental United States, Ecological Applications
- Neural network and genetic programming for modelling coastal algal blooms, International Journal of Environment and Pollution
Dr Cristián Bravo is Associate Professor of Business Analytics. His research focuses on the development and application of data science methodologies in the context of credit risk analytics, covering areas such as deep learning, text analytics, image processing, and social network analysis. Dr Bravo has an extensive publication list in journals and international conferences, covering multiple topics in data science and analytics. Dr Bravo is also an editorial board member of the journal Applied Soft Computing, the official journal of the World Federation on Soft Computing, and of the Journal of Business Analytics published by the UK's Operational Research Society.