Detecting outliers in a univariate time series dataset using unsupervised combined statistical methods: A case study on surface water temperature

https://doi.org/10.1016/j.ecoinf.2022.101672

Highlights

  • Proposed three outlier detection algorithms for surface water temperature measurements.

  • An increase in outlier detection might reduce the precision of identifying the actual outliers.

  • Combined statistical-based models outperform individual models.

Abstract

Surface water temperature is a vital ecological and climate variable, and monitoring it is critical. An extensive sensor network measures the ocean, but outliers pervade the monitoring data because of sudden changes in the water surface level, and no single algorithm can identify these outliers efficiently. Hence, this work proposes and evaluates the performance of three statistical outlier detection algorithms for surface water temperature: 1) the Standard Z-Score method, 2) the Modified Z-Score coupled with decomposition, and 3) the Exponential Moving Average coupled with the Modified Z-Score and decomposition. A threshold was set to flag the outlier values, and the models' performance was evaluated using the F-score. Results showed that an increase in outlier detection might reduce the precision of identifying the actual outliers. The Exponential Moving Average with the Modified Z-Score gave the highest F-score (0.83) compared with the other two individual methods. Therefore, this proposed algorithm is recommended for detecting outliers efficiently in large surface water temperature datasets.

Introduction

Oceans cover about 70% of the Earth's surface and are the planet's most significant thermal reservoirs. Ocean temperature has a notable impact on marine ecosystems' environmental status and biogeochemical processes, including dissolved oxygen, nutrients, and contaminants (Doney et al., 2012). For instance, the eutrophication process depends on water temperature and can produce significant amounts of nitrous oxide, a critical greenhouse gas (Marzadri et al., 2013). Water temperature is also an essential aspect of ocean health: it directly affects the amount of dissolved oxygen in water, which is vital for the survival of marine organisms. Water temperature is particularly critical for cold-water species, which cannot adapt to sudden environmental changes (Chang and Lawler, 2011).

The advancement of data logging technologies has opened new doors for the public, industry, and academia in ocean and coastal monitoring. Temperature monitoring is critical for managing ocean water chemistry, species control, and climate change monitoring, and scientists can now collect massive amounts of data on ocean temperature. Outliers and missing values are common in large datasets (Cho et al., 2013). As a result, detecting anomalies or outliers is critical to improving data reliability (Rettig et al., 2015). Outliers can be caused by environmental changes, such as weather, or by human error (Chandola et al., 2009). Accurate anomaly detection is a significant challenge in data analysis and environmental applications, such as identifying abnormal climatic conditions caused by global warming (Çelik et al., 2011). The data can be used for further analysis only after successful pre-processing; thus, anomaly detection is vital in data analysis (Rettig et al., 2015). Efficient pre-processing reduces the time needed to review data and remove outliers compared with manual review.

A consensus does not exist on how to differentiate between anomalies and outliers. The following quote is frequently used to demonstrate the equivalence of anomalies and outliers: “Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature” (Aggarwal, 2017). On the other hand, some definitions consider anomalies and noise part of the same category as outliers (Salgado et al., 2016). Others regard outliers as data corruption, while anomalies are abnormal data points within a pattern (Günnemann et al., 2014). Anomalies are distinguished by two essential characteristics (Grubbs, 1969; Hawkins, 1980):

  1. The anomalies' distribution differs significantly from the overall distribution of the data.

  2. Most of the data points in the dataset are normal; the anomalies make up only a relatively small portion of the overall dataset.

We use the terms interchangeably because, in our data, an outlier caused by a sensor malfunction, for example when a dropping water level exposes the sensor, makes data points veer from the expected behavior. Cho et al. (2013) have suggested that outliers might occur for various reasons, including a lack of equipment control, an inadequate power supply, and malfunctioning sensors. According to Grubbs (1969), anomalies have two characteristics: 1) they differ from the norm, and 2) they are rare in the dataset compared with standard data points. Anomaly detection identifies data points that deviate from the expected behavior of the dataset. Building upon that definition, Chandola et al. (2009) classify anomalies into three categories: 1) point anomalies, where a single data point is abnormal with respect to the rest of the data; 2) contextual anomalies, where a data point is odd only in a specific context, defined using contextual and behavioral attributes; and 3) collective anomalies, where a collection of related data points is abnormal with respect to the entire dataset, even though the individual points may not be anomalous on their own.

Cho et al. (2013) found that an advanced pre-processing method is required for large datasets. They used two steps to remove outliers from their coastal ocean temperature data. First, they divided the time series into an approximation component (the trend) and a complex component (the residual) using harmonic analysis; they also suggested other smoothing techniques, including regression, kernel smoothing, and moving averages, for producing the approximation component. Then, they applied an established outlier detection method, a Modified Z-Score with a 95% confidence level, to the residual component to remove the outliers. The outliers were labeled using expert judgment.
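As a concrete illustration of this two-step strategy, the sketch below decomposes a temperature series into trend, seasonal, and residual components and flags residual values whose Modified Z-Score exceeds a threshold. The use of statsmodels' seasonal_decompose, the hourly period, and the 3.5 cut-off are illustrative assumptions, not the exact configuration used by Cho et al. (2013).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def modified_zscore(x):
    """Modified Z-Score based on the median and the median absolute deviation."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    return 0.6745 * (x - median) / mad

def flag_outliers(series, period=24, threshold=3.5):
    """Decompose the series, then flag residuals with |Modified Z| > threshold."""
    decomposition = seasonal_decompose(series, model="additive", period=period,
                                       extrapolate_trend="freq")
    residual = decomposition.resid.dropna()
    scores = modified_zscore(residual.values)
    return residual.index[np.abs(scores) > threshold]

# Synthetic hourly water temperature with an injected spike
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=240, freq="H")
temp = pd.Series(29 + np.sin(np.arange(240) * 2 * np.pi / 24)
                 + rng.normal(0, 0.1, 240), index=idx)
temp.iloc[100] += 5            # artificial sensor error
print(flag_outliers(temp))     # should report the spiked timestamp
```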

Understanding the difference between temporal and non-temporal data is critical for detecting outliers, especially in univariate data, regardless of whether machine learning or statistical models are employed. Temporal data, such as ocean temperature, are not independent: each observation relates to prior data points in the time series. In non-temporal data, by contrast, the observations are independent. Many techniques have been proposed to detect outliers in such time series datasets (Gupta et al., 2014). According to Braei and Wagner (2020), three different approaches are available for detecting anomalies in univariate time series. The first is to use statistical models, such as the Autoregressive (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and ARIMA models. The second is to use machine learning techniques, such as K-Means Subsequence Time-Series Clustering (STSC), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Local Outlier Factor (LOF), Extreme Gradient Boosting (XGBoost), and Isolation Forest. The third is neural network techniques, such as the Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), Residual Neural Networks (ResNet), and WaveNet.

In the pre-processing step, two methods were applied: 1) standardizing the data, so that the mean is zero and the standard deviation is one; and 2) detrending the data, which is used chiefly in statistical approaches to make the time series stationary. The time series does not need to be stationary for machine learning or deep learning methods.
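A minimal sketch of these two pre-processing steps is shown below, assuming first-order differencing as the detrending technique (one common choice rather than the paper's exact procedure):

```python
import numpy as np
import pandas as pd

def standardize(series):
    """Rescale the series to zero mean and unit standard deviation."""
    return (series - series.mean()) / series.std()

def detrend_by_differencing(series):
    """First-order differencing, a simple way to make a series stationary."""
    return series.diff().dropna()

# Synthetic series with a slow warming trend plus noise
temp = pd.Series(29 + 0.01 * np.arange(500) + np.random.normal(0, 0.2, 500))
z = standardize(temp)
stationary = detrend_by_differencing(temp)
print(round(z.mean(), 6), round(z.std(), 6))   # approximately 0 and 1
```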

Two metrics can evaluate the models: the F-score, which is the harmonic mean of precision and recall, and the area under the curve (AUC), which summarizes the relationship between the true positive rate and the false positive rate across various thresholds.
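For concreteness, a minimal example of computing both metrics with scikit-learn, assuming hypothetical binary outlier labels, continuous anomaly scores, and an arbitrary 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical ground-truth labels (1 = outlier) and model anomaly scores
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.7, 0.2, 0.1, 0.4])

y_pred = (scores > 0.5).astype(int)          # flag points above the threshold
print("F-score:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, scores))
```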

Anomaly detection can be performed using supervised, semi-supervised, and unsupervised techniques. Anomaly detection is frequently applied to “unlabeled” data, which is known as unsupervised anomaly detection, as reported by Goldstein and Uchida (2016). Hair et al. (2010) describe outliers in terms of sample size: in a dataset with fewer than 80 samples, a data point is an outlier if it lies more than 2.5 standard deviations from the mean; for more than 80 samples, the cut-off can be broader, up to four standard deviations.
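The sample-size-dependent rule attributed to Hair et al. (2010) can be written as a simple check; the helper below is a hypothetical sketch that uses the 2.5 and 4 standard-deviation cut-offs quoted above.

```python
import numpy as np

def sd_outlier_mask(x, small_cutoff=2.5, large_cutoff=4.0):
    """Flag values beyond k standard deviations from the mean, where k is 2.5
    for fewer than 80 observations and 4.0 otherwise (Hair et al., 2010)."""
    x = np.asarray(x, dtype=float)
    k = small_cutoff if x.size < 80 else large_cutoff
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > k

data = np.append(np.random.normal(29, 0.3, 200), 40.0)   # one gross error
print(np.where(sd_outlier_mask(data))[0])                # index of the flagged value
```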

Ray et al. (2018) applied six statistical anomaly detection models, the Poisson Changepoint Model, ARIMA, Seasonal Hybrid Extreme Studentized Deviate, HDC, Bayesian Changepoint, and E-Divisive with Median (EDM), to failures in Clinical Decision Support systems. The study shows that four of the six models (Bayesian Changepoint, ARIMA, Poisson Changepoint, and HDC) identified anomalies reliably. Çelik et al. (2011) applied a DBSCAN model to detect anomalies in time-series data and compared it with a statistical model. Because of the dataset's temporal characteristics, a pre-processing step was applied to remove seasonality before using the DBSCAN algorithm. They found that the DBSCAN algorithm could identify anomalies even when they are not extreme values.

Breaker et al. (2005) applied Chauvenet's test, a statistical model for detecting outliers, to time series of daily sea surface temperatures, using a threshold of 2.7 standard deviations from the mean. Jingang et al. (2017) applied two methods for detecting outliers in a continuous time series of ocean observation data: data quality control and Dixon detection theory. Their study shows that Dixon detection performed better than the Cook's distance diagnostic statistic for outlier detection, and Dixon detection was also recommended for the quality control of complex oceanic observation data. Arabelos et al. (2001) used matrix time series analysis to identify daily water level and temperature outliers; the algorithm found 36 of 40 outliers in the water temperature data and 97 of 130 outliers in the water level data.

Statistical models generally performed better than machine learning and deep learning models (Braei and Wagner, 2020). Braei and Wagner also recorded the computational time of each algorithm and found that the moving average and autoregressive models, which are statistical models, were the fastest, while deep learning methods took considerably longer. The study by Chandola et al. (2009) shows that, despite the growth of machine learning methods and deep neural networks, statistical methods that model the data-generating process perform better for detecting anomalies in univariate time series. Besides identifying outliers more accurately than machine learning and deep learning algorithms, statistical methods are faster and require less training and prediction time. Moreover, the mathematics and optimization are simpler in statistical models than in deep learning methods because of the small number of hyper-parameters (Braei and Wagner, 2020). Statistical methods are mathematically justified, and once a model has been built, outliers can be detected without storing the original dataset, using only a minimal amount of information that describes the model (Petrovskiy, 2003).

The coastal ocean temperature dataset is univariate, and researchers have found that statistical approaches are the most efficient unsupervised algorithms for detecting outliers in such time series (Braei and Wagner, 2020). Hence, this work proposes and assesses a novel outlier detection algorithm that combines statistical methods and univariate analyses to identify outliers in coastal ocean temperature datasets. Visual inspection of the data is not feasible because of the large volume of current and incoming data; therefore, an online database was required to implement the outlier detection algorithm. The following sections describe the data used to develop the algorithm and the technological infrastructure that facilitates the process; we include the discussion of the infrastructure to illustrate the practicality of adopting the proposed outlier detection algorithm. In the subsequent sections, we report the algorithm's performance and summarize the work in the conclusion.
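To make the combined approach concrete before describing the data, the sketch below chains the ingredients named above: seasonal decomposition, Exponential Moving Average smoothing of the residual, and the Modified Z-Score. The smoothing span and the 3.5 threshold are illustrative assumptions rather than the tuned values reported later in the paper.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def ema_modified_zscore_outliers(series, period=24, span=12, threshold=3.5):
    """Decompose the series, smooth the residual with an exponential moving
    average, and flag points whose deviation from the smoothed residual has a
    Modified Z-Score above the threshold."""
    resid = seasonal_decompose(series, model="additive", period=period,
                               extrapolate_trend="freq").resid.dropna()
    ema = resid.ewm(span=span, adjust=False).mean()     # exponential moving average
    deviation = resid - ema
    mad = (deviation - deviation.median()).abs().median()
    scores = 0.6745 * (deviation - deviation.median()) / mad
    return scores.index[scores.abs() > threshold]

# Synthetic hourly temperature with a simulated sensor-surfacing drop
rng = np.random.default_rng(1)
idx = pd.date_range("2021-01-01", periods=240, freq="H")
temp = pd.Series(29 + np.sin(np.arange(240) * 2 * np.pi / 24)
                 + rng.normal(0, 0.1, 240), index=idx)
temp.iloc[150] -= 4
print(ema_modified_zscore_outliers(temp))   # the drop is flagged (nearby points may also appear)
```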

Section snippets

The site, sensor, and dataset

A micrometeorological tower located at the Centre for Marine and Coastal Studies (CEMACS), Teluk Aling, measured various meteorological parameters, such as atmospheric temperature and relative humidity, water temperature, and carbon dioxide and moisture flux. The data are used to analyze the carbon and water cycles of the tropical coastal ocean ecosystem. The tower is situated at coordinates 5.49°N, 100.20°E, on the northwest coast of the island of Pulau Pinang in the national

Descriptive statistics

Exploratory data analysis was applied to understand the water temperature dataset better. Exploratory data analysis is the essential early examination of data to discover patterns, identify anomalies, test hypotheses, and evaluate assumptions using descriptive statistics and visualizations. It helps analyze the dataset by summarizing its critical characteristics and extracting the most significant insights.

The statistical description of the water temperature

Conclusions

Outliers are common in surface water temperature datasets due to sudden changes in water level and sensor malfunction. These extreme values require subsequent data analysis to evaluate their legitimacy. Because of its efficiency, the statistical class of methods was used to identify data outliers.

Three algorithms for identifying the outliers are proposed and assessed. First, the Standard Z-Score method was used as a baseline for comparison with the modified methods. A Modified Z-Score coupled with

Declaration of Competing Interest

None.

Acknowledgments

We acknowledge the Ministry of Education Malaysia for awarding us the Malaysian Research University Network Long-Term Research Grant Scheme (MRUN-LRGS), grant number 203.PTEKIND.6777006, which made this research possible. The data are available at https://figshare.com/articles/dataset/Atmosphere_Interaction_Research/13906643 and at https://atmosfera.usm.my/api.html. The Python code for the algorithms is available on GitHub at https://github.com/AtmosferaUSM/outlier-detection-water-temperature

References (50)

  • V. Chandola et al., Anomaly detection: a survey (2009)

  • H. Chang et al., Impacts of Climate Variability and Change on Water Temperature in an Urbanizing Oregon Basin (2011)

  • S. Chauhan et al., Anomaly detection in ECG time signals via deep long short-term memory networks

  • H.Y. Cho et al., Outlier detection and missing data filling methods for coastal water temperature data, J. Coast. Res. (2013)

  • P. Čisar et al., Optimization methods of EWMA statistics, Acta Polytech. Hungarica (2011)

  • A.A. Cook et al., Anomaly detection for IoT time-series data: a survey, IEEE Internet Things J. (2020)

  • H.A. Dau et al., Anomaly detection using replicator neural networks trained on examples of one class

  • E.M. Dogo et al., Sensed outlier detection for water monitoring data and a comparative analysis of quantization error using Kohonen self-organizing maps

  • S.C. Doney et al., Climate change impacts on marine ecosystems, Annu. Rev. Mar. Sci. (2012)

  • J. Gao et al., RobustTAD: robust time series anomaly detection via decomposition and convolutional neural networks

  • M. Goldstein et al., A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLoS One (2016)

  • F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics (1969)

  • N. Günnemann et al., Robust multivariate autoregression for anomaly detection in dynamic product ratings

  • M. Gupta et al., Outlier detection for temporal data: a survey

  • J.F. Hair et al., Multivariate Data Analysis: A Global Perspective (2010)