Predicting air quality with deep learning LSTM: Towards comprehensive models
Introduction
In recent decades, air quality has gained attention because of the health threats posed by high levels of environmental pollution (Ozkaynak et al., 2009). In this study, air quality covers both chemical pollutants and biotic factors present in the environment. Chemical pollutants are agents released into the environment that disrupt ecosystems, such as CO, O3, NO2, SO2 and PM10, which are also the main chemical air pollutants in the studied region (Querol et al., 2012). Biotic factors refer to airborne concentrations of Plantago and Poaceae pollen, which are the most common and aggressive in terms of allergic and respiratory disorders (Subiza et al., 1995).
Air quality information systems are increasingly used to predict future air pollution levels. Such forecasts make it possible to warn of peaks in admissions at clinical institutions, to manage traffic and the environment in urban areas, and to minimize patients' exposure in order to prevent adverse effects (Abraham et al., 2009; González et al., 2001; Linares and Díaz, 2008; Ozkaynak et al., 2009).
Field experts have employed observation-based models to relate pollutant records to one or more variables that can be measured or predicted, usually meteorological data (Navares and Aznarte, 2016; Sabariego et al., 2012; Schaber and Badeck, 2003; Silva-Palacios et al., 2016; Smith and Emberlin, 2006). Despite the extensive literature, few studies consider both types of pollutants together, as they are inherently different problems: atmospheric pollen concentrations depend on plant development during previous seasons which, in turn, depends on the climatological conditions during plant evolution (Cannell and Smith, 1983; Smith and Emberlin, 2006). This implies long and mid-term relations between past atmospheric conditions and current plant status. In contrast, chemical air pollutant levels are related to recent past atmospheric conditions (Navares et al., 2018). Both kinds of pollutant influence the development and attack of, for instance, allergic respiratory diseases (D'Amato et al., 2011).
Neural network models have been successfully applied to environmental modeling (Gardner and Dorling, 1998) and air quality problems (Castellano-Méndez et al., 2005; Chaloulakou et al., 1998; Chelani et al., 2002; Grivas and Chaloulakou, 2006; Iglesias-Otero et al., 2015). However, given the nature of the problem (short-term influential variables for chemical pollutants and mid to long-term influential variables for pollen), this approach requires thorough research and selection of relevant variables based on expert knowledge (Andersen, 1991; Catalano et al., 2016; Navares and Aznarte, 2016). In addition to the temporal dimension, it is important to take into account the spatial interactions between observation stations, as they might be implicitly related. Such approaches require a new research process, which might add a new set of influential variables every time a new kind of pollutant or a new genus of pollen is taken into consideration by the system.
In this paper we propose several long short-term memory (LSTM) network setups (Hochreiter and Schmidhuber, 1997) to gain insight into how influential network design is when dealing with interrelated time series of different nature. The study compares network topologies in order to find the most suitable configuration for the problem, exposing the advantages and disadvantages of each one. The objective is twofold. On the one hand, we show how to avoid thorough preprocessing steps to find influential features (both long and short term, and with differences at each location that result from the particular environmental conditions of the areas where the observation stations are located) by letting the network extract them regardless of the pollutant type. Such a unified approach avoids manually fitting one specific model per pair of location and pollutant, saving human resources and increasing the scalability of the system. On the other hand, we provide a convenient network topology for accurate forecasts at each location which is able to extract relevant temporal and spatial information from the data. The problem chosen to prove the validity of the proposal is the prediction of air quality over a dataset which consists of a dozen time series with different characteristics and regimes, sampled over 13 adjacent locations.
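As a rough illustration of how several interrelated daily series could be arranged as input for a recurrent network such as an LSTM, the sketch below cuts a multivariate series into sliding windows of past days paired with the next-day value of one target column. The helper name `make_windows`, the `lookback` length and the toy values are assumptions made for this example; the excerpt does not state how the authors framed their inputs.

```python
from typing import List, Tuple

# One row per day, one column per (location, pollutant) feature.
Series = List[List[float]]


def make_windows(
    data: Series, lookback: int, target_col: int
) -> Tuple[List[Series], List[float]]:
    """Turn a daily multivariate series into (window, next-day target) pairs."""
    if lookback < 1 or lookback >= len(data):
        raise ValueError("lookback must be between 1 and len(data) - 1")
    inputs: List[Series] = []
    targets: List[float] = []
    for end in range(lookback, len(data)):
        inputs.append(data[end - lookback:end])  # the `lookback` days before day `end`
        targets.append(data[end][target_col])    # value to forecast on day `end`
    return inputs, targets


# Toy data: 6 days, 2 features (hypothetical values, not from the study).
toy = [[float(day), float(day) * 10.0] for day in range(6)]
X, y = make_windows(toy, lookback=3, target_col=1)
print(len(X), len(X[0]), len(X[0][0]))  # samples, timesteps, features
```

The resulting nested lists follow the (samples, timesteps, features) layout that most LSTM implementations expect once converted to an array.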
Data description
Chemical air pollutants were measured using the gravimetric method or an equivalent method (β-attenuation) and were provided by the Madrid Municipal Air Quality Monitoring Grid (http://www.mambiente.munimadrid.es/). The grid consists of a network of 24 urban background stations spread across the city, which capture chemical air pollutants in real time. Hourly data were aggregated to obtain daily mean levels of chemical air pollutants for the study period from 01-01-2001 to 31-12-2013. Daily
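The hourly-to-daily aggregation mentioned in this section can be sketched as below. This is only a minimal illustration: the function name `daily_means` and the sample readings are hypothetical, and because the excerpt does not say how missing hours were handled, the sketch simply averages whatever readings exist for each calendar day.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import fmean
from typing import Dict, Iterable, List, Tuple


def daily_means(hourly: Iterable[Tuple[datetime, float]]) -> Dict[date, float]:
    """Average hourly readings into one mean value per calendar day."""
    by_day: Dict[date, List[float]] = defaultdict(list)
    for timestamp, value in hourly:
        by_day[timestamp.date()].append(value)
    # Sort by day so the result is in chronological order.
    return {day: fmean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly readings for one station and one pollutant.
readings = [
    (datetime(2001, 1, 1, 0), 0.4),
    (datetime(2001, 1, 1, 1), 0.6),
    (datetime(2001, 1, 2, 0), 0.3),
]
result = daily_means(readings)
print(result)
```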
Results
Table 2 shows prediction errors for each pollutant at each location, while Table 2 shows the average prediction errors for each pollutant. Linear regression obtains an average RMSE of 0.107 across all locations for carbon monoxide (CO), while Random Forest reduces this error to 0.086. All LSTM configurations outperform Random Forest except SP-LSTM, which yields an average RMSE of 0.093, mainly due to the error at Farolillo, where it underperforms with an RMSE of 0.127.
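For reference, RMSE follows the usual definition, $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The sketch below computes it per location and then takes a plain mean over locations. Treating the reported average as a simple unweighted mean is an assumption of this example, and the numbers used are made up rather than taken from the paper's tables.

```python
import math
from typing import Dict, Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def average_rmse(per_location: Dict[str, float]) -> float:
    """Unweighted mean of per-location RMSE values (assumed averaging scheme)."""
    return sum(per_location.values()) / len(per_location)


# Hypothetical observations and forecasts for two stations.
errors = {
    "Farolillo": rmse([0.5, 0.4, 0.6], [0.4, 0.5, 0.6]),
    "Arturo Soria": rmse([0.3, 0.2], [0.3, 0.2]),
}
print(errors, average_rmse(errors))
```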
Discussion
As shown in Section 3, there is statistical evidence that GP-LSTM, IGP-LSTM and FC-LSTM outperform the other proposed methods. This is clear for CO, NO2 and PM10, where the error reduction is higher than 10% at all locations except Arturo Soria when forecasting CO. Looking closely at this location, we observe a yearly average concentration of 0.40 μg/m3 with a standard deviation of 0.24 μg/m3, while this average goes over 0.47 μg/m3 with a
Conclusions
This paper presents a comparison of different LSTM configurations in order to identify the one most suitable for forecasting air quality in the region of Madrid. Several pollutants showing very different behaviors were taken into consideration. In addition to the intrinsic differences in behavior among pollutant types, each pollutant behaves differently at each location because of the conditions of the zones studied, which depend on factors such as traffic congestion or green areas. This adds an extra
References (37)
- et al. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Comput. Stat. Data Anal. (2018)
- et al. Improving the prediction of air pollution peak episodes generated by urban transport networks. Environ. Sci. Pol. (2016)
- et al. Artificial neural networks (the multilayer perceptron) a review of applications in the atmospheric sciences. Atmos. Environ. (1998)
- et al. Artificial neural network models for prediction of pm10 hourly concentrations, in the greater area of Athens, Greece. Atmos. Environ. (2006)
- et al. Statistical behavior of ozone in urban environment. Sustain. Environ. Res. (2016)
- et al. Allergenic pollen pollinosis in madrid. J. Allergy Clin. Immunol. (1995)
- et al. Short-term forecasting of emergency inpatient flow. Inf. Technol. Biomed. (2009)
- A model to predict the beginning of the pollen season. Grana (1991)
- Random forests. Machine Learning (2001)
- et al. Thermal time, chill days and prediction of budburst in Picea sitchensis. J. Appl. Ecol. (1983)
- Artificial neural networks as a useful tool to predict the risk level of Betula pollen in the air. Int. J. Biometeorol.
- Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. Sci. Total Environ.
- Prediction of sulphur dioxide concentration using artificial neural networks. Environ. Model. Softw.
- Nonparametric methods
- For the WAO Special Committee on Climate Change, Allergy, climate change, migration, and allergic respiratory diseases: an update for the allergist. World Allergy Organ. J.
- Short term effects of pollen species on hospital admissions in the city of Madrid in terms of specific causes and age. Aerobiologia
- The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc.
- Manual de Calidad y Gestión de la Red Española de Aerobiología