1 Introduction

Exposure to air pollution is a risk factor for diseases such as stroke, asthma, cancer and chronic obstructive pulmonary disease. It is associated with noxious effects on human health and it is especially harmful to vulnerable groups such as children, the elderly and patients with respiratory and cardiovascular diseases [1,2,3,4]. Air pollutants severely impact not only public health but also the climate and ecosystems, because several of them are greenhouse gases [1, 5]. Many sources of air pollutants are also sources of greenhouse gas emissions. Considering the importance of air quality for health and the environment, the World Health Organization (WHO) has developed guidelines to improve air quality by setting limits on the concentrations of various air pollutants: ozone (O3), nitrogen dioxide (NO2), sulphur dioxide (SO2), carbon monoxide (CO), particulate matter with size less than or equal to 2.5 \(\upmu \)m (PM2.5) and particulate matter with size less than or equal to 10 \(\upmu \)m (PM10) [5].

Air quality monitoring is an important task for governments to provide information on potential health risks and determine appropriate environmental management policies. The development of air pollution sensing technology in the last few decades and the support of government agencies have contributed to building air quality monitoring networks in many urban areas with the aim of analysing and publishing the concentrations of several air pollutants that are potentially harmful to health. However, these networks are usually sparsely distributed, and sensor calibration problems lead to missing and wrong measurements [6,7,8,9]. There is an increasing interest in developing air quality modelling methods to minimize measurement errors, predict spatial and temporal air quality, and support more spatially-resolved health effect analyses [8,9,10,11,12,13].

Air pollution modelling follows two different approaches. The first approach consists in using the deterministic mathematical modelling of atmospheric pollutant dispersion. The second approach, on the other hand, consists in employing statistical models based on historical air quality data, and, in some cases, meteorological and geographic information too.

The deterministic mathematical modelling involves the simulation of pollutant dispersion and transport mechanisms using emission values in industrial and urban areas, physical and chemical processes in the atmosphere, meteorological data, and geographic and topological information. The deterministic methods that are most present in the scientific literature are the “Community Multiscale Air Quality (CMAQ) model” [14, 15], “Weather Research and Forecasting model with Chemistry (WRF-Chem)” [16], and “Nested Air Quality Prediction Modelling System (NAQPMS)” [17, 18]. The deterministic modelling has limitations due to the enormous number of pollution sources and the fact that air distribution is influenced by several complex physical/chemical processes that require many variables.

The statistical approach takes advantage of the spatial and temporal correlations that are present in the air pollution concentration time series and formulates models that simulate these dependencies with a high degree of accuracy. Several methodologies have been developed along this approach, including classical statistics [19], artificial intelligence [8, 9, 11,12,13, 20, 21] and geostatistical techniques [6, 7, 22].

Artificial Neural Networks (ANNs) have frequently been used to forecast air quality. Some recent articles [7, 13, 20, 21, 23,24,25,26] use the historical values of various pollutants to predict the air quality index and/or air pollutant concentrations. Several of them use meteorological data too. Machine and deep learning methods show a remarkable ability to simulate non-linear systems because of their self-learning, self-organizing, and self-adaptation features.

In contrast, fewer studies have used the ANN technique for the spatial estimation of air pollution [6, 8, 11, 12, 27]. The study [8] evaluates the use of a Back-Propagation Neural Network (BPNN) for modelling the spatial atmospheric pollution of five air pollutants (NO2, O3, SO2, PM2.5 and PM10). The authors of [6] proposed machine learning and geostatistical methods to predict PM2.5 pollution levels. Some of these studies applied deep learning methods to extract complex and non-linear spatiotemporal correlations [11, 12]. The next section further describes the references about the use of machine learning and deep learning methods for predicting air pollution.

The main objective of this research was to develop an ANN-based system for modelling the spatial characteristics of air pollutant concentrations measured at an urban air quality monitoring network. To estimate the air pollutant value at the target site based on the measurements collected at nearby locations, we applied three Feed-Forward Neural Network (FFNN) architectures for regression (with one, two, and three fully connected hidden layers), a Support Vector Machine (SVM) and geostatistical methods. The evaluation of the methods was performed by using the historical values of seven air pollutants (Nitrogen monoxide (NO), NO2, O3, SO2, CO, PM2.5, and PM10) collected at the urban air quality monitoring network located at the greater metropolitan area of Madrid (Spain). For assessing and comparing the predictive ability of the models, three estimation accuracy indicators were calculated: the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE) and the coefficient of determination (R²).

The main contributions of this work include the following:

  • A broad analysis of the most common pollutants evaluated on a real dataset consisting of samples obtained from the air quality monitoring network deployed in the city of Madrid (Spain).

  • A FFNN-based air pollution spatial estimation system able to accurately predict NO2, NO, SO2, O3, PM2.5 and PM10 concentrations.

  • A systematic comparison between two geostatistical models, SVM, and three FFNN architectures (one-hidden FFNN, bi-layer FFNN, and tri-layer FFNN) for evaluating the prediction of the concentration of seven air pollutants (NO2, NO, CO, SO2, O3, PM2.5 and PM10).

The present work is structured as follows. The following section reviews literature related to machine learning and deep learning methods applied to air quality prediction. Section 3 presents the study area and the air pollution dataset used in the experiments. In Section 4, the methods applied to estimate air pollutant concentrations are described: Inverse distance weighted (IDW), Ordinary Kriging (OK), SVM, FFNN. In Section 5, every phase of the experiments is presented. Section 6 describes the results of the experiments performed. In Section 7 we discuss our results. Finally, the conclusions are presented in Section 8.

2 Related work

Machine learning methods are widely used to extract non-linear correlations in air pollutant concentration data. Deploying air pollution monitoring stations in urban areas generates a massive amount of collected data, creating databases suitable for statistical analysis. In addition, machine learning methods do not require a deep understanding of the dynamic and chemical processes between air pollutants and other related atmospheric variables.

ANNs have been improved through years of research and applications, bringing more evolved versions to air pollution prediction. For example, [27] proposed a BPNN to estimate the hourly concentrations of NO2 in unsampled locations at Algeciras Bay (Spain) using the historical values of NO2 concentrations at fourteen monitoring stations and the distances to each monitoring station. The prediction system has a first stage that applies the IDW interpolation and a multiple linear regression method to produce air pollution maps that are used as input of the BPNN. The highest result achieved was an \(R^{2}\) value of 0.76. They also evaluated the methods separately, showing the BPNN as the best prediction method in most monitoring stations. In contrast, [6] used several methods to obtain daily estimates of PM2.5 concentration across the contiguous US, and the results showed a better predictive performance of spatial statistical models over machine learning methods. The 829 monitoring stations take measurements every 1, 3, or 6 days, with only approximately 15% of monitors sampling daily, implying an irregular and sparse dataset. In other recent research, [8] applied a BPNN for spatially estimating pollutant concentrations in the metropolitan city of Athens in Greece. Five pollutants were estimated, NO2, O3, PM10, PM2.5 and SO2, and the \(R^{2}\) values for O3 and PM10 were above 0.87, and for NO2, PM2.5, and SO2 were 0.76, 0.69 and 0.55, respectively.

Deep learning has become increasingly widely used in air quality prediction because of its ability to extract complex and non-linear spatiotemporal correlations on large datasets. Many researchers employed a Long Short-Term Memory (LSTM) network for modelling the complex and non-linear temporal correlation of the historical values of the pollutants [9, 20, 25, 28, 29]. LSTM is an enhanced version of the Recurrent Neural Network for handling long-time sequence data. The recent studies [11, 24, 30] have proposed combined models for air quality prediction based on Convolutional Neural Networks (CNNs) to extract the spatial characteristics, and LSTMs to predict future air pollution concentrations. [24] developed a CNN-LSTM method for predicting the next day’s daily average PM2.5 concentration in Beijing City. [30] proposed an attention-based CNN-LSTM multilayer structure to predict the PM2.5 concentration in the next 72 hours in the Beijing-Tianjin-Hebei region. This research analysed the historical air quality and meteorological data of 100 monitoring stations for spatiotemporal correlation. [11] proposed spatiotemporal forecasting models of Beijing’s Air Quality Index. Four methods (CNN, LSTM, CNN-LSTM, BPNN) were evaluated to extract the spatiotemporal characteristics of air quality concentration data (hourly PM2.5, PM10, SO2, NO2, O3, and CO) and the relations with meteorological and spatiotemporal data. The method that showed the best performance in next-hour forecasting was the CNN-LSTM method. [12] developed a spatiotemporal air quality prediction model based on LSTMs. The input data were Beijing’s historical concentrations of PM2.5, SO2, NO2, O3, and CO and meteorological data. The output was the concentration sequence of PM2.5, CO, NO2, O3, and SO2 at four monitoring sites. The model’s prediction accuracy is high, as shown by the best \(R^{2}\) value of the four analysed sites: 0.939 for PM2.5, 0.847 for CO, 0.875 for NO2, 0.935 for O3, and 0.809 for SO2.

Some researchers took advantage of the CNN’s ability to process sequence-structure data for air pollutant concentration prediction. For example, [31] used a five-layer CNN to extract the temporal correlation from historical observation data and predict the ozone concentration in the next 24 hours in an urban area. [7] applied the IDW method to interpolate air quality and weather data collected in South Korea and then used the interpolation as input of the CNN to predict PM2.5 and PM10 concentrations. The results show an effective prediction performance with an \(R^{2}\) higher than 0.97.

In our research, we have deployed and compared six air quality prediction methods present in the recent literature: spatial-based interpolation (IDW), geostatistical model (Kriging), machine learning (SVM and FFNN), and deep learning (FFNN with two and three hidden layers). Each method is evaluated for extracting spatial dependencies of seven air pollutants: NO2, NO, CO, SO2, O3, PM2.5, and PM10. Therefore, we have broadly compared different methods to predict the concentrations of several air pollutants that are generated by various sources and that have different behaviours. The analysis of so many air pollutants is rare in the literature. The urban area selected has an air quality monitoring network with a spatial density and a sampling rate higher than or similar to those of recent studies.

3 Study area and dataset

The area of study is the city of Madrid, which is the capital of Spain and the largest and most populated metropolitan area of the country. The population of Madrid’s province has grown from 6.446 million in 2016 to 6.751 million in 2021. The elevation at its centre (40\(^{\circ }\) 25’ N, 3\(^{\circ }\) 41’ W) is 657 m. The expected mean temperature of the Madrid area ranges from 9.8 \(^{\circ }\)C in January to 32.1 \(^{\circ }\)C in July, with cold winters and hot summers. Spring and autumn are the seasons with the most expected rainy days, while the summer months are usually dry and sunny.

Madrid’s air pollution levels are high, although they have been effectively reduced since air quality policies came into force in 2011 [32]. That study analysed hourly time series of four air pollutants (NO2, O3, PM10, and SO2) monitored in the Madrid urban area from 2001 to 2017 by a two-stage method: first, a Hidden Markov Model was used to characterize the air pollution at temporal scales. Then, the spatial distribution was analysed by combining the interpolation results of Ordinary Kriging and Inverse Distance Weighting. [32] concludes that the air pollution spatial analysis is challenging to assess due to meteorological and physical factors and the regional contributions originating in adjacent municipalities. Not only is human activity responsible for bad air quality, but other climate events like Saharan dust intrusions also have an impact by raising PM levels [32, 33]. The research [34] examined the effects of local road traffic, meteorological conditions, and temporal variables on air pollution in Madrid. Its results showed that air pollutant levels were weakly linked to local vehicular emissions because various elements affect the pollutant concentration, mainly meteorological agents, topography, tree and shrub presence, building distribution, and water streams like rivers.

Fig. 1 Area of study and position of the different air monitoring stations. Map image is obtained from Google Maps

This study uses hourly air quality data measured between January 2016 and August 2018 by the air monitoring network of Madrid. Madrid’s city council has operated an air quality monitoring network since 2001 and publishes both real-time and historical air quality data. We used the dataset “Air Quality in Madrid”, which contains the processed data from the files offered by Madrid’s city council, organized by timestamp and in a standard data format. It consists of a file for each year where each row is timestamped and the columns are the different measurements performed at that point in time at a certain station. In addition, the information regarding each station (identifier, name, address, coordinates and elevation) is available in another file. The measurements of many pollutants are available (NO, NO2, O3, SO2, CO, PM2.5, PM10, toluene, benzene, methane), but not every station is equipped with all air pollution sensors, whose number has been increasing over the years.

The study area and the distribution of 24 air quality monitoring stations deployed in Madrid are shown in Fig. 1. The maximum distance between two stations is 7 kilometres, and the average distance between stations at the urban centre is 3 kilometres. Table 1 contains the coordinates, elevation and the air pollutants measured at each air quality monitoring station between 2016 and 2018. We selected the most harmful air pollutants according to WHO and several government agencies like the U.S. Environmental Protection Agency, the European Environment Agency, and Health Canada and Environment Canada. The criteria for the selection of the monitoring sites are the hourly data availability along with homogeneous spatial data coverage for each air pollutant.

Table 2 presents the measuring units and the range of measured concentrations for each pollutant between 2016 and 2018.

Table 1 Air quality monitoring stations, coordinates, and measured pollutants (from 2016 to 2018)
Table 2 Measuring units and range of measured concentrations for each pollutant (years 2016 to 2018)

4 Methods

In this work, we present a system based on FFNN architecture for regression to predict the air pollutant concentration in a specific location based on the measurements obtained from nearby monitored locations. To compare our proposal, we applied the following geostatistical and machine learning methods: Inverse Distance Weighting (IDW), Ordinary Kriging (OK), and Support Vector Machine (SVM).

4.1 Inverse distance weighting and kriging methods

Nearly all spatial interpolation methods share the same general estimation formula, which is as follows:

$$\begin{aligned} Z(x_{0}) = \sum _{i=1}^{n}(w_{i} z(x_{i})) \end{aligned}$$
(1)

where Z is the estimated value at the point of interest \(x_{0}\), z is the observed value at the sampled point \(x_{i}\), \(w_{i}\) is the weight assigned to the sampled point, and n represents the number of sampled points used for the estimation. The difference between the methods depends on the formula to calculate the weights. The two most commonly used interpolation methods in the literature are IDW and OK [35, 36]. The IDW method uses the following expression for the weight:

$$\begin{aligned} w_{i} = \frac{\frac{1}{d_{i}^{p}}}{\sum _{j=1}^{n}\frac{1}{d_{j}^{p}}} \end{aligned}$$
(2)

where \(d_{i}\) is the distance between \(x_{0}\) and \(x_{i}\), and p is an exponent that determines the influence of the values closest to the interpolated point. The weight for OK, in contrast, is estimated by minimizing the variance of the prediction errors. It is assumed that the data are part of a realization of an intrinsic random function z(x) with the sample variogram [37]. The sample variogram is fitted with specific known positive definite functions. The most common functions are linear, spherical, exponential, and Gaussian.
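As an illustration of (1) and (2), the following minimal Python sketch computes an IDW estimate with weights proportional to \(1/d_{i}^{p}\). It is not the implementation used in the experiments: the function name `idw_estimate`, the planar coordinates and the sample values are assumptions made only for this example (the study works with latitudes and longitudes).

```python
import math


def idw_estimate(target, stations, values, p=2):
    """Inverse distance weighted estimate at `target`.

    `target` and each entry of `stations` are (x, y) points in a planar
    coordinate system (an assumption of this sketch); `values` holds the
    concentration observed at each station.
    """
    raw_weights = []
    for (x, y), value in zip(stations, values):
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            # The target coincides with a station: return its observation.
            return value
        raw_weights.append(1.0 / d ** p)
    total = sum(raw_weights)
    # Normalised weights w_i = (1 / d_i^p) / sum_j (1 / d_j^p), applied to z(x_i).
    return sum(w * v for w, v in zip(raw_weights, values)) / total


# Hypothetical stations at distances 1, 2 and 4 from the origin.
stations = [(0.0, 1.0), (2.0, 0.0), (0.0, -4.0)]
values = [10.0, 20.0, 40.0]
estimate_p1 = idw_estimate((0.0, 0.0), stations, values, p=1)
estimate_p2 = idw_estimate((0.0, 0.0), stations, values, p=2)
```

With p=2 the nearest station dominates more strongly than with p=1, so `estimate_p2` lies closer to the value observed at the nearest station.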

In our experiments, we tested the IDW model for p=1 and p=2 and, for the implementation of the OK method, we applied four function models for fitting the sample variogram: spherical, exponential, Gaussian and bounded linear.
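The four variogram models named above can be sketched as follows. This is one common parameterization, given only for orientation: the parameter names `nugget`, `sill` (partial sill) and `rng` (range) are assumptions of this sketch, and some texts scale the exponential and Gaussian ranges differently (for example with a factor of 3).

```python
import math


def spherical(h, nugget, sill, rng):
    """Spherical model: reaches nugget + sill exactly at h = rng."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return nugget + sill
    r = h / rng
    return nugget + sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, sill, rng):
    """Exponential model: approaches nugget + sill asymptotically."""
    if h == 0.0:
        return 0.0
    return nugget + sill * (1.0 - math.exp(-h / rng))


def gaussian(h, nugget, sill, rng):
    """Gaussian model: parabolic behaviour near the origin."""
    if h == 0.0:
        return 0.0
    return nugget + sill * (1.0 - math.exp(-((h / rng) ** 2)))


def bounded_linear(h, nugget, sill, rng):
    """Bounded linear model: grows linearly up to h = rng, then stays flat."""
    if h == 0.0:
        return 0.0
    return nugget + sill * min(h / rng, 1.0)


# Hypothetical parameters: nugget 0.1, partial sill 0.9, range 3 km.
gamma_at_range = [
    model(3.0, 0.1, 0.9, 3.0)
    for model in (spherical, exponential, gaussian, bounded_linear)
]
```

In an OK workflow, the chosen model would be fitted to the sample variogram and then used to build the kriging system whose solution gives the weights in (1); that step is omitted here.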

4.2 Support vector machine

SVM is a popular machine learning tool for classification, but it can also be used for regression analysis [38, 39]. SVMs aim to provide a nonlinear function to map a given training data set D: \(\{(x_{1}, y_{1}), (x_{2}, y_{2}),...,(x_{i}, y_{i})\}\) to a high-dimensional feature space. In this space, a hyperplane is optimized to be within a certain threshold of the selected data, called the support vectors, and the hyperplane is then used to predict the response.

A linear SVM with the epsilon-insensitive (\(\epsilon \)) loss, also known as L1 loss, was used for regression. The set of training data included predictor variables and observed response values. The goal was to find a function z(x) that deviates from the observed values by no more than \(\epsilon \) for each training point x, and that is as flat as possible at the same time. The SVM with the epsilon-insensitive loss function was trained by using quadratic programming to minimise the objective function.
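To make the epsilon-insensitive loss concrete, the sketch below evaluates it for a linear model and combines it with a flatness term into a primal objective. It only evaluates the objective and does not solve the quadratic program. The penalty constant `c`, the function names and the sample data are assumptions introduced for this illustration and do not come from the experiments.

```python
def linear_predict(weights, bias, x):
    """Linear model z(x) = <w, x> + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Residuals smaller than epsilon cost nothing; larger ones cost linearly."""
    return [max(0.0, abs(o - p) - epsilon) for o, p in zip(observed, predicted)]


def svr_objective(weights, bias, samples, targets, epsilon, c=1.0):
    """Flatness term 0.5 * ||w||^2 plus c times the summed loss.

    `c` is a hypothetical penalty constant, not a value from the study.
    """
    predictions = [linear_predict(weights, bias, x) for x in samples]
    losses = epsilon_insensitive_loss(targets, predictions, epsilon)
    flatness = 0.5 * sum(w * w for w in weights)
    return flatness + c * sum(losses)


# Hypothetical samples: two nearby-station readings per row, one target value.
samples = [[10.0, 10.0], [12.0, 16.0], [14.0, 16.0]]
targets = [10.0, 12.0, 16.0]
losses = epsilon_insensitive_loss(
    targets, [linear_predict([0.5, 0.5], 0.0, x) for x in samples], epsilon=1.0
)
objective = svr_objective([0.5, 0.5], 0.0, samples, targets, epsilon=1.0)
```

Only the second sample contributes to the loss here, because its residual of 2 exceeds the tolerance of 1; the other two residuals fall inside the insensitive band.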

4.3 Feed-forward neural network for regression

ANNs are massively parallel interconnected networks of simple, hierarchically organized elements (artificial neurons) that attempt to interact with the environment in the same way as the biological nervous system [40]. The output of such an artificial neuron can be calculated using (3).

$$\begin{aligned} y = f\left( \sum _{i=1}^{n}w_{i}x_{i}+b\right) \end{aligned}$$
(3)

where \(x_{i}\) are the inputs, n the number of inputs, \(w_{i}\) the synaptic weights, b the bias (threshold) and f the activation function. The most commonly used activation functions are linear, sigmoid, and hyperbolic tangent. Artificial neurons are arranged in several layers and connected by synaptic weights.

In this work, three feed-forward, fully connected neural networks were used for regression. The structure includes an input layer, one or more hidden layers, and an output layer. The input layer takes information (predictor data) from the domain and passes it to all the neurons of the first hidden layer. Just as the first hidden layer is fully connected to the input layer, each subsequent layer is connected to all the neurons of the previous layer. Each neuron of a fully connected layer multiplies each input by its synaptic weight and then adds the products and the bias. The sum is passed through an activation function. The final fully connected layer produces the network’s output (predicted response values). The proposed architecture for the fully connected layered neural network with two hidden layers is shown in Fig. 2. An enlarged diagram of a single artificial neuron is presented separately to show its five components: inputs, synaptic weights, sum, bias, and activation function. We chose two activation functions for hidden layers: a Rectified Linear Unit (ReLU) and a sigmoid function. These functions are described in (4) and (5), respectively. Since this is a regression problem, the activation function of the output layer is the linear function f(x) = x.

$$\begin{aligned} f(x) = \left\{ \begin{matrix} x, &{} x \ge 0\\ 0, &{} x < 0 \end{matrix}\right. \end{aligned}$$
(4)
$$\begin{aligned} f(x) = \dfrac{1}{1+e^{-x}} \end{aligned}$$
(5)

The training is based on the limited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm (LBFGS) [41], where the mean squared error (MSE) is minimized.

We proposed three FFNN architectures for extracting the spatial characteristics of the air pollution concentration: a one-hidden-layer neural network, a bi-layer neural network, and a tri-layer neural network. In the current application, the number of monitoring sites used for predicting each pollutant defines the number of input nodes, and the output layer consists of a single fully connected neuron that returns the prediction. The number of neurons in each hidden layer is configurable. In order to compare the same structure for different air pollutants, we set 10 neurons per hidden layer for each architecture.
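The forward pass of the bi-layer architecture (5 inputs, two hidden layers of 10 neurons, one linear output) can be sketched in plain Python as below. The weights are random and untrained, the LBFGS training step is omitted, and all names (`dense`, `init_layer`, `forward`) as well as the input readings are assumptions of this sketch rather than the code used in the experiments.

```python
import math
import random


def relu(x):
    """Rectified Linear Unit, as in (4)."""
    return x if x >= 0 else 0.0


def sigmoid(x):
    """Sigmoid function, as in (5)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: each neuron applies (3) to all inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random initial weights and zero biases (placeholder, not trained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(inputs, layers, hidden_activation=relu):
    """Run the hidden layers, then a single linear output neuron."""
    *hidden, output = layers
    activations = inputs
    for weights, biases in hidden:
        activations = dense(activations, weights, biases, hidden_activation)
    weights, biases = output
    return dense(activations, weights, biases, lambda x: x)[0]


rng = random.Random(0)
sizes = [5, 10, 10, 1]  # inputs, two hidden layers, output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
# Hypothetical concentrations measured at the five nearby stations.
prediction = forward([30.0, 42.0, 25.0, 38.0, 33.0], layers)
```

Changing `sizes` to `[5, 10, 1]` or `[5, 10, 10, 10, 1]` gives the one-hidden-layer and tri-layer variants, and passing `hidden_activation=sigmoid` switches the hidden activation.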

Fig. 2 Fully interconnected bi-layer FFNN with 5 input nodes, two hidden layers with 10 neurons and the output layer with one neuron. An enlarged diagram of a single artificial neuron is presented separately to show its five components: inputs, synaptic weights, sum, bias, and activation function

Fig. 3 Experimental workflow followed for the prediction of each air pollutant and the evaluation of the proposed models

5 Experiments

Figure 3 shows the diagram of the experimental workflow. It was followed for the evaluation of several prediction models to estimate each air pollutant.

Firstly, we selected the target location, which should be a point with historical concentrations of the air pollutant to be estimated. We chose each target station for its position close to the centre of Madrid, together with the five stations closest to it. However, our decision was constrained by the availability of air pollution sensors at each station. Table 3 shows the monitoring station selected as the target site and the five nearby monitoring stations used to estimate the air pollutant concentrations. The average distance between the target and the selected stations is 3.26 kilometres. The coordinates and distribution of the air quality monitoring stations are shown in Table 1 and in Fig. 1, respectively.

Table 3 The selected monitoring stations and distances to the target site for each air pollutant under evaluation

Secondly, we processed the dataset to get the air pollutant concentrations from the selected stations. Each sample included the air pollutant concentrations for each station taken simultaneously. The samples with some missing values were removed. The input values for IDW and OK models are the latitudes, longitudes, and pollutant concentration at the five nearby stations. However, the input data for SVM and FFNN are the historical values of the pollutants from six stations, for training, and the pollutant concentration at the five nearby stations, for predicting.
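The sample construction described above can be sketched as follows: one sample per timestamp, kept only when every selected station reports a value. The row layout, the station identifiers and the use of `None` (or a missing key) to mark a missing value are assumptions of this sketch, not the structure of the original dataset files.

```python
def complete_samples(rows, station_ids):
    """Keep only the timestamps where every selected station has a value.

    `rows` is a list of (timestamp, {station_id: concentration}) pairs.
    A reading that is absent or None is treated as missing.
    """
    kept = []
    for timestamp, readings in rows:
        values = [readings.get(station) for station in station_ids]
        if all(value is not None for value in values):
            kept.append((timestamp, values))
    return kept


# Hypothetical hourly rows for a target station and two nearby stations.
rows = [
    ("2016-01-01 01:00", {"target": 40.0, "s1": 38.0, "s2": 41.0}),
    ("2016-01-01 02:00", {"target": 35.0, "s1": None, "s2": 39.0}),
    ("2016-01-01 03:00", {"target": 30.0, "s2": 33.0}),
    ("2016-01-01 04:00", {"target": 28.0, "s1": 27.0, "s2": 31.0}),
]
samples = complete_samples(rows, ["target", "s1", "s2"])
```

Each kept sample lists the target value first, followed by the nearby stations, so the first entry can serve as the response and the rest as predictors.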

Then, the predictive models were designed. We developed two IDW models (p=1 and p=2) for evaluating the influence of the distance in the prediction. When p=2, the method is known as the inverse distance squared weighted interpolation. To implement the OK method, four function models were applied to fit the sample variogram: spherical, exponential, Gaussian, and bounded linear. We proposed a linear \(\epsilon \)-SVM for regression and three different FFNN architectures: one hidden layer neural network, a bi-layer neural network, and a tri-layer neural network. The number of monitoring sites for predicting each pollutant defines the number of input nodes, a parameter set to five in our system. The output layer consists of a fully connected neuron to return the prediction. The number of neurons in each hidden layer is a configurable parameter, and we set it to ten to compare the same structure for different air pollutants. We tested two activation functions for hidden layers: a Rectified Linear Unit (ReLU) and a sigmoid function.

Fig. 4 The training phase flowchart of SVM and FFNN methods

In the case of the SVM and FFNN models, the next phase is training. In both cases, we used 80% of the samples to train the model and the remaining 20% to test the performance of the trained model with new data. Figure 4 shows the data and processes of the training stage. The results are the trained FFNN (or SVM) and the difference and correlation measures. The IDW and OK models do not require training since the estimated value is calculated from the simultaneous measurements taken at the nearby stations and the distances to the target site. We used the test set to validate the SVM and FFNN models and the whole dataset to evaluate the IDW and OK models. The accuracy of each model is based on the comparison of the observed and predicted concentrations and the statistical analysis of the residual values.
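A minimal sketch of the 80/20 partition is given below. Whether the samples were partitioned at random or chronologically is not stated here, so the random shuffle, the fixed `seed` and the function name are assumptions of this example.

```python
import random


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample identifiers standing in for the hourly samples.
all_samples = list(range(100))
train_set, test_set = train_test_split(all_samples)
```

The original list is left untouched, and every sample ends up in exactly one of the two parts.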

Finally, we determined the mean performance of each model by using a set of difference and correlation measures: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the coefficient of determination (\(R^{2}\)). \(R^{2}\) is the proportion of variation in the dependent variable that is predicted by the statistical model (range from 0 to 1). Therefore, \(R^{2}\) provides information about the model’s goodness of fit. The following equations determine these error measures between observed (\(x_{i}\)) and predicted (\(y_{i}\)) values, where n represents the number of sampled points used for the estimation and \(\mu _{x}\) is the mean of the observed values.

$$\begin{aligned} MAE = \frac{1}{n}\sum _{i=1}^{n}\left| x_{i} - y_{i} \right| \end{aligned}$$
(6)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( x_{i} - y_{i} \right) ^{2}} \end{aligned}$$
(7)
$$\begin{aligned} R^2 = 1 - \frac{\sum _{i=1}^{n}\left( x_{i} - y_{i} \right) ^{2}}{\sum _{i=1}^{n}\left( x_{i} - \mu _{x} \right) ^{2}} \end{aligned}$$
(8)

This process was followed for the seven selected pollutants: NO, NO2, O3, SO2, CO, PM2.5 and PM10.
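The three indicators in (6), (7) and (8) translate directly into code. The sketch below is only illustrative, and the observed and predicted values are invented for the example.

```python
import math


def mae(observed, predicted):
    """Mean Absolute Error, as in (6)."""
    return sum(abs(x - y) for x, y in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root Mean Squared Error, as in (7)."""
    squared = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, as in (8)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    ss_tot = sum((x - mean_observed) ** 2 for x in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted concentrations.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = (mae(observed, predicted), rmse(observed, predicted), r_squared(observed, predicted))
```

For these values the MAE is 2.5, the RMSE is about 2.55 and \(R^{2}\) is 0.948. Note that, computed as in (8), \(R^{2}\) can fall below 0 when a model predicts worse than the mean of the observations.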

6 Results

Six methods for modelling the spatial characteristics of the air pollutant concentrations were implemented and evaluated for each of the seven air pollutants (NO2, NO, CO, SO2, O3, PM2.5, and PM10). The experiments used historical pollutant data from monitoring stations of Madrid city’s air quality monitoring network collected from January 1, 2016, to December 31, 2018. Among the different model variants that were tested for the IDW and OK methods, only the best ones are shown in the results. We proposed three FFNN architectures: a FFNN with one fully connected layer (FCL), two FCLs, and three FCLs. Except in the prediction of SO2 concentration, we chose to apply the ReLU function in the hidden layers because it has a greater accuracy than the sigmoid one.

Tables 4, 5, 6, 7, 8, 9 and 10 present the assessment of the statistical (IDW and OK) and machine learning (SVM and FFNN) methods by RMSE, MAE and \(R^2\). RMSE and MAE are in the same units of pollutant concentrations, and \(R^2\) ranges from 0 to 1.

Table 4 Overall performance metrics for the prediction of NO2
Table 5 Overall performance metrics for the prediction of NO
Table 6 Overall performance metrics for the prediction of CO
Table 7 Overall performance metrics for the prediction of SO2
Table 8 Overall performance metrics for the prediction of O3
Table 9 Overall performance metrics for the prediction of PM2.5
Table 10 Overall performance metrics for the prediction of PM10
Fig. 5 FFNN prediction performance: (Left) scatter plots showing observed versus predicted concentrations for the pollutants based on the 20% of data used for validation. (Centre) Residual plot of the difference between the predicted and true responses. (Right) Residual histograms showing the relative probability of error values

The results for predicting NO2 and NO showed a better predictive performance of machine learning methods over spatial statistical models. The bi-layer FFNN method presents maximum \(R^{2}\) and minimum RMSE for estimating NO2 and NO air pollutants. The proposed models for estimating the CO concentrations result in low accuracy. None of the methods is appropriate to extract the spatial characteristics of the CO concentrations. A possible reason for this result may be that CO is a primary pollutant whose concentration is closely linked to local combustion phenomena, aggravated in some cases by low dispersion due to thermal inversion or lack of wind. Heavy traffic or traffic jams in a specific area can cause high local measurements of CO while hardly affecting neighbouring areas. A higher density of measurement stations could be needed to get better results for CO because its concentration can decrease quickly with the distance from the pollution source. In the case of SO2, the accuracy of the machine learning methods is vastly superior to that of the spatial statistical models. FFNN methods show a marked improvement when the number of hidden layers is increased. The prediction models of O3, PM2.5, and PM10 present similar performances, with slightly higher accuracy for bi-layer FFNN methods.

The bi-layer FFNN method exhibits the best result for most of the air pollutants under evaluation, except for the prediction of SO2, where the best performance is achieved by the tri-layer FFNN. The highest coefficient of determination (0.9) is reached by the bi-layer FFNN method for the prediction of NO2 and PM10. In most cases, we could rank the methods according to their performance (from best to worst) as follows: bi-layer FFNN, tri-layer FFNN, one-hidden-layer FFNN, SVM, IDW, and OK. The IDW and OK methods present similar results and have the lowest accuracy for all the evaluated air pollutants. The SVM models exhibit lower predictive ability than the FFNN methods for fitting air pollution concentration.

Figure 5 shows the bi-layer FFNN performance for the seven air pollutants (NO, NO2, CO, SO2, O3, PM2.5 and PM10), based on the 20% of data used for validation. In the left column, the scatter plots of observed versus predicted concentrations for each air pollutant are shown. In all cases, low dispersion is observed along the diagonal of the optimum prediction. The centre column shows the residual plots that display the difference between the predicted and measured values. The right column contains the residual histograms showing the relative probability of predictive errors. A general remark is that many error values are around zero and the residual distributions are in accordance with the mean model performance metrics.

7 Discussion

The main objective of this work was to estimate the air pollution concentration at a given location from the measurements taken at nearby stations, in order to fill in missing values and detect uncalibrated sensors. We have developed several statistical (IDW and OK) and machine learning (SVM, FFNN, bi-layer FFNN, tri-layer FFNN) methods to model the spatial characteristics of the concentrations of seven air pollutants (NO2, NO, CO, SO2, O3, PM2.5, and PM10) measured by Madrid’s air quality monitoring network. The models were evaluated and compared using MAE, RMSE, and \(R^2\) as accuracy indicators.
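For reference, the three accuracy indicators can be computed as in the sketch below. It uses the standard definitions and assumes that \(R^2\) is the coefficient of determination, 1 minus the ratio of the residual to the total sum of squares; the function name `regression_metrics` is illustrative only.

```python
import math


def regression_metrics(observed, predicted):
    """Return (MAE, RMSE, R^2) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    mae = sum(abs(e) for e in errors) / n           # mean absolute error
    rmse = math.sqrt(ss_res / n)                    # root mean squared error
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```

MAE and RMSE are expressed in the units of the concentration, whereas \(R^2\) is dimensionless, which is why it is the metric used to compare results across pollutants.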

Table 11 Effect of decreasing FFNN architecture

The IDW and OK statistical models reached \(R^2\) values greater than 0.75 for NO2, NO, O3, and PM10. The prediction accuracy of the FFNN methods is better than that of IDW and OK for all analysed air pollutants. The accuracy difference between the geostatistical models and the FFNN methods is larger for NO2, NO, and SO2 than for O3, PM2.5, and PM10. The results show the effectiveness of the bi-layer FFNN model in fitting the spatial correlation of the concentrations of NO2, NO, SO2, O3, PM2.5, and PM10. For the prediction of SO2, the tri-layer FFNN model improves on the accuracy of the bi-layer FFNN. Therefore, the proposed systems can be applied directly to provide missing values of the air monitoring network and to detect uncalibrated sensors. None of the methods is appropriate for extracting the spatial characteristics of the CO concentrations. A possible reason is that the CO concentration is closely linked to local combustion phenomena. A higher density of measurement stations may be needed to obtain better results for CO.
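The dependence of a distance-based baseline on station density can be seen from how IDW forms its estimate. The sketch below shows the generic IDW formula, a weighted mean with weights \(1/d^p\); the function name `idw_estimate` and the default power `p = 2` are assumptions for illustration and are not necessarily the configuration used in this work.

```python
def idw_estimate(values, distances_km, power=2.0):
    """Inverse-distance-weighted estimate at a target location.

    values: concentrations measured at the neighbouring stations.
    distances_km: distance from each station to the target location.
    power: exponent p of the weight 1 / d**p (assumed value, not from the paper).
    """
    if len(values) != len(distances_km) or not values:
        raise ValueError("inputs must be non-empty and of equal length")
    for v, d in zip(values, distances_km):
        if d == 0:
            # The target coincides with a station: use its measurement directly.
            return float(v)
    weights = [1.0 / d ** power for d in distances_km]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Because the estimate is always a weighted mean of the neighbouring measurements, it cannot reproduce a local peak that none of the neighbouring stations observe, which is consistent with the difficulty reported for CO.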

As mentioned in Section 2, the models proposed by [12] and [7] reach very high \(R^2\) values compared with the rest of the recent research analysed in that section. The proposed bi-layer FFNN-based system reaches a better \(R^2\) value of 0.9 for the prediction of NO2. The prediction of SO2 is analysed less frequently in the recent related studies and yields worse performance: [12] reports the highest \(R^2\) value, 0.81, whereas the tri-layer FFNN system in this work reaches an \(R^2\) value of 0.79. [12] proposed a model based on an LSTM network with n hidden layers and a fully connected layer. According to the results presented in [12] and in this work, adding a fully connected layer improves the performance of the prediction model. The input data of the models proposed by [12] and [7] are the air pollution concentrations collected at the monitoring stations together with meteorological data. A possible future improvement is to introduce an LSTM to extract the temporal correlation and to add meteorological variables such as temperature, dew point, pressure, wind direction, and wind speed to the input data.

Several ablation experiments were carried out to evaluate the effect of reducing components of the proposed neural network architecture. The effect of the hidden layer size was examined with two implementations of the bi-layer FFNN, with seventy and fifty percent reduction of neurons in the hidden layers. For most of the pollutants analysed, the performance of the structure with a seventy percent reduction is similar to that of the full bi-layer FFNN: it obtains the same \(R^2\) value with a slight degradation of MAE and RMSE. In the case of SO2, the \(R^2\) value decreases by 0.02. In contrast, reducing the number of neurons in the hidden layers to fifty percent decreases the system’s accuracy in all cases, as shown by the \(R^2\) metric in Table 11. The performance for PM10 prediction decreases markedly, to an \(R^2\) value of 0.61. To verify the influence of the input features on the performance of the proposed system, the measurements of the farthest station were removed from the input. Station 1 was removed for predicting NO2 and NO, and station 10 for CO and PM10. For SO2, O3, and PM2.5, stations 13, 4, and 7 were removed, respectively. The third entry in Table 11 shows the performance of the system with four inputs instead of five. The \(R^2\) value decreases for all pollutants, with the largest difference for SO2. These ablation experiments were also performed for the single-hidden-layer FFNN and tri-layer FFNN systems, with results similar to those of the bi-layer FFNN.
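The two ablation settings can be sketched as simple transformations of the network configuration and of the input matrix. This is an illustration under stated assumptions: the helper names, the example layer widths, and the station distances are hypothetical, and `keep_fraction` is interpreted as the fraction of neurons retained in each hidden layer.

```python
def shrink_hidden_layers(hidden_sizes, keep_fraction):
    """Scale every hidden-layer width, keeping at least one neuron per layer."""
    return [max(1, round(n * keep_fraction)) for n in hidden_sizes]


def drop_farthest_station(rows, distances_km):
    """Remove the input column of the station farthest from the target.

    rows: one list of station measurements per time step.
    distances_km: distance from each input station to the target station.
    Returns the reduced rows and the index of the removed column.
    """
    farthest = max(range(len(distances_km)), key=lambda i: distances_km[i])
    reduced = [[v for i, v in enumerate(row) if i != farthest] for row in rows]
    return reduced, farthest


# Hypothetical example: a bi-layer network and five neighbouring stations.
full_layers = [20, 10]                        # assumed widths, not from the paper
layers_70 = shrink_hidden_layers(full_layers, 0.7)
layers_50 = shrink_hidden_layers(full_layers, 0.5)
inputs_4, removed = drop_farthest_station(
    [[31.0, 28.5, 40.2, 35.1, 29.9]],         # one time step, five stations
    [0.8, 3.4, 1.2, 2.0, 2.9],                # assumed distances in km
)
```

Each variant is then trained and evaluated with the same data split and metrics as the full model, so that any change in \(R^2\) can be attributed to the removed component.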

8 Conclusion

In this study, three FFNN architectures (with one, two, and three fully connected hidden layers) were implemented and evaluated for modelling the spatial correlation of the concentrations of seven air pollutants: NO2, NO, CO, SO2, O3, PM2.5, and PM10. They were compared with other exposure modelling approaches: an SVM model and two geostatistical models (IDW and OK). The input dataset consists of the historical pollutant measurements collected by Madrid’s air quality monitoring network from January 1, 2016, to December 31, 2018.

The performance results reveal that the bi-layer FFNN and tri-layer FFNN systems are suitable for the spatial prediction of NO2, NO, SO2, O3, PM2.5, and PM10 concentrations, with \(R^2\) values of 0.9, 0.83, 0.79, 0.88, 0.75, and 0.91, respectively. The comparison shows that the FFNN models are superior to the geostatistical methods and slightly better than the SVM for fitting the spatial correlation of air pollutant measurements (NO2, NO, CO, SO2, O3, PM2.5, and PM10) collected at nearby locations (less than 3.5 kilometres apart). For the prediction of NO2 and SO2 concentrations, the bi-layer FFNN and tri-layer FFNN models achieve an accuracy similar to that of recent studies based on BPNN and deep neural networks.

In future work, we expect to introduce an LSTM neural network to extract the temporal correlation of the air pollution concentrations. We will also extend the system input with meteorological variables such as temperature, dew point, pressure, wind direction, and wind speed, and evaluate their effect on the performance of the prediction system.