Data Fusion from Multiple Stations for Estimation of PM2.5 in Specific Geographical Location

Becerra, Miguel A.; Sánchez, Marcela Bedoya; Carvajal, Jacobo García; Luna, Jaime A. Guzmán; Peluffo-Ordóñez, Diego H.; Tobón, Catalina

doi:10.1007/978-3-319-52277-7_52

Miguel A. Becerra^16,17,
Marcela Bedoya Sánchez¹⁶,
Jacobo García Carvajal¹⁶,
Jaime A. Guzmán Luna¹⁷,
Diego H. Peluffo-Ordóñez^18,19 &
Catalina Tobón²⁰

Conference paper
First Online: 16 February 2017

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10125)

Abstract

An important decrease in air quality has been observed in recent years, caused by contamination levels that can change the natural composition of the air. This is a problem not only for the environment but also for public health. This paper therefore compares approaches based on the Adaptive Neural Fuzzy Inference System (ANFIS) and Support Vector Regression (SVR) for estimating the level of PM2.5 (particulate matter smaller than 2.5 micrometers) at specific geographic locations from nearby stations. The systems were validated using an environmental database from the air quality network of Valle de Aburrá (AMVA) in Medellín, Colombia, which records 5 meteorological variables and 2 pollutants at 3 nearby measurement stations. The work analyzes the relevance of the features obtained at each station for estimating PM2.5 levels at the target station, using four selectors based on Rough Set Feature Selection (RSFS) algorithms. In addition, five PM2.5 estimation systems were compared, three based on ANFIS and two based on SVR, in order to obtain an objective and efficient mechanism for estimating PM2.5 levels at specific geographic locations by fusing data from nearby monitoring stations.

1 Introduction

The World Health Organization (WHO) has studied the harmful effects of polluted air on human health, showing the need to monitor different types of pollutants, including PM2.5 particles, in the cities of developed and developing countries [1]. There is evidence that these particles are strongly associated with cancer, heart disease, lung disease, and lower respiratory tract infections, considerably increasing morbidity and mortality in the population. Measuring or estimating PM2.5 effectively at any location is therefore very important for determining air quality, establishing preventive measures, and reducing health risks to the population.

Three types of models are widely used for predicting air quality: probabilistic, autoregressive, and hybrid [2]. Probabilistic models use different techniques to assess the relationship between air quality and meteorological factors [3]. They are usually computationally cheap models, such as Hidden Markov Models (HMM), which allow accurate statistical prediction with relatively less detailed data in many areas [2,3,4]. However, these techniques have problems when the sampled data are limited: the model may not work as expected, which limits its ability to predict future events. The nonlinear relationship between PM2.5 and meteorological data poses a nonlinear regression problem, so artificial neural network (ANN) models are used, which can detect underlying nonlinear relationships between the responses and the predictors. They can be trained with several algorithms and require far fewer computational resources [4,5,6,7,8]. Autoregressive models have been widely used to predict air quality, but with variable precision, which depends on how well they apply to nonlinear processes and on the quality of the input data [9, 10]. Support vector machines (SVM) were designed to solve nonlinear classification problems, but because of their ability to generalize they have also been applied to regression problems and time series forecasting. The SVM relies on a kernel function, to which its great capacity for generalization is attributed, even when the training set is small; neither the generalization nor the training process necessarily depends on the number of attributes, which allows excellent performance in high-dimensional problems [11].
Hybrid models such as ANFIS [12] perform better because they can predict maximum PM2.5 concentrations, which are considered a critical factor in air pollution prediction systems [8]; they also give adequate solutions to nonlinear problems and to ambiguity and randomness in the data. Most other reported studies on the prediction of meteorological variables and contaminants are limited with respect to time and generality. Moreover, they do not focus on estimating variables at specific geographic locations in cities with a limited number of measurement stations and/or limited measured variables, as is the case in the city of Medellín [13,14,15,16,17,18,19]; addressing this case is the principal contribution of this work.

In this study, three ANFIS approaches and two SVM approaches for estimating PM2.5 were compared, together with a relevance analysis based on Rough Set Feature Selection. The analysis aimed to reduce the number of features obtained from the fused measurements of nearby monitoring stations, in order to provide an objective and accurate mechanism for more reliable estimation of the PM2.5 concentration at a specific geographical location. The estimate is obtained from meteorological data and air quality variables, so that the community has adequate information about air quality for making decisions on environmental exposure.

2 Materials and Methods

2.1 Database

The air quality network of “Valle de Aburrá” (AMVA) has 22 fixed measurement sites for air quality and meteorology, distributed across the municipalities and the jurisdiction of the Metropolitan area. Three stations were selected, taking data completeness and location as the criteria, as shown in Fig. 1. They are described as follows: (i) “Urbana Museo de Antioquia” (MED-MAINT) station, with the following coordinates: an altitude of 1488 [MASL], a latitude of 6°15′08.48″ North, and a longitude of 75°34′07.37″ West. (ii) “Fondo Urbano Universidad Nacional de Colombia Sede Medellín – Núcleo el Volador” (MED-UNNV) station, with the following coordinates: an altitude of 1506 [MASL], a latitude of 6°15′34.39″ North, and a longitude of 75°34′32.46″ West. (iii) “Suburbana UNE – Casa Yalta, Loma Los Balsos” (MED-UNEP) station, with the following coordinates: an altitude of 1848 [MASL], a latitude of 6°11′11.07″ North, and a longitude of 75°33′26.55″ West. These stations measure particles smaller than 2.5 micrometers (PM2.5), have automatic monitors of ozone (O3), and include meteorological stations that generate data on temperature, humidity, wind speed, wind direction, and radiation.

2.2 Theoretical Background

Adaptive Neural Fuzzy Inference System (ANFIS).

This method uses fuzzy logic to transform the given inputs into the desired outputs through highly interconnected neural network processing elements and weighted information connections that map the inputs to an output. For simplicity, two fuzzy IF-THEN rules based on a first-order Sugeno model [20] are considered: $ \text{Rule 1: IF } x \text{ is } A_{1} \text{ AND } y \text{ is } B_{1}, \text{ THEN } f_{1} = p_{1} x + q_{1} y + r_{1} $ and $ \text{Rule 2: IF } x \text{ is } A_{2} \text{ AND } y \text{ is } B_{2}, \text{ THEN } f_{2} = p_{2} x + q_{2} y + r_{2} $. Here, x and y are the inputs, $ A_{i} $ and $ B_{i} $ are fuzzy sets, $ f_{i} $ are the outputs within the fuzzy region specified by the fuzzy rule, and $ p_{i}, q_{i}, r_{i} $ are design parameters that are adjusted during the training process. Layer 1 maps each input variable to its fuzzy membership degree, $ O_{i} = \mu_{A_{i}}(x) $. In layer 2, the $ \varPi $ nodes act as an AND operator that computes the firing strength of each rule, i.e. $ w_{i} = \mu_{A_{i}}(x)\,\mu_{B_{i}}(y),\; i = 1, 2, 3, \ldots, N $. Layer 3 has N nodes that normalize the firing strengths from the previous layer, $ \bar{w}_{i} = w_{i} / \sum_{j} w_{j} $. In layer 4, the nodes $ i $ are adaptive, and the output of each node is the product of the normalized firing strength and a first-order polynomial (for a first-order Sugeno model), $ O_{i} = \bar{w}_{i} f_{i} = \bar{w}_{i}\left( p_{i} x + q_{i} y + r_{i} \right),\; i = 1, 2, 3, \ldots, N $. Finally, the overall output f of the model is given by a single fixed node that sums all incoming signals, i.e. $ f = \sum_{i} \bar{w}_{i} f_{i} $ (Fig. 2).
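The layer-by-layer description can be illustrated with a minimal Python sketch of the forward pass of a two-rule, first-order Sugeno system. The Gaussian membership functions, the rule parameters in `RULES`, and the function names are illustrative assumptions; they are not taken from the systems trained in this work, and no parameter training is shown.

```python
import math


def gaussian_mf(value, center, sigma):
    """Gaussian membership degree (an assumed shape; the text does not fix one)."""
    return math.exp(-0.5 * ((value - center) / sigma) ** 2)


def sugeno_forward(x, y, rules):
    """Forward pass of a first-order Sugeno system, following the ANFIS layers."""
    # Layers 1-2: membership degrees and firing strength w_i = mu_Ai(x) * mu_Bi(y)
    strengths = [
        gaussian_mf(x, *rule["A"]) * gaussian_mf(y, *rule["B"]) for rule in rules
    ]
    # Layer 3: normalized firing strengths w_bar_i = w_i / sum_j w_j
    total = sum(strengths)
    normalized = [w / total for w in strengths]
    # Layer 4: normalized strength times the consequent f_i = p_i*x + q_i*y + r_i
    weighted = [
        w_bar * (rule["p"] * x + rule["q"] * y + rule["r"])
        for w_bar, rule in zip(normalized, rules)
    ]
    # Layer 5: overall output f = sum_i w_bar_i * f_i
    return sum(weighted), normalized


# Hypothetical rule base: each fuzzy set is a (center, sigma) pair.
RULES = [
    {"A": (0.2, 0.3), "B": (0.2, 0.3), "p": 1.0, "q": 0.5, "r": 0.0},
    {"A": (0.8, 0.3), "B": (0.8, 0.3), "p": -0.5, "q": 1.0, "r": 0.2},
]

output, normalized = sugeno_forward(0.5, 0.5, RULES)
print(output, normalized)
```

With these assumed parameters the input (0.5, 0.5) is equally distant from both rule centers, so both rules fire with the same strength and the output is close to the average of the two consequents, about 0.6.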

Support Vector Machines.

Support vector machines (SVMs) were proposed in [21]. To solve a nonlinear problem, the input (predictor) variable is mapped nonlinearly into a high-dimensional feature space [22], and the SVR function is formulated as $ y = \omega \phi \left( x \right) + b $, where $ \phi \left( x \right) $ is called the feature, which is mapped nonlinearly from the input space x. The coefficients $ \omega $ and $ b $ are estimated by minimizing:

$$ R\left( c \right) = c\,\frac{1}{N}\sum_{i = 1}^{N} L_{\varepsilon}\left( d_{i}, y_{i} \right) + \frac{1}{2}\left\| \omega \right\|^{2} $$

$$ L_{\varepsilon}\left( d, y \right) = \begin{cases} \left| d - y \right| - \varepsilon, & \left| d - y \right| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} $$

where both c and $ \varepsilon $ are prescribed parameters. The function $ L_{\varepsilon } \left( {d , y} \right) $ in the first term is called the $ \varepsilon $-insensitive loss function. The mapping $ \phi $ does not have to be computed explicitly; it is handled through the kernel function $ k \left( {\tilde{x},\bar{x}} \right) $. The value of the kernel is equal to the inner product of the two vectors $ \tilde{x} $ and $ \bar{x} $ in the feature space, $ \phi \left( {\tilde{x}} \right) $ and $ \phi \left( {\bar{x}} \right) $, such that $ k \left( {\tilde{x},\bar{x}} \right) = \phi \left( {\tilde{x}} \right) \cdot \phi \left( {\bar{x}} \right) $. Any function satisfying Mercer’s condition can be used as the kernel function.
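As a worked illustration of these definitions, the following minimal Python sketch evaluates the $ \varepsilon $-insensitive loss and the regularized risk $ R(c) $ for a handful of values, and defines a Gaussian and a polynomial kernel. The sample values, the weight vector, and the parameter names `gamma`, `degree`, and `coef0` are assumptions chosen for illustration; the sketch evaluates the objective only and does not solve the SVR optimization problem.

```python
import math


def eps_insensitive_loss(d, y, eps):
    """L_eps(d, y): zero inside the eps-tube, |d - y| - eps outside it."""
    return max(abs(d - y) - eps, 0.0)


def regularized_risk(targets, predictions, weights, c, eps):
    """R(c) = c * (1/N) * sum_i L_eps(d_i, y_i) + 0.5 * ||w||^2."""
    n = len(targets)
    empirical = sum(
        eps_insensitive_loss(d, y, eps) for d, y in zip(targets, predictions)
    ) / n
    return c * empirical + 0.5 * sum(w * w for w in weights)


def gaussian_kernel(u, v, gamma):
    """k(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """k(u, v) = (u . v + coef0) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + coef0) ** degree


# Hypothetical targets d_i, predictions y_i, and weight vector.
targets = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 2.0]
risk = regularized_risk(targets, predictions, weights=[1.0, 2.0], c=10.0, eps=0.1)
print(risk)
print(gaussian_kernel([0.1, 0.2], [0.1, 0.2], gamma=0.5))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))
```

The first prediction falls inside the tube and contributes no loss, which is the property that makes the resulting regression depend only on the samples lying on or outside the tube.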

Proposed Procedure.

Figure 3 shows the proposed procedure to estimate PM2.5 at a specific geographical location. First, the samples with outliers were eliminated, and the database was normalized between 0 and 1. Then, the data of the selected stations were fused. Next, the relevance analysis was carried out using Rough Set-Neighbor (RS-N), Rough Set-Entropy (RS-E), Fuzzy Rough Set-Neighbor (FRS-N), and Fuzzy Rough Set-Entropy (FRS-E). These techniques are explained in detail in [23]. Each of the parameters (inclusion and neighbor) of the algorithms was adjusted heuristically, taking values between 0.05 and 1 in increments of 0.05. The results were then combined according to the common relevance of the data given by the four selection methods, yielding a reduct, which was used for training 5 regression approaches. Three systems based on ANFIS were trained; each used a different algorithm (Grid Partition-GP, Fuzzy C-means-FC [21], and Subtractive Clustering-SC [24, 25]) for establishing the initial FIS. Their parameters were adjusted in order to maximize accuracy. In the same way, two SVR systems with two different kernels (polynomial kernel-KP and Gaussian kernel-KG) were trained, and their parameters C and gamma were adjusted to maximize accuracy. Finally, all approaches were validated using a 30-fold cross-validation procedure (70/30 split).
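The data-handling steps of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: fusion is assumed to be a sample-by-sample concatenation of station features with aligned timestamps, the validation is read as 30 repeated random 70/30 hold-out splits, and all function names and sample values are hypothetical. Outlier removal, the rough-set selectors, and the regression models are not shown.

```python
import random


def min_max_normalize(column):
    """Scale one variable to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def fuse_stations(stations):
    """Concatenate per-station feature rows sample by sample.

    Assumes every station provides the same number of time-aligned samples.
    """
    return [[value for row in rows for value in row] for rows in zip(*stations)]


def parameter_grid(start=0.05, stop=1.0, step=0.05):
    """Candidate values for the inclusion/neighbor parameters: 0.05, 0.10, ..., 1."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]


def repeated_holdout(n_samples, repetitions=30, train_fraction=0.7, seed=0):
    """Index pairs (train, test) for repeated random 70/30 splits (an assumed reading)."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    splits = []
    for _ in range(repetitions):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# Hypothetical, already normalized samples: [wind speed, humidity] per station.
station_a = [[0.1, 0.7], [0.3, 0.6]]
station_b = [[0.2, 0.8], [0.4, 0.5]]
station_c = [[0.0, 0.9], [0.5, 0.4]]

fused = fuse_stations([station_a, station_b, station_c])
grid = parameter_grid()
splits = repeated_holdout(n_samples=10)
print(fused[0], len(grid), len(splits))
```

Each fused row holds the features of all three stations for one time step, so a feature selector applied afterwards can rank variables across stations rather than within a single one.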

3 Results and Discussion

Figure 4 shows the results of the relevance analysis carried out with the four selection algorithms (RS-N, FRS-N, RS-E, and FRS-E), which identified wind speed, wind direction, and humidity as the relevant variables for estimating PM2.5.

Table 1 shows the accuracy of the different techniques in estimating the PM2.5 concentration at each station. The notations ANFIS-SC, ANFIS-GP, ANFIS-FC, SVR-KP, and SVR-KG stand for adaptive neural fuzzy inference system with subtractive clustering, adaptive neural fuzzy inference system with grid partition, adaptive neural fuzzy inference system with fuzzy C-means, support vector regression with polynomial kernel, and support vector regression with Gaussian kernel, respectively. The best results were obtained with ANFIS-FC and SVR-KP, at 93.4% each. However, the difference among the systems can be considered minimal (3%). The worst results were obtained by SVR-KG, at 87.5%.

Table 1. PM2.5 estimation -ANFIS and SVR

4 Conclusions

In this study, meteorological and pollutant variables obtained from three nearby measurement stations were fused, and a relevance analysis was carried out using selection techniques based on RS. This made it possible to obtain objective and accurate mechanisms for estimating PM2.5. In addition, five regression systems were compared, and their results were similar. The best results were obtained using the three ANFIS systems and one SVM system, which demonstrates that the selected variables, together with the data fusion of nearby stations, are able to estimate PM2.5 at a specific geographical location, and thus the effectiveness of the proposed procedure. However, the parameter adjustment of the regression systems could be optimized using metaheuristic algorithms in order to obtain better accuracy. In addition, this study should be extended using databases from other geographical locations in order to increase its generality.

References

1. OMS | Calidad del aire (exterior) y salud, WHO. http://www.who.int/mediacentre/factsheets/fs313/es/. Accessed 24 Oct 2015
2. Dong, M., Yang, D., Kuang, Y., He, D., Erdal, S., Kenski, D.: PM2.5 concentration prediction using hidden semi-Markov model-based times series data mining. Expert Syst. Appl. 36(5), 9046–9055 (2009)
3. Sun, W., Zhang, H., Palazoglu, A., Singh, A., Zhang, W., Liu, S.: Prediction of 24-hour-average PM2.5 concentrations using a hidden Markov model with different emission distributions in Northern California. Sci. Total Environ. 443, 93–103 (2013)
4. Mishra, D., Goyal, P., Upadhyay, A.: Artificial intelligence based approach to forecast PM2.5 during haze episodes: a case study of Delhi, India. Atmos. Environ. 102, 239–248 (2015)
5. Perez, P., Gramsch, E.: Forecasting hourly PM2.5 in Santiago de Chile with emphasis on night episodes. Atmos. Environ. Part A 124, 22–27 (2016)
6. Zhou, Q., Jiang, H., Wang, J., Zhou, J.: A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network. Sci. Total Environ. 496, 264–274 (2014)
7. Antanasijević, D.Z., Pocajt, V.V., Povrenović, D.S., Ristić, M.Đ., Perić-Grujić, A.A.: PM10 emission forecasting using artificial neural networks and genetic algorithm input variable optimization. Sci. Total Environ. 443, 511–519 (2013)
8. Feng, X., Li, Q., Zhu, Y., Hou, J., Jin, L., Wang, J.: Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos. Environ. 107, 118–128 (2015)
9. Kumar, A., Goyal, P.: Forecasting of daily air quality index in Delhi. Sci. Total Environ. 409(24), 5517–5523 (2011)
10. Qin, S., Liu, F., Wang, J., Sun, B.: Analysis and forecasting of the particulate matter (PM) concentration levels over four major cities of China using hybrid models. Atmos. Environ. 98, 665–675 (2014)
11. Velásquez, J.D., Olaya, Y., Franco, C.J.: Time series prediction using support vector machines. Ingeniare, 64–75 (2010)
12. Popoola, O., Munda, J., Mpanda, A., Popoola, A.P.I.: Comparative analysis and assessment of ANFIS-based domestic lighting profile modelling. Energy Build. 107, 294–306 (2015)
13. Klaić, Z.B., Hrust, L.: Neural network forecasting of air pollutants hourly concentrations using optimised temporal averages of meteorological variables and pollutant concentrations. Atmos. Environ. 43(35), 5588–5596 (2009)
14. Gardner, M.W., Dorling, S.R.: Neural network modelling and prediction of hourly NOx and NO2 concentrations in urban air in London. Atmos. Environ. 33(5), 709–719 (1999)
15. Yildirim, Y., Bayramoglu, M.: Adaptive neuro-fuzzy based modelling for prediction of air pollution daily levels in city of Zonguldak. Chemosphere 63(9), 1575–1582 (2006)
16. Hoshyaripour, G., Noori, R.: Uncertainty analysis of developed ANN and ANFIS models in prediction of carbon monoxide daily concentration. Atmos. Environ. 44(4), 476–482 (2010)
17. Assareh, E., Behrang, M.A.: The potential of different artificial neural network (ANN) techniques in daily global solar radiation modeling based on meteorological data. Sol. Energy 84(8), 1468–1480 (2010)
18. Pai, T.-Y., Hanaki, K., Su, H.-C., Yu, L.-F.: A 24-h forecast of oxidant concentration in Tokyo using neural network and fuzzy learning approach. CLEAN – Soil, Air, Water 41(8), 729–736 (2013)
19. Polat, K.: A novel data preprocessing method to estimate the air pollution (SO2): neighbor-based feature scaling (NBFS). Neural Comput. Appl. 21(8), 1–8 (2001)
20. Jang, J.S.R.: ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 23(3), 665–685 (1993)
21. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer New York, New York (2000)
22. Deo, R.C., Wen, X., Qi, F.: A wavelet-coupled support vector machine model for forecasting global incident solar radiation using limited meteorological dataset. Appl. Energy 168, 568–593 (2016)
23. Orrego, D.A., Becerra, M.A., Delgado-Trejos, E.: Dimensionality reduction based on fuzzy rough sets oriented to ischemia detection. In: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5282–5285 (2012)
24. Chiu, S.L.: Fuzzy model identification based on cluster estimation. J. Intell. Fuzzy Syst. 2(3), 267–278 (1994)
25. Lohani, A.K., Goel, N.K., Bhatia, K.K.S.: Improving real time flood forecasting using fuzzy inference system. J. Hydrol. 509, 25–41 (2014)

Acknowledgments

This work was supported by the research project identified with code 267 at the “Institución Universitaria Salazar y Herrera” of Medellín, Colombia, the CALAIRE Laboratory of the “Universidad Nacional de Colombia”, and the Area Metropolitana de Medellín, which supplied the database.

Author information

Authors and Affiliations

GEA Research Group, Institución Universitaria Salazar y Herrera, Medellín, Colombia
Miguel A. Becerra, Marcela Bedoya Sánchez & Jacobo García Carvajal
SINTELWEB Research Group, Universidad Nacional de Colombia, Medellín, Colombia
Miguel A. Becerra & Jaime A. Guzmán Luna
Facultad de Ingeniería en Ciencias Aplicadas-FICA from Universidad Técnica del Norte, Ibarra, Ecuador
Diego H. Peluffo-Ordóñez
Department of Electronics, Universidad de Nariño, Pasto, Colombia
Diego H. Peluffo-Ordóñez
Universidad de Medellín, Medellín, Colombia
Catalina Tobón

Corresponding author

Correspondence to Miguel A. Becerra.

Editor information

Editors and Affiliations

Pontificia Universidad Católica del Perú, Lima, Peru
César Beltrán-Castañón
Uppsala University, Uppsala, Sweden
Ingela Nyström
University of Ottawa, Ottawa, Ontario, Canada
Fazel Famili

Cite this paper

Becerra, M.A., Sánchez, M.B., Carvajal, J.G., Luna, J.A.G., Peluffo-Ordóñez, D.H., Tobón, C. (2017). Data Fusion from Multiple Stations for Estimation of PM2.5 in Specific Geographical Location. In: Beltrán-Castañón, C., Nyström, I., Famili, F. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2016. Lecture Notes in Computer Science, vol 10125. Springer, Cham. https://doi.org/10.1007/978-3-319-52277-7_52

DOI: https://doi.org/10.1007/978-3-319-52277-7_52
Published: 16 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52276-0
Online ISBN: 978-3-319-52277-7
eBook Packages: Computer Science, Computer Science (R0)
