Abstract
R has recently become a popular tool for big data analysis because of its mature packages for data analysis and visualization, including the analysis of air pollution. Air pollution is of increasing global concern because it greatly affects the environment and human health. With the rapid development of IoT and the increasing accuracy of the geographical information collected by sensors, a huge amount of air pollution data has been generated. It is therefore difficult to analyze air pollution data effectively and reliably in a single-machine environment, because of the inherent memory limitations of such an environment. In this work, we construct a distributed computing environment based on both RHadoop and SparkR so that the analysis and visualization of air pollution with R can be performed more reliably and effectively. We first use sensors called EdiGreen AirBox to collect air pollution data in Taichung, Taiwan. We then adopt the Inverse Distance Weighting (IDW) method to transform the sensor data into a density map. Finally, the experimental results show the accuracy of short-term PM2.5 predictions obtained with the ARIMA model, and the prediction accuracy is verified with the mean absolute percentage error (MAPE).
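As a rough illustration of two of the methods named in the abstract, the following sketch computes an inverse-distance-weighted estimate at an unsampled point and the MAPE of a forecast. It is written in Python rather than R, it is not the authors' implementation, and the function names, the distance exponent of 2, and all sample readings are assumptions made only for this example.

```python
import math


def idw_interpolate(stations, query, power=2.0):
    """Estimate a value at `query` = (x, y) from (x, y, value) readings.

    Each reading is weighted by 1 / distance**power, so nearer sensors
    contribute more. `power=2.0` is a common default, assumed here.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            # The query point coincides with a sensor: use its reading directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Requires equal-length, non-empty sequences and nonzero actual values.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical PM2.5 readings (x, y, value) from two sensors.
sensors = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
midpoint_estimate = idw_interpolate(sensors, (1.0, 0.0))

# Hypothetical observed and forecast PM2.5 values.
forecast_error = mape([100.0, 200.0], [110.0, 180.0])
```

Evaluating `idw_interpolate` over a regular grid of query points would give the kind of density map the abstract describes, and a lower `mape` value indicates a closer forecast.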
Acknowledgements
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2634-E-029-001 and MOST 106-2621-M-029-001.
Cite this article
Yang, CT., Chan, YW., Liu, JC. et al. An implementation of cloud-based platform with R packages for spatiotemporal analysis of air pollution. J Supercomput 76, 1416–1437 (2020). https://doi.org/10.1007/s11227-017-2189-1
DOI: https://doi.org/10.1007/s11227-017-2189-1