Abstract
Missing readings from the sensors of air pollution monitoring stations are a common problem, and they can substantially degrade the monitoring and analysis of air pollution data. To address this problem, this paper proposes a multi-view missing value (MV) imputation method, MVDI (Multi-View Data Imputation), for air pollution time series data. MVDI combines four models, namely LSTM (Long Short-Term Memory), IDS (Inverse Distance Squared), SVR (Support Vector Regressor), and KNN (K-Nearest Neighbors), to estimate MVs. These four models are mainly employed to capture the variations in the data from different views of the dataset, where each view is a different portion (subset) of the actual dataset. The MV estimates from all views are combined using a kernel function to obtain an overall result. MVDI is evaluated on a real-world air pollution dataset in terms of RMSE, MAE, MAPE, and R2. The experimental results show that MVDI outperforms the baseline methods, namely AR (AutoRegressive), ARIMA (AutoRegressive Integrated Moving Average), RFR (Random Forest Regressor), ANN (Artificial Neural Network), LI (Linear Interpolation), NN (Nearest Neighbors), MI (Mean Imputation), CNN (Convolutional Neural Network), and ConvLSTM (Convolutional LSTM).
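As a rough illustration of two ingredients named above, the following sketch shows a generic inverse distance squared estimate and the four evaluation metrics in their standard textbook forms. It is not the authors' implementation: the function names `ids_estimate` and `imputation_errors`, the input layout, and the toy numbers are all assumptions made for this example, and the kernel-based fusion of the four views is not shown because the abstract does not specify it.

```python
import numpy as np


def ids_estimate(distances, readings):
    """Generic Inverse Distance Squared estimate (hypothetical helper).

    Each neighbouring station's reading is weighted by 1 / d^2, where d is
    its distance to the station with the missing value. Distances are
    assumed to be strictly positive.
    """
    d = np.asarray(distances, dtype=float)
    x = np.asarray(readings, dtype=float)
    w = 1.0 / d**2
    return float(np.sum(w * x) / np.sum(w))


def imputation_errors(y_true, y_pred):
    """Standard RMSE, MAE, MAPE (in percent), and R2 (hypothetical helper).

    MAPE assumes that no true value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(err**2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Toy usage: two neighbouring stations at distances 1 and 2 (arbitrary units).
estimate = ids_estimate([1.0, 2.0], [10.0, 20.0])
scores = imputation_errors([2.0, 4.0], [1.0, 5.0])
```

In this toy case the nearer station dominates the estimate because its weight is four times larger, which is the intended effect of squaring the distance.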
Data availability
The participatory sensing data is stored in a private repository and will be available on request.
Acknowledgements
The research work of Asif Iqbal Middya is funded by the NET-JRF (National Eligibility Test-Junior Research Fellowship) scheme of the University Grants Commission, Government of India. This research work is supported by the project entitled “Development of AI/ML based predictive models for association analysis of risk factors and high granular forecasting for air pollutants”, funded by MoE-STARS, IISC.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Middya, A.I., Roy, S. Multiview data fusion technique for missing value imputation in multisensory air pollution dataset. J Ambient Intell Human Comput 15, 3173–3191 (2024). https://doi.org/10.1007/s12652-024-04816-9
DOI: https://doi.org/10.1007/s12652-024-04816-9