Multiview data fusion technique for missing value imputation in multisensory air pollution dataset

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Missing readings from the sensors of air pollution monitoring stations are a common problem, and they can substantially degrade the monitoring and analysis of air pollution data. To address this problem, this paper proposes MVDI (Multi-View Data Imputation), a multi-view missing value (MV) imputation method for air pollution time series data. MVDI combines four models, namely LSTM (Long Short-Term Memory), IDS (Inverse Distance Squared), SVR (Support Vector Regressor), and KNN (K-Nearest Neighbors), to estimate MVs. These four models are mainly employed to capture the variations in the data from different views of the dataset, where each view is a different portion (subset) of the actual dataset. The MV estimates from all views are combined using a kernel function to obtain an overall result. MVDI is evaluated on a real-world air pollution dataset in terms of RMSE, MAE, MAPE, and R². The experimental results show that MVDI outperforms the baseline methods, namely AR (AutoRegressive), ARIMA (AutoRegressive Integrated Moving Average), RFR (Random Forest Regressor), ANN (Artificial Neural Network), LI (Linear Interpolation), NN (Nearest Neighbors), MI (Mean Imputation), CNN (Convolutional Neural Network), and ConvLSTM (Convolutional LSTM).
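
To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of a generic inverse-distance-squared (IDS) estimate and of the four evaluation metrics (RMSE, MAE, MAPE, R²). It is not the authors' implementation: the function names, the planar station coordinates, and the sample values are assumptions made for illustration only, and the abstract does not specify how MVDI configures IDS or which kernel function fuses the views.

```python
import math


def ids_estimate(target_xy, neighbors):
    """Estimate a missing reading at target_xy by inverse-distance-squared weighting.

    neighbors is a list of ((x, y), value) pairs from stations that have a
    valid reading at the same timestamp. Planar coordinates are an assumption.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in neighbors:
        d2 = (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2
        if d2 == 0.0:
            return value  # co-located station: use its reading directly
        weight = 1.0 / d2  # closer stations get larger weights
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one neighbor with a valid reading is required")
    return weighted_sum / weight_total


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical readings at two neighboring stations.
    print(ids_estimate((0.0, 0.0), [((1.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]))
    # Hypothetical ground truth versus imputed values.
    truth, imputed = [10.0, 20.0, 30.0], [12.0, 18.0, 33.0]
    print(rmse(truth, imputed), mae(truth, imputed), mape(truth, imputed), r2(truth, imputed))
```

Lower RMSE, MAE, and MAPE indicate better imputation, while an R² closer to 1 indicates that the imputed values explain more of the variance in the held-out true values.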

Data availability

The participatory sensing data is stored in a private repository and will be available on request.

Acknowledgements

The research work of Asif Iqbal Middya is funded by the NET-JRF (National Eligibility Test-Junior Research Fellowship) scheme of the University Grants Commission, Government of India. This research work is supported by the project entitled “Development of AI/ML based predictive models for association analysis of risk factors and high granular forecasting for air pollutants”, funded by MoE-STARS, IISC.

Author information

Corresponding author

Correspondence to Sarbani Roy.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Middya, A.I., Roy, S. Multiview data fusion technique for missing value imputation in multisensory air pollution dataset. J Ambient Intell Human Comput 15, 3173–3191 (2024). https://doi.org/10.1007/s12652-024-04816-9

  • DOI: https://doi.org/10.1007/s12652-024-04816-9
