Abstract
Incomplete data are common and can degrade statistical inference, which in turn often affects evidence-based policymaking. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), a national data asset of the Australian Government. This paper is motivated by helping BLADE practitioners select and implement advanced imputation methods with a solid understanding of the impact that different methods have on data accuracy and reliability. We implement and examine the performance of data imputation techniques based on 12 machine learning algorithms, ranging from linear regression to neural networks. We compare the performance of these algorithms and assess the impact of various settings, including the number of input features and the length of time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain high imputation performance relative to the benchmark linear regression across a range of performance metrics. Among them, we recommend the extra trees regressor for its accuracy and computational efficiency.
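As a rough illustration of the kind of comparison described in the abstract, the sketch below imputes a masked target feature with a linear regression benchmark and with an extra trees regressor, both from scikit-learn. It is not the authors' pipeline. The data are synthetic stand-ins rather than BLADE records, the values are hidden completely at random, and the helper `impute`, the variable names, the 20% missing rate and the model settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT BLADE): three input features and one
# target feature that depends on them nonlinearly.
n = 600
X = rng.uniform(0.0, 10.0, size=(n, 3))
y_true = X[:, 0] * X[:, 1] + 5.0 * np.sin(X[:, 2]) + rng.normal(0.0, 0.5, size=n)

# Hide roughly 20% of the target values at random (assumed rate).
missing = rng.random(n) < 0.2
y_obs = y_true.copy()
y_obs[missing] = np.nan


def impute(model, X, y):
    """Fit on rows where y is observed, then fill rows where y is missing."""
    observed = ~np.isnan(y)
    model.fit(X[observed], y[observed])
    y_filled = y.copy()
    y_filled[~observed] = model.predict(X[~observed])
    return y_filled


models = {
    "linear_regression": LinearRegression(),
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

# Score each model only on the values that were hidden.
rmse = {}
for name, model in models.items():
    y_filled = impute(model, X, y_obs)
    mse = mean_squared_error(y_true[missing], y_filled[missing])
    rmse[name] = float(np.sqrt(mse))
    print(f"{name}: RMSE on masked values = {rmse[name]:.3f}")
```

Scoring only the masked entries keeps the comparison focused on imputation quality, since the observed entries are copied through unchanged. Results on this toy data say nothing about how the algorithms rank on BLADE.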
Notes
1. The 3 input features are Capital Expenditure, Wages and FTE/Turnover (depending on the target feature). The 7 input features include the preceding features in addition to Export Sales, Imported Goods with Deferred GST, Non-Capital Purchases and Headcount. The 14 input features include all preceding features and GST on Purchases, GST on Sales, Other GST-free Sales, Amount Withheld from Salary, PAYG Tax Withheld, Amount Withheld from Payments and Amount Withheld from Investments.
Copyright information
© 2019 Crown
About this paper
Cite this paper
Suresh, M., Taib, R., Zhao, Y., Jin, W. (2019). Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_18
DOI: https://doi.org/10.1007/978-3-030-35288-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer Science, Computer Science (R0)