Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning

Suresh, Marcus; Taib, Ronnie; Zhao, Yanchang; Jin, Warren

doi:10.1007/978-3-030-35288-2_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11919)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Incomplete data are common and can degrade statistical inference, which often affects evidence-based policymaking. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), an Australian Government national data asset. Our motivation is to help BLADE practitioners select and implement advanced imputation methods with a solid understanding of how different methods affect data accuracy and reliability. To that end, we implement and examine the performance of data imputation techniques based on 12 machine learning algorithms, ranging from linear regression to neural networks. We compare the performance of these algorithms and assess the impact of various settings, including the number of input features and the length of time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain high imputation performance over the benchmark linear regression across a range of performance metrics. Among them, we recommend the extra trees regressor for its accuracy and computational efficiency.
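
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic (BLADE microdata are not public), the 20% missingness rate, the hyperparameters and the choice of mean absolute error as the single metric are all assumptions made for illustration, and only two of the 12 algorithms are shown. It uses scikit-learn, treating imputation as supervised regression: fit on rows where the target is observed, then predict the rows where it is missing.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT BLADE): three input features, nonlinear target.
n = 2000
X = rng.uniform(0.0, 10.0, size=(n, 3))
y = X[:, 0] * X[:, 1] + 5.0 * np.sin(X[:, 2]) + rng.normal(0.0, 1.0, size=n)

# Hide roughly 20% of target values (assumed rate); keep the truth for scoring.
missing = rng.random(n) < 0.2
observed = ~missing


def impute_and_score(model):
    """Fit on observed rows, impute the hidden rows, return the MAE."""
    model.fit(X[observed], y[observed])
    imputed = model.predict(X[missing])
    return mean_absolute_error(y[missing], imputed)


scores = {
    # Benchmark model named in the abstract.
    "linear_regression": impute_and_score(LinearRegression()),
    # Ensemble model recommended in the abstract (hyperparameters assumed).
    "extra_trees": impute_and_score(
        ExtraTreesRegressor(n_estimators=100, random_state=0)
    ),
}

for name, mae in scores.items():
    print(f"{name}: MAE = {mae:.3f}")
```

On this deliberately nonlinear toy target the tree ensemble should impute more accurately than the linear benchmark, but that says nothing about relative performance on real business data, which is what the paper evaluates.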

Notes

1. The 3 input features are Capital Expenditure, Wages, and FTE or Turnover (depending on the target feature). The 7 input features include the preceding features in addition to Export Sales, Imported Goods with Deferred GST, Non-Capital Purchases and Headcount. The 14 input features include all preceding features and GST on Purchases, GST on Sales, Other GST-free sales, Amount Withheld from Salary, PAYG Tax Withheld, Amount Withheld from Payments and Amount Withheld from Investments.
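
The nested feature sets in this note can be written down as a small lookup, sketched below. The feature names come from the note, with each named feature listed once. The function name `input_features` is hypothetical, and so is the reading that the third base feature is whichever of FTE and Turnover is not being imputed; the note only says the choice depends on the target feature.

```python
# Feature names as given in the note; each named feature appears once.
BASE = ["Capital Expenditure", "Wages"]
EXTRA_7 = [
    "Export Sales",
    "Imported Goods with Deferred GST",
    "Non-Capital Purchases",
    "Headcount",
]
EXTRA_14 = [
    "GST on Purchases",
    "GST on Sales",
    "Other GST-free sales",
    "Amount Withheld from Salary",
    "PAYG Tax Withheld",
    "Amount Withheld from Payments",
    "Amount Withheld from Investments",
]


def input_features(n_features, target):
    """Return the 3-, 7- or 14-feature input set for a given target.

    ASSUMPTION: when imputing FTE the third base feature is Turnover,
    and vice versa. The note does not spell this out.
    """
    if target not in ("FTE", "Turnover"):
        raise ValueError("target must be 'FTE' or 'Turnover'")
    other = "Turnover" if target == "FTE" else "FTE"
    three = BASE + [other]
    seven = three + EXTRA_7
    fourteen = seven + EXTRA_14
    sets = {3: three, 7: seven, 14: fourteen}
    if n_features not in sets:
        raise ValueError("n_features must be 3, 7 or 14")
    return sets[n_features]


for size in (3, 7, 14):
    print(size, input_features(size, "Turnover"))
```

Building each set from the previous one keeps the three settings nested, so any performance difference between them can be attributed to the added features alone.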

Author information

Authors and Affiliations

Analytical Insights Division, Department of Industry, Innovation and Science, Sydney/Canberra, Australia
Marcus Suresh
CSIRO - Data61, Sydney/Canberra, Australia
Marcus Suresh, Ronnie Taib, Yanchang Zhao & Warren Jin

Corresponding authors

Correspondence to Marcus Suresh, Ronnie Taib, Yanchang Zhao or Warren Jin.

Editor information

Editors and Affiliations

University of South Australia, Adelaide, SA, Australia
Jixue Liu
The University of Melbourne, Melbourne, VIC, Australia
James Bailey

About this paper

Cite this paper

Suresh, M., Taib, R., Zhao, Y., Jin, W. (2019). Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_18

DOI: https://doi.org/10.1007/978-3-030-35288-2_18
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer Science, Computer Science (R0)
