Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning

  • Conference paper
AI 2019: Advances in Artificial Intelligence (AI 2019)

Abstract

Incomplete data are common and can degrade statistical inference, often affecting evidence-based policymaking. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), a national data asset of the Australian Government. Motivated by helping BLADE practitioners select and implement advanced imputation methods with a solid understanding of how different methods affect data accuracy and reliability, we implement data imputation techniques based on 12 machine learning algorithms, ranging from linear regression to neural networks, and examine their performance. We compare the algorithms and assess the impact of various settings, including the number of input features and the length of time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain higher imputation performance than the benchmark linear regression across a range of performance metrics. Among them, we recommend the extra trees regressor for its accuracy and computational efficiency.
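To make the comparison described in the abstract concrete, the following is a minimal sketch of supervised imputation: part of a target feature is hidden, each model is trained on the rows where the target is observed, and the hidden values are predicted and scored. It assumes scikit-learn and NumPy; the synthetic data, the function names `make_synthetic_data` and `compare_imputers`, the hyperparameters and the choice of mean absolute error are all illustrative assumptions, not the authors' BLADE data, settings or metrics.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def make_synthetic_data(n_rows=600, seed=0):
    """Hypothetical stand-in data: 3 input features and 1 target feature."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(n_rows, 3))
    # Deliberately non-linear target, so a linear benchmark can be beaten.
    y = (
        10.0 * np.sin(X[:, 0])
        + X[:, 1] ** 2
        + 0.5 * X[:, 2]
        + rng.normal(0.0, 0.1, n_rows)
    )
    return X, y


def compare_imputers(X, y, missing_fraction=0.2, seed=0):
    """Hide part of the target, impute it with each model, return MAE per model."""
    rng = np.random.default_rng(seed)
    missing = rng.random(len(y)) < missing_fraction  # rows treated as missing
    X_obs, y_obs = X[~missing], y[~missing]           # observed rows: training set
    X_mis, y_true = X[missing], y[missing]            # hidden truth, used only for scoring

    # Assumed hyperparameters; the paper's settings are not given in the abstract.
    models = {
        "linear_regression": LinearRegression(),
        "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=seed),
        "bagging": BaggingRegressor(n_estimators=50, random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_obs, y_obs)
        imputed = model.predict(X_mis)  # imputed values for the missing rows
        scores[name] = mean_absolute_error(y_true, imputed)
    return scores


X, y = make_synthetic_data()
scores = compare_imputers(X, y)
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.3f}")
```

Hiding known values and scoring the imputations against them is one common way to evaluate an imputer when the true missing values are unavailable; results on this toy data say nothing about performance on BLADE.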

Notes

  1. The 3 input features are Capital Expenditure, Wages and FTE/Turnover (depending on the target feature). The 7 input features include the preceding features in addition to Export Sales, Imported Goods with Deferred GST, Non-Capital Purchases and Headcount. The 14 input features include all preceding features and GST on Purchases, GST on Sales, Other GST-free sales, Amount Withheld from Salary, PAYG Tax Withheld, Amount Withheld from Payments and Amount Withheld from Investments.

Author information

Corresponding authors

Correspondence to Marcus Suresh, Ronnie Taib, Yanchang Zhao or Warren Jin.

Copyright information

© 2019 Crown

About this paper

Cite this paper

Suresh, M., Taib, R., Zhao, Y., Jin, W. (2019). Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_18

  • DOI: https://doi.org/10.1007/978-3-030-35288-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35287-5

  • Online ISBN: 978-3-030-35288-2

  • eBook Packages: Computer Science, Computer Science (R0)
