Research article
DOI: 10.1145/2428736.2428793

Imputation of missing values for semi-supervised data using the proximity in random forests

Published: 03 December 2012

Abstract

This paper presents a procedure that imputes missing values in semi-supervised data by using random forests. We found that our method achieves a higher rate of correct classification than the other methods compared: a simple extension of Liaw's "rfImpute" for (un)supervised data and the k-nearest neighbor method (kNN). Our method can handle missing predictor variables as well as a missing response variable. Imputation that uses random forests for semi-supervised cases in the training data set has not been implemented until now.
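The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea behind proximity-based imputation, not the authors' algorithm. It assumes a proximity matrix is already available (in a random forest, the proximity of two cases is typically the fraction of trees in which they fall in the same terminal node). The function name `impute_with_proximity`, the toy data, and the proximity values are all hypothetical. Missing numeric values are replaced by a proximity-weighted average of the observed values, and missing categorical values (which could include an unobserved class label) by a proximity-weighted vote.

```python
from collections import defaultdict


def impute_with_proximity(rows, proximity):
    """Fill None entries using proximity-weighted donor rows.

    rows: list of equal-length lists; None marks a missing value.
    proximity: square matrix; proximity[i][j] is the similarity of
               rows i and j (assumed to be given, e.g. by a forest).
    """
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for c in range(n_cols):
            if row[c] is not None:
                continue
            # Donors are the other rows that have column c observed.
            donors = [(proximity[i][j], rows[j][c])
                      for j in range(len(rows))
                      if j != i and rows[j][c] is not None]
            total = sum(w for w, _ in donors)
            if total == 0:
                continue  # no usable donor: leave the value missing
            if all(isinstance(v, (int, float)) for _, v in donors):
                # Numeric column: proximity-weighted average.
                filled[i][c] = sum(w * v for w, v in donors) / total
            else:
                # Categorical column: proximity-weighted vote.
                votes = defaultdict(float)
                for w, v in donors:
                    votes[v] += w
                filled[i][c] = max(votes, key=votes.get)
    return filled


# Hypothetical toy data: column 0 is numeric, column 1 is a class label.
rows = [
    [1.0, "a"],
    [3.0, "b"],
    [None, "a"],   # missing predictor
    [2.0, None],   # missing label (an unlabeled case)
]
# Hypothetical symmetric proximity matrix for the four rows.
proximity = [
    [1.00, 0.00, 0.75, 0.25],
    [0.00, 1.00, 0.25, 0.75],
    [0.75, 0.25, 1.00, 0.25],
    [0.25, 0.75, 0.25, 1.00],
]

filled = impute_with_proximity(rows, proximity)
print(filled)
```

In practice the proximities themselves depend on a forest grown on already-filled data, so procedures of this kind usually start from a rough fill and alternate between growing the forest and re-imputing for a few iterations; that loop is omitted here.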

References

[1] R. A. Becker, J. M. Chambers and A. R. Wilks, The New S Language, Wadsworth & Brooks/Cole, 1988.
[2] L. Breiman, Random Forests, Machine Learning, 45 (1), 5-32, 2001.
[3] L. Breiman, Manual for Setting Up, Using, and Understanding Random Forest V4.0, http://oz.berkeley.edu/users/breiman/Using_random_forests_v4.0.pdf, 2003.
[4] L. Breiman and A. Cutler, Random Forests, http://www.stat.berkeley.edu/~breiman/RandomForests/, updated March 3, 2004.
[5] CRAN, Package randomForest, http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
[6] A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, 2007.
[7] k-Nearest Neighbour Classification, R Documentation, knn {class}, http://stat.ethz.ch/R-manual/R-patched/library/class/html/knn.html
[8] A. Liaw, Missing Value Imputations by randomForest, R Documentation, http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/randomForest/html/rfImpute.html
[9] N. L. Crookston and A. O. Finley, yaImpute: An R Package for kNN Imputation, Journal of Statistical Software, 23 (10), Jan 2008.
[10] The R Project for Statistical Computing, http://www.r-project.org/
[11] UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html

Published In

IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
December 2012
432 pages
ISBN: 9781450313063
DOI: 10.1145/2428736

Sponsors

• @WAS: International Organization of Information Integration and Web-based Applications and Services

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. k-nearest neighbor
2. R
3. data imputation
4. ensemble learning
5. impute.knn
6. missing data
7. rfImpute

Cited By

• (2024) Prediction of Soil Moisture From Near-Global Cygnss Gnss-Reflectometry Using a Random Forest Machine Learning Model. IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, pp. 4465-4471. DOI: 10.1109/IGARSS53475.2024.10642723. Online publication date: 7-Jul-2024.
• (2022) Machine Learning-Enabled Internet of Things (IoT): Data, Applications, and Industry Perspective. Electronics, 11:17 (2676). DOI: 10.3390/electronics11172676. Online publication date: 26-Aug-2022.
• (2020) Digital Soil Mapping over Large Areas with Invalid Environmental Covariate Data. ISPRS International Journal of Geo-Information, 9:2 (102). DOI: 10.3390/ijgi9020102. Online publication date: 6-Feb-2020.
• (2014) Investigations into Missing Values Imputation Using Random Forests for Semi-supervised Data. Proceedings of the 16th International Conference on Information Integration and Web-based Applications & Services, pp. 296-301. DOI: 10.1145/2684200.2684288. Online publication date: 4-Dec-2014.
