Mining incomplete survey data through classification

Wang, Hai; Wang, Shouhong

doi:10.1007/s10115-009-0245-8

Regular Paper
Published: 20 August 2009

Volume 24, pages 221–233, (2010)

Knowledge and Information Systems

Hai Wang¹ &
Shouhong Wang²

Abstract

Data mining with incomplete survey data is an immature subject area. When a database with incomplete data is mined, the patterns of missing data and the potential implications of those missing data constitute valuable knowledge. This paper presents the conceptual foundations of mining incomplete data through classification that is relevant to a specific decision-making problem. The proposed technique supposes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision-making problem, rather than to estimate individual missing values. In this technique, a set of complete data is used to acquire a near-optimal classifier. The classifier provides the prediction reference information for analyzing the incomplete data, which then reveals the data-missing behavior concealed in the missing data. The paper demonstrates the usefulness of the technique on a real-world survey data set.

Abbreviations

BPNN: Layered back-propagation neural networks
CCR: Correct classification rate
HMDA: Home Mortgage Disclosure Act
LDA: Linear discriminant analysis
MI: Multiple imputation
MSA: Metropolitan Statistical Area
C: Classifier
RC: A set of reference information
RM: A set of classification results for SM_k
S: A set of survey data with incomplete data
SC: A data set with complete data
SC_Test: Sub-set of SC for test of the classifier
SC_Train: Sub-set of SC for training of the classifier
SM: A set of survey data with missing values
SM_k: Data sets with artificial imputation values for the missing values
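
The workflow outlined in the abstract can be sketched with the set names listed above. The following is only a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, a nearest-centroid classifier stands in for the LDA and BPNN classifiers named in the abbreviations, and the helper names (fit_centroids, predict, correct_classification_rate, impute) and the candidate imputation values are hypothetical. It trains C on SC_Train, measures the CCR on SC_Test, builds one SM_k per artificial imputation value k, and collects the classification results RM.

```python
import random
from collections import Counter


def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])),
    )


def correct_classification_rate(centroids, rows, labels):
    """CCR: share of records whose predicted class equals the true class."""
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def impute(rows, value):
    """Build an SM_k: replace every missing entry (None) with the artificial value."""
    return [[value if v is None else v for v in row] for row in rows]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic SC (complete data): class "A" near (0, 0), class "B" near (3, 3).
SC = [([rng.gauss(0, 1), rng.gauss(0, 1)], "A") for _ in range(100)]
SC += [([rng.gauss(3, 1), rng.gauss(3, 1)], "B") for _ in range(100)]
rng.shuffle(SC)
SC_train, SC_test = SC[:150], SC[150:]

# C: classifier acquired from the complete data only.
C = fit_centroids([x for x, _ in SC_train], [y for _, y in SC_train])
ccr = correct_classification_rate(C, [x for x, _ in SC_test], [y for _, y in SC_test])

# Synthetic SM (incomplete data): the second attribute is missing in every record.
SM = [[rng.gauss(1.5, 1.5), None] for _ in range(40)]

# RM: predicted class counts for each SM_k (assumed candidate values k).
RM = {}
for k in (0.0, 1.5, 3.0):
    SM_k = impute(SM, k)
    RM[k] = Counter(predict(C, row) for row in SM_k)

print(f"CCR on SC_Test: {ccr:.2f}")
for k, counts in RM.items():
    print(f"k={k}: {dict(counts)}")
```

Comparing the class counts in RM across the values of k gives one rough view of how the incomplete records would be classified under different assumptions about what is missing. The point of such a comparison is the pattern across the SM_k sets, not a best guess for any single missing value.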

Author information

Authors and Affiliations

Sobey School of Business, Saint Mary’s University, Halifax, NS, B3H 2W3, Canada
Hai Wang
Charlton College of Business, University of Massachusetts Dartmouth, Dartmouth, MA, 02747-2300, USA
Shouhong Wang

Corresponding author

Correspondence to Shouhong Wang.

About this article

Cite this article

Wang, H., Wang, S. Mining incomplete survey data through classification. Knowl Inf Syst 24, 221–233 (2010). https://doi.org/10.1007/s10115-009-0245-8

Received: 04 October 2008
Revised: 14 May 2009
Accepted: 27 June 2009
Published: 20 August 2009
Issue Date: August 2010
DOI: https://doi.org/10.1007/s10115-009-0245-8
