Abstract
Data mining with incomplete survey data is an immature subject area. When a database with incomplete data is mined, the patterns of missing data and the potential implications of those missing data constitute valuable knowledge. This paper presents the conceptual foundations of mining incomplete data through classification that is relevant to a specific decision-making problem. The proposed technique generally supposes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision-making problem, rather than to estimate individual missing values. With this technique, a set of complete data is used to acquire a near-optimal classifier, which provides the prediction reference information for analyzing the incomplete data. The data-missing behavior concealed in the missing data is then revealed. The paper demonstrates the usefulness of this technique on a real-world survey data set.
Abbreviations
- BPNN: Layered back-propagation neural networks
- CCR: Correct classification rate
- HMDA: Home Mortgage Disclosure Act
- LDA: Linear discriminant analysis
- MI: Multiple imputation
- MSA: Metropolitan Statistical Area
- C: Classifier
- RC: A set of reference information
- RM: A set of classification results for SM_k
- S: A set of survey data with incomplete data
- SC: A data set with complete data
- SC_Test: Sub-set of SC for test of the classifier
- SC_Train: Sub-set of SC for training of the classifier
- SM: A set of survey data with missing values
- SM_k: Data sets with artificial imputation values for the missing values
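The workflow outlined in the abstract, expressed with the symbols listed above, might be sketched roughly as follows. This is an illustration under several assumptions and not the paper's implementation: the classifier C is a simple nearest-centroid stand-in (the abbreviations suggest the paper works with LDA and BPNN), the survey data are synthetic, each SM_k is built by filling the missing feature with the minimum or maximum observed value, and RC is taken to be the classifier's predictions on the complete data. All function names are hypothetical.

```python
import random
from collections import Counter

def split_complete(S):
    """Split survey records S into SC (no missing features) and SM (some missing)."""
    SC = [r for r in S if None not in r[0]]
    SM = [r for r in S if None in r[0]]
    return SC, SM

def train_centroids(SC_train):
    """Nearest-centroid classifier C (a stand-in, not the paper's LDA or BPNN)."""
    sums, counts = {}, Counter()
    for x, y in SC_train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(C, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    return min(C, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, C[y])))

def ccr(C, data):
    """Correct classification rate (CCR) of C on labelled records."""
    return sum(classify(C, x) == y for x, y in data) / len(data)

def impute(SM, fill):
    """Build one SM_k: replace every missing value of feature i with fill[i]."""
    return [([fill[i] if v is None else v for i, v in enumerate(x)], y)
            for x, y in SM]

# Synthetic survey S: two numeric features, binary decision label.
random.seed(0)
S = []
for _ in range(200):
    y = random.choice([0, 1])
    x = [random.gauss(2.0 * y, 0.5), random.gauss(2.0 * y, 0.5)]
    # Assumed missingness: class-1 respondents skip feature 1 more often.
    if random.random() < (0.4 if y == 1 else 0.1):
        x[1] = None
    S.append((x, y))

SC, SM = split_complete(S)
cut = int(0.7 * len(SC))
SC_train, SC_test = SC[:cut], SC[cut:]
C = train_centroids(SC_train)
test_ccr = ccr(C, SC_test)
print("CCR on SC_test:", round(test_ccr, 3))

# RC (assumed form): class distribution predicted by C on the complete data.
RC = Counter(classify(C, x) for x, _ in SC)
print("RC:", dict(RC))

# One SM_k per artificial fill value; RM holds the classification results.
observed = [x[1] for x, _ in SC]
for k, v in enumerate((min(observed), max(observed))):
    SM_k = impute(SM, [None, v])  # only feature 1 is ever missing here
    RM = Counter(classify(C, x) for x, _ in SM_k)
    print(f"SM_{k}: fill={v:.2f} RM={dict(RM)}")
```

Comparing each RM against RC shows how far the predicted class mix of the incomplete records can move under different artificial fill values. That spread is one way to read the data-missing behavior without committing to a single estimate for any individual missing value.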
Cite this article
Wang, H., Wang, S. Mining incomplete survey data through classification. Knowl Inf Syst 24, 221–233 (2010). https://doi.org/10.1007/s10115-009-0245-8
DOI: https://doi.org/10.1007/s10115-009-0245-8