Data mining with incomplete survey data is an immature subject area. When mining a database with incomplete data, the patterns of missing data and the potential implications of those missing data constitute valuable knowledge. This paper presents the conceptual foundations of data mining with incomplete data through classification that is relevant to a specific decision-making problem. The proposed technique assumes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision-making problem, rather than to estimate individual missing values. In this technique, a set of complete data is used to acquire a near-optimal classifier, which provides the prediction reference information for analyzing the incomplete data. The data-missing behavior concealed in the missing data is then revealed. Using a real-world survey data set, the paper demonstrates the usefulness of this technique.
- CCR:
Correct classification rate
- HMDA:
Home Mortgage Disclosure Act
- LDA:
Linear discriminant analysis
- MI:
Multiple imputation
- MSA:
Metropolitan Statistical Area
- C :
- RC :
A set of reference information
- RM :
A set of classification results for SM k
- S :
A set of survey data with incomplete data
- SC :
A data set with complete data
- SC Test :
Sub-set of SC for test of the classifier
- SC Train :
Sub-set of SC for training of the classifier
- SM :
A set of survey data with missing values
- SM k :
Data sets with artificial imputation values for the missing values
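The notation above can be read as a small pipeline: train a classifier on SC Train, measure its correct classification rate (CCR) on SC Test as reference information, then classify each artificially imputed data set SM k and compare the results. The following is a minimal sketch of that reading, not the authors' implementation. It rests on several assumptions that do not come from the paper: the toy records, the candidate fill values, the use of a single CCR as the content of RC and RM, and a nearest-centroid classifier used as a simple stand-in for the classifier.

```python
# Illustrative sketch only: toy data and a nearest-centroid classifier are
# assumptions, not the data or the classifier used in the paper.

def train_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    by_class = {}
    for x, y in zip(rows, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [sum(col) / len(col) for col in zip(*xs)]
            for y, xs in by_class.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))

def ccr(centroids, rows, labels):
    """Correct classification rate: fraction of records classified correctly."""
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)

def impute(rows, fill):
    """Build one SM k: replace every missing value (None) with `fill`."""
    return [[fill if v is None else v for v in x] for x in rows]

# SC: complete records, split into training and test subsets (hypothetical).
SC_train = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
SC_train_labels = [0, 0, 0, 1, 1, 1]
SC_test = [[1, 1], [5, 5]]
SC_test_labels = [0, 1]

# SM: records whose second attribute is missing (hypothetical).
SM = [[0, None], [1, None], [4, None], [5, None]]
SM_labels = [0, 0, 1, 1]

classifier = train_centroids(SC_train, SC_train_labels)

# RC: reference information, here the CCR on the held-out complete data.
RC = ccr(classifier, SC_test, SC_test_labels)

# RM: classification results for each SM k, keyed by the imputed value k.
RM = {k: ccr(classifier, impute(SM, k), SM_labels) for k in (0, 5)}

print("RC:", RC)
print("RM:", RM)
```

In this reading, a gap between RC and the entries of RM that varies with the imputed value hints that the incomplete records behave differently from the complete ones, which is the kind of data-missing pattern the abstract describes.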
Wang, H., Wang, S. Mining incomplete survey data through classification. Knowl Inf Syst 24, 221–233 (2010). https://doi.org/10.1007/s10115-009-0245-8