ABSTRACT
In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, and they may severely impair the quality of data analysis. The few previous studies on cleaning disguised missing data rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers.
We recently studied the problem of cleaning disguised missing data systematically and proposed an effective heuristic approach [2]. In this paper, we describe a demonstration of DiMaC, a <u>Di</u>sguised <u>M</u>issing D<u>a</u>ta <u>C</u>leaning system that can find the frequently used disguise values in data sets without requiring any domain background knowledge. In this demo, we will show (1) the critical techniques for finding suspicious disguise values; (2) the architecture and user interface of the DiMaC system; (3) an empirical case study on both real and synthetic data sets, which verifies the effectiveness and efficiency of the techniques; and (4) some challenges arising from real applications and several directions for future work.
- D. DesJardins. Outliers, inliers, and just plain liars - new graphical EDA+ (EDA Plus) techniques for understanding data. In Proc. SAS User's Group International Conference (SUGI26), Long Beach, CA, 2001.
- M. Hua and J. Pei. Cleaning disguised missing data: a heuristic approach. In KDD, pages 950--958, 2007.
- B. Kégl and L. Wang. Boosting on manifolds: Adaptive regularization of base classifiers. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 665--672, Cambridge, MA, 2005. MIT Press.
- S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191--201, 1995.
- R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
- R. Pearson. Mining imperfect data: Dealing with contamination and incomplete records. In Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, April 2005.
- R. K. Pearson. Data mining in the face of contaminated and incomplete records. In Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, April 2002.
- R. K. Pearson. The problem of disguised missing data. ACM SIGKDD Explorations, 8(1):83--92, 2006.
- G. Webb. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397--417, 1996.
Index Terms
- DiMaC: a system for cleaning disguised missing data