demonstration

DiMaC: a system for cleaning disguised missing data

Authors:
Ming Hua, Simon Fraser University, Burnaby, BC, Canada
Jian Pei, Simon Fraser University, Burnaby, BC, Canada

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 2008, Pages 1263–1266
https://doi.org/10.1145/1376616.1376751

Published: 09 June 2008

ABSTRACT

In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, and they may severely impair the quality of data analysis. The few previous studies on cleaning disguised missing data rely heavily on domain background knowledge in specific applications, and may not work well when the disguise values are inliers.

Recently, we studied the problem of cleaning disguised missing data systematically and proposed an effective heuristic approach [2]. In this paper, we describe a demonstration of DiMaC, a Disguised Missing Data Cleaning system that can find the frequently used disguise values in data sets without requiring any domain background knowledge. In this demo, we will show (1) the critical techniques for finding suspicious disguise values; (2) the architecture and user interface of the DiMaC system; (3) an empirical case study on both real and synthetic data sets, which verifies the effectiveness and efficiency of the techniques; and (4) some challenges arising from real applications and several directions for future work.
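To make the notion of a disguise value concrete, the following minimal sketch flags attribute values that account for an unusually large share of a small table. It is only an illustration of why frequently used default entries draw suspicion; it is not the heuristic of [2] or the technique implemented in DiMaC. The function name `frequent_value_candidates`, the `min_share` threshold, and the sample records (including the default date `1970-01-01`) are all hypothetical.

```python
from collections import Counter


def frequent_value_candidates(rows, attribute, min_share=0.3):
    """Return {value: share} for values of `attribute` whose share of all
    rows is at least `min_share`.

    Naive illustration only: a high share marks a value as worth a closer
    look, it does not prove the value is a disguise for missing data.
    """
    values = [row[attribute] for row in rows]
    counts = Counter(values)
    total = len(values)
    return {
        value: count / total
        for value, count in counts.items()
        if count / total >= min_share
    }


# Hypothetical web-form records: users who skip the birth date field
# are stored with the form's default value "1970-01-01".
rows = [
    {"customer": "c1", "birth_date": "1970-01-01"},
    {"customer": "c2", "birth_date": "1984-03-12"},
    {"customer": "c3", "birth_date": "1970-01-01"},
    {"customer": "c4", "birth_date": "1991-11-30"},
    {"customer": "c5", "birth_date": "1970-01-01"},
    {"customer": "c6", "birth_date": "1975-07-04"},
]

candidates = frequent_value_candidates(rows, "birth_date")
print(candidates)
```

A plain frequency check like this would also flag values that are legitimately common, which is one reason a more careful heuristic is needed in practice.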

References

[1] D. DesJardins. Outliers, inliers, and just plain liars - new graphical EDA+ (EDA Plus) techniques for understanding data. In Proc. SAS User's Group International Conference (SUGI26), Long Beach, CA, 2001.
[2] M. Hua and J. Pei. Cleaning disguised missing data: a heuristic approach. In KDD, pages 950--958, 2007.
[3] B. Kégl and L. Wang. Boosting on manifolds: Adaptive regularization of base classifiers. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 665--672, Cambridge, MA, 2005. MIT Press.
[4] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191--201, 1995.
[5] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
[6] R. Pearson. Mining imperfect data: Dealing with contamination and incomplete records. In Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, April 2005.
[7] R. K. Pearson. Data mining in the face of contaminated and incomplete records. In Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, April 2002.
[8] R. K. Pearson. The problem of disguised missing data. ACM SIGKDD Explorations, 8(1):83--92, 2006.
[9] G. Webb. Further experimental evidence against the utility of Occam's razor. The Journal of Artificial Intelligence Research, 4:397--417, 1996.

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN: 9781605581026
DOI: 10.1145/1376616
General Chairs: Laks V. S. Lakshmanan (University of British Columbia, Canada), Raymond T. Ng (University of British Columbia, Canada), Dennis Shasha (New York University, USA)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data cleaning
data quality
disguised missing data
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%