DOI: 10.1145/1376616.1376751
demonstration

DiMaC: a system for cleaning disguised missing data

Published: 09 June 2008

ABSTRACT

In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, and they may severely impair the quality of data analysis. The very limited previous studies on cleaning disguised missing data rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers.
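To make the definition concrete, here is a minimal Python sketch. The records, field names, and default values are hypothetical, and the frequency count is only a naive screen used for illustration; it is not the heuristic of [2] or the technique used in DiMaC. It shows that a check for explicit nulls misses values a user never changed from a form default, while a frequency count merely surfaces candidates.

```python
from collections import Counter

# Hypothetical web-form records; None marks an explicitly missing value.
# "1970-01-01" and "Alabama" stand in for form defaults a user never changed.
records = [
    {"birth_date": "1984-03-12", "state": "Ohio"},
    {"birth_date": "1970-01-01", "state": "Alabama"},
    {"birth_date": None,         "state": "Texas"},
    {"birth_date": "1970-01-01", "state": "Alabama"},
    {"birth_date": "1991-07-30", "state": "Alabama"},
    {"birth_date": "1970-01-01", "state": "Oregon"},
]

def explicit_missing_count(rows, attribute):
    """Count values that are explicitly marked as missing (None)."""
    return sum(1 for row in rows if row[attribute] is None)

def most_frequent_values(rows, attribute, top=1):
    """Return the `top` most frequent non-missing values with their counts."""
    counts = Counter(row[attribute] for row in rows if row[attribute] is not None)
    return counts.most_common(top)

# The null check sees only the single explicit gap in birth_date.
print(explicit_missing_count(records, "birth_date"))
# The frequency count surfaces the suspiciously common default date.
print(most_frequent_values(records, "birth_date"))
print(most_frequent_values(records, "state"))
```

A frequent value is only a candidate: "Alabama" may be the first entry in a drop-down list, or it may simply be where many customers live. Telling those cases apart without domain knowledge is the problem the abstract describes.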

Recently, we studied the problem of cleaning disguised missing data systematically and proposed an effective heuristic approach [2]. In this paper, we describe a demonstration of DiMaC, a <u>Di</u>sguised <u>M</u>issing D<u>a</u>ta <u>C</u>leaning system, which can find the frequently used disguise values in data sets without requiring any domain background knowledge. In this demo, we will show (1) the critical techniques for finding suspicious disguise values; (2) the architecture and user interface of the DiMaC system; (3) an empirical case study on both real and synthetic data sets, which verifies the effectiveness and efficiency of the techniques; and (4) some challenges arising from real applications and several directions for future work.

References

  1. D. DesJardins. Outliers, inliers, and just plain liars - new graphical EDA+ (EDA Plus) techniques for understanding data. In Proc. SAS User's Group International Conference (SUGI26), Long Beach, CA, 2001.
  2. M. Hua and J. Pei. Cleaning disguised missing data: a heuristic approach. In KDD, pages 950--958, 2007.
  3. B. Kégl and L. Wang. Boosting on manifolds: Adaptive regularization of base classifiers. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 665--672, Cambridge, MA, 2005. MIT Press.
  4. S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191--201, 1995.
  5. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
  6. R. Pearson. Mining imperfect data: Dealing with contamination and incomplete records. In Proc. 2005 SIAM Int. Conf. Data Mining, Newport Beach, CA, April 2005.
  7. R. K. Pearson. Data mining in the face of contaminated and incomplete records. In Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, April 2002.
  8. R. K. Pearson. The problem of disguised missing data. ACM SIGKDD Explorations, 8(1):83--92, 2006.
  9. G. Webb. Further experimental evidence against the utility of Occam's razor. The Journal of Artificial Intelligence Research, 4:397--417, 1996.

    • Published in

      SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data
      June 2008
      1396 pages
      ISBN: 9781605581026
      DOI: 10.1145/1376616

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
