DOI: 10.1145/2463676.2465327
research-article

NADEEF: a commodity data cleaning system

Published: 22 June 2013

Abstract

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end, off-the-shelf solution to (semi-)automate the detection and repair of violations with respect to a set of heterogeneous and ad hoc quality constraints. In short, there is no commodity platform, similar to general-purpose DBMSs, that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized, and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs, and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed so that cleaning algorithms can cope with multiple rules holistically, i.e., detect and repair data errors without differentiating between the various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
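The split between a rule-facing programming interface and a rule-agnostic core can be illustrated with a minimal sketch. Everything below is hypothetical and is not NADEEF's actual API: the names `Rule`, `Violation`, `Fix`, `FunctionalDependency`, `NotEmpty`, and `clean` are invented for illustration, and the repair loop (apply every suggested fix, then re-detect) is a naive stand-in rather than either of the two core repairing algorithms the abstract mentions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


@dataclass(frozen=True)
class Violation:
    rule: str                # name of the rule that was violated
    cells: Tuple[Cell, ...]  # cells that jointly cause the violation


@dataclass(frozen=True)
class Fix:
    cell: Cell
    value: str               # suggested new value for the cell


class Rule(ABC):
    """Hypothetical rule interface: what is wrong, and possibly how to fix it."""
    name: str

    @abstractmethod
    def detect(self, rows: List[Row]) -> List[Violation]:
        ...

    def repair(self, rows: List[Row], violation: Violation) -> List[Fix]:
        return []  # repair is optional; detection-only rules keep this default


class FunctionalDependency(Rule):
    """lhs -> rhs: rows that agree on lhs must agree on rhs."""

    def __init__(self, lhs: str, rhs: str):
        self.lhs, self.rhs = lhs, rhs
        self.name = f"FD({lhs}->{rhs})"

    def detect(self, rows):
        found = []
        for (i, a), (j, b) in combinations(enumerate(rows), 2):
            if a[self.lhs] == b[self.lhs] and a[self.rhs] != b[self.rhs]:
                found.append(Violation(self.name, ((i, self.rhs), (j, self.rhs))))
        return found

    def repair(self, rows, violation):
        # Naive assumption: trust the earlier row and copy its value.
        (i, col), (j, _) = violation.cells
        return [Fix((j, col), rows[i][col])]


class NotEmpty(Rule):
    """Detection-only rule: flags empty values but proposes no fix."""

    def __init__(self, column: str):
        self.column = column
        self.name = f"NotEmpty({column})"

    def detect(self, rows):
        return [Violation(self.name, ((i, self.column),))
                for i, row in enumerate(rows) if not row[self.column]]


def clean(rows: List[Row], rules: List[Rule], max_rounds: int = 10):
    """Rule-agnostic core: it only ever calls detect() and repair()."""
    rows = [dict(r) for r in rows]  # work on a copy of the input
    for _ in range(max_rounds):
        violations = [(rule, v) for rule in rules for v in rule.detect(rows)]
        fixes = [f for rule, v in violations for f in rule.repair(rows, v)]
        if not fixes:
            return rows, [v for _, v in violations]
        for fix in fixes:
            i, col = fix.cell
            rows[i][col] = fix.value
    return rows, [v for rule in rules for v in rule.detect(rows)]
```

A possible use, with made-up records: calling `clean(rows, [FunctionalDependency("zip", "city"), NotEmpty("city")])` on rows that share a zip code but disagree on the city rewrites the later row's city, and returns any empty-city violation as unresolved because that rule offers no repair. The point of the sketch is that `clean` never inspects the rule type, which is one way to read the "black-box" treatment described above.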



Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN:9781450320375
DOI:10.1145/2463676

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. conditional functional dependency
  2. data cleaning
  3. etl
  4. matching dependency

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'13

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 123
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 11 Feb 2025

Cited By

  • (2024) Efficient Differential Dependency Discovery. Proceedings of the VLDB Endowment 17:7 (1552-1564). DOI: 10.14778/3654621.3654624. Online publication date: 30-May-2024
  • (2024) Mobility Data Science: Perspectives and Challenges. ACM Transactions on Spatial Algorithms and Systems 10:2 (1-35). DOI: 10.1145/3652158. Online publication date: 1-Jul-2024
  • (2024) Cleenex: Support for User Involvement During an Iterative Data Cleaning Process. Journal of Data and Information Quality. DOI: 10.1145/3648476. Online publication date: 15-Feb-2024
  • (2024) JsonCurer: Data Quality Management for JSON Based on an Aggregated Schema. IEEE Transactions on Visualization and Computer Graphics 30:6 (3008-3021). DOI: 10.1109/TVCG.2024.3388556. Online publication date: Jun-2024
  • (2024) Preliminary Guidelines for Combining Data Integration and Visual Data Analysis. IEEE Transactions on Visualization and Computer Graphics 30:10 (6678-6690). DOI: 10.1109/TVCG.2023.3334513. Online publication date: Oct-2024
  • (2024) Mitigating Data Sparsity in Integrated Data through Text Conceptualization. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3490-3504). DOI: 10.1109/ICDE60146.2024.00269. Online publication date: 13-May-2024
  • (2024) BClean: A Bayesian Data Cleaning System. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3407-3420). DOI: 10.1109/ICDE60146.2024.00263. Online publication date: 13-May-2024
  • (2024) Optimized Continuous Quality and Storage Management Model for Big Data Analysis. 2024 2nd International Conference on Advancements and Key Challenges in Green Energy and Computing (AKGEC) (1-6). DOI: 10.1109/AKGEC62572.2024.10868899. Online publication date: 21-Nov-2024
  • (2024) CrowdDA: Difficulty-aware crowdsourcing task optimization for cleaning web tables. Expert Systems with Applications 238 (122139). DOI: 10.1016/j.eswa.2023.122139. Online publication date: Mar-2024
  • (2024) Computing Minimum Subset Repair on Incomplete Data. Web and Big Data (444-459). DOI: 10.1007/978-981-97-7238-4_28. Online publication date: 28-Aug-2024
