research-article

NADEEF: a commodity data cleaning system

Authors: Michele Dallachiesa, Ahmed Elmagarmid, Mourad Ouzzani, Nan Tang

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 541 - 552

https://doi.org/10.1145/2463676.2465327

Published: 22 June 2013

Abstract

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and repair of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform, similar to general-purpose DBMSs, that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed so that cleaning algorithms can cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
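
The split between a programming interface and a core can be illustrated with a small sketch. This is not NADEEF's actual API: the names `Rule`, `Violation`, `Fix`, `FunctionalDependencyRule` and `clean` are hypothetical, and the majority-value repair is a stand-in chosen for brevity, not one of the two repair algorithms the paper showcases. The sketch only shows the general idea that a rule says what is wrong and (optionally) how to fix it, while the core calls rules without knowing their type.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, str]


@dataclass(frozen=True)
class Violation:
    """A set of cells (row index, column) that jointly break a rule."""
    rule: str
    cells: Tuple[Tuple[int, str], ...]


@dataclass(frozen=True)
class Fix:
    """A candidate repair: set one cell to a new value."""
    row: int
    column: str
    value: str


class Rule(ABC):
    """Hypothetical rule interface: what is wrong, and possibly how to repair it."""
    name: str

    @abstractmethod
    def detect(self, table: List[Row]) -> List[Violation]:
        ...

    def repair(self, table: List[Row], violation: Violation) -> List[Fix]:
        # Repair is optional; a rule may only detect.
        return []


class FunctionalDependencyRule(Rule):
    """FD lhs -> rhs: rows that agree on lhs must agree on rhs."""

    def __init__(self, lhs: str, rhs: str) -> None:
        self.lhs, self.rhs = lhs, rhs
        self.name = f"FD {lhs} -> {rhs}"

    def detect(self, table: List[Row]) -> List[Violation]:
        groups = defaultdict(list)
        for i, row in enumerate(table):
            groups[row[self.lhs]].append(i)
        violations = []
        for rows in groups.values():
            if len({table[i][self.rhs] for i in rows}) > 1:
                cells = tuple((i, self.rhs) for i in rows)
                violations.append(Violation(self.name, cells))
        return violations

    def repair(self, table: List[Row], violation: Violation) -> List[Fix]:
        # Assumed strategy for illustration: adopt the most frequent value.
        counts = Counter(table[i][col] for i, col in violation.cells)
        target = counts.most_common(1)[0][0]
        return [Fix(i, col, target)
                for i, col in violation.cells if table[i][col] != target]


def clean(table: List[Row], rules: List[Rule]) -> List[Row]:
    """Toy core: treats every rule as a black box and applies its fixes to a copy."""
    cleaned = [dict(row) for row in table]
    for rule in rules:
        for violation in rule.detect(cleaned):
            for fix in rule.repair(cleaned, violation):
                cleaned[fix.row][fix.column] = fix.value
    return cleaned
```

Because `clean` only calls `detect` and `repair`, a different rule type could be added by subclassing `Rule` without touching the core. The toy core applies rules one after another, which is a simplification of the holistic handling of multiple rules that the abstract describes.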

Index Terms

NADEEF: a commodity data cleaning system
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Relational database model
    2. Query languages
      1. Relational database query languages

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN: 9781450320375

DOI: 10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 255

Total Downloads: 1,405

Downloads (last 12 months): 123

Downloads (last 6 weeks): 15

Reflects downloads up to 11 Feb 2025

