research-article

Data Cleaning: Overview and Emerging Challenges

Authors: Sanjay Krishnan, Jiannan Wang

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 2201 - 2206

https://doi.org/10.1145/2882903.2912574

Published: 26 June 2016

Abstract

Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While such approaches have traditionally been distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts them into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
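
To make the distinction between qualitative and quantitative cleaning concrete, the following minimal sketch contrasts the two styles of error detection on a toy table. It is an illustration only and is not taken from the tutorial: the table, the column names, the functional dependency zip -> city, the helper names `fd_violations` and `mad_outliers`, and the 3.5 cutoff on the modified z-score are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median


def fd_violations(rows, lhs, rhs):
    """Qualitative check: report lhs values that map to more than one rhs value,
    i.e. violations of the (assumed) functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


def mad_outliers(values, threshold=3.5):
    """Quantitative check: flag values whose modified z-score, computed from the
    median absolute deviation (MAD), exceeds the threshold."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical dirty table: one misspelled city and one implausible salary.
rows = [
    {"zip": "94704", "city": "Berkeley", "salary": 52000},
    {"zip": "94704", "city": "Berkley", "salary": 55000},
    {"zip": "10001", "city": "New York", "salary": 58000},
    {"zip": "10001", "city": "New York", "salary": 5400000},
    {"zip": "60601", "city": "Chicago", "salary": 51000},
]

print(fd_violations(rows, "zip", "city"))
print(mad_outliers(row["salary"] for row in rows))
```

The rule-based check finds the inconsistent spelling without looking at any numeric distribution, while the statistical check finds the extreme salary without any declared constraint. Neither check says which value is correct; choosing a repair is a separate step.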

Index Terms

Data Cleaning: Overview and Emerging Challenges
1. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSERC

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
