research-article

Machine Learning and Data Cleaning: Which Serves the Other?

Authors:

Theodoros Rekatsinas

ACM Journal of Data and Information Quality (JDIQ), Volume 14, Issue 3

Article No.: 13, Pages 1 - 11

https://doi.org/10.1145/3506712

Published: 21 July 2022

Abstract

The last few years have witnessed significant advances in building automated or semi-automated data quality, data cleaning, and data integration systems powered by machine learning (ML). In parallel, the large-scale deployment of ML systems in business, science, the environment, and various other areas has exposed how strongly reliable predictions and insights depend on the quality of the input data to these ML models. This dual relationship between ML and data cleaning has been addressed by many recent research works under terms such as “Data cleaning for ML” and “ML for automating data cleaning and data preparation”. In this article, we highlight this symbiotic relationship between ML and data cleaning and discuss a few challenges that require the collaborative efforts of multiple research communities.

Published In

Journal of Data and Information Quality Volume 14, Issue 3

September 2022

155 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3533272

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2022

Online AM: 04 March 2022

Accepted: 01 December 2021

Received: 01 November 2021

Published in JDIQ Volume 14, Issue 3

Qualifiers

Research-article
Refereed
