research-article

Machine Learning and Data Cleaning: Which Serves the Other?

Published: 21 July 2022

Abstract

The last few years have witnessed significant advances in building automated or semi-automated data quality, data cleaning, and data integration systems powered by machine learning (ML). In parallel, the large-scale deployment of ML systems in business, science, the environment, and various other areas has made clear how strongly reliable predictions and insights depend on the quality of the data fed to these models. This dual relationship between ML and data cleaning has been addressed by many recent research works under terms such as “Data cleaning for ML” and “ML for automating data cleaning and data preparation”. In this article, we highlight this symbiotic relationship between ML and data cleaning and discuss a few challenges that require the collaborative efforts of multiple research communities.

Published In

Journal of Data and Information Quality  Volume 14, Issue 3
September 2022
155 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3533272

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2022
Online AM: 04 March 2022
Accepted: 01 December 2021
Received: 01 November 2021
Published in JDIQ Volume 14, Issue 3

Author Tags

  1. Machine learning
  2. data cleaning

Qualifiers

  • Research-article
  • Refereed

Cited By

  • (2025) Internet of things-driven approach integrated with explainable machine learning models for ship fuel consumption prediction. Alexandria Engineering Journal 118 (664-680). DOI:10.1016/j.aej.2025.01.067. Online publication date: Apr-2025
  • (2024) Rasterized Data Image Processing (RDIP) Techniques for Photovoltaic (PV) Data Cleaning and Application in Power Prediction. Energies 17:12 (3000). DOI:10.3390/en17123000. Online publication date: 18-Jun-2024
  • (2024) Data Validation Utilizing Expert Knowledge and Shape Constraints. Journal of Data and Information Quality 16:2 (1-27). DOI:10.1145/3661826. Online publication date: 25-Jun-2024
  • (2024) Demystifying Data Management for Large Language Models. Companion of the 2024 International Conference on Management of Data (547-555). DOI:10.1145/3626246.3654683. Online publication date: 9-Jun-2024
  • (2024) Automated Data Cleaning can Hurt Fairness in Machine Learning-Based Decision Making. IEEE Transactions on Knowledge and Data Engineering 36:12 (7368-7379). DOI:10.1109/TKDE.2024.3365524. Online publication date: Dec-2024
  • (2024) Location-Aware and Privacy-Preserving Data Cleaning for Intelligent Transportation. IEEE Transactions on Intelligent Transportation Systems 25:12 (20405-20418). DOI:10.1109/TITS.2024.3453340. Online publication date: 18-Sep-2024
  • (2024) Better, Not Just More: Data-centric machine learning for Earth observation. IEEE Geoscience and Remote Sensing Magazine 12:4 (335-355). DOI:10.1109/MGRS.2024.3470986. Online publication date: Dec-2024
  • (2024) Review of Artificial Intelligence Methods for Faults Monitoring, Diagnosis, and Prognosis in Hydroelectric Synchronous Generators. IEEE Access 12 (173599-173617). DOI:10.1109/ACCESS.2024.3502546. Online publication date: 2024
  • (2024) Opportunities and Challenges in Data-Centric AI. IEEE Access 12 (33173-33189). DOI:10.1109/ACCESS.2024.3369417. Online publication date: 2024
  • (2024) Early prediction of battery remaining useful life using CNN-XGBoost model and Coati optimization algorithm. Journal of Energy Storage 98 (113176). DOI:10.1016/j.est.2024.113176. Online publication date: Sep-2024
