Research Article | Public Access
DOI: 10.1145/3514221.3517849

Complaint-Driven Training Data Debugging at Interactive Speeds

Published: 11 June 2022

Abstract

Modern databases support queries that perform model inference (inference queries). Although powerful and widely used, inference queries are susceptible to incorrect results if the model is biased due to training data errors. Recent prior work, Rain, proposed complaint-driven data debugging, which uses user-specified errors in the output of inference queries (complaints) to rank the erroneous training examples most likely to have caused the complaint. This helps users interpret results and debug training sets. Rain combined influence analysis from the ML literature with relaxed query provenance polynomials from the DB literature to approximate the derivative of a complaint w.r.t. each training example. Although effective, its runtime is O(|T|d), where |T| and d are the training set size and model size, due to its reliance on the model's second-order derivatives (the Hessian). On a Wide Residual Network (WRN) model with 1.5 million parameters, it takes over a minute to debug a single complaint. We observe that most complaint-debugging costs are independent of the complaint, and that modern models are overparameterized. In response, Rain++ uses precomputation techniques, based on non-trivial insights unique to data debugging, to reduce debugging latencies to a constant independent of model size. We also develop optimizations for the case where the queried database is known a priori, and for standing queries over streaming databases. Combining these optimizations, Rain++ achieves interactive debugging latencies (~1ms) on models with millions of parameters.
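To make the Hessian-based cost structure concrete, the following is a minimal, illustrative sketch of influence-function ranking in the style of Koh and Liang [22], on a toy logistic-regression model. It is not Rain or Rain++ code; the data, function names, and the plain gradient-descent trainer are all assumptions for illustration. It shows why each complaint requires a Hessian-inverse-vector product (the O(|T|d) bottleneck the abstract refers to), and how the resulting scores rank suspicious training examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n training examples, d model parameters.
d, n = 5, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)
y[0] = 1.0 - y[0]  # inject one label error into the training set

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, l2=1e-2, steps=500, lr=0.5):
    """Train by plain gradient descent on L2-regularized log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w

w = fit(X, y)

def grad_loss(x, yi, w):
    """Gradient of the log loss at one example (regularizer omitted)."""
    return (sigmoid(x @ w) - yi) * x

# Hessian of the regularized training loss: the d x d object whose
# inverse-vector products dominate per-complaint cost at scale.
p = sigmoid(X @ w)
H = (X.T * (p * (1 - p))) @ X / n + 1e-2 * np.eye(d)

# A "complaint": the user asserts this test point's correct label.
x_test = rng.normal(size=d)
y_test = float(x_test @ w_true > 0)

# One Hessian solve per complaint: v = H^{-1} grad(complaint loss).
v = np.linalg.solve(H, grad_loss(x_test, y_test, w))

# Influence of up-weighting each training example on the complaint's
# loss (Koh & Liang's sign convention); one dot product per example.
scores = np.array([-grad_loss(X[i], y[i], w) @ v for i in range(n)])

# Largest scores flag examples whose removal is predicted to most
# reduce the complaint's loss, i.e., the likeliest culprits.
ranking = np.argsort(-scores)
print("Most suspicious training example:", ranking[0])
```

Note how the solve against H is independent of which training example is scored: precomputing (or approximating) such complaint-independent factors is the kind of opportunity the abstract says Rain++ exploits to reach constant per-complaint latency.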

References

[1]
Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, and Matei Zaharia. 2018. DIFF: A Relational Interface for Large-Scale Data Explanation. Proc. VLDB Endow., Vol. 12, 4 (Dec. 2018), 419--432. https://doi.org/10.14778/3297753.3297761
[2]
Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for Aggregate Queries. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Athens, Greece) (PODS '11). Association for Computing Machinery, New York, NY, USA, 153--164. https://doi.org/10.1145/1989284.1989302
[3]
Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces. In Proceedings of the 8th ACM Conference on Recommender Systems (Foster City, Silicon Valley, California, USA) (RecSys '14). Association for Computing Machinery, New York, NY, USA, 257--264. https://doi.org/10.1145/2645710.2645741
[4]
Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2019. Data Validation for Machine Learning. https://mlsys.org/Conferences/2019/doc/2019/167.pdf
[5]
W. Cleveland and R. McGill. 1984. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. J. Amer. Statist. Assoc., Vol. 79 (1984), 531--554.
[6]
George Corliss, Christèle Faure, Andreas Griewank, Laurent Hascoët, and Uwe Naumann (Eds.). 2002. Differentiation Methods for Industrial Strength Problems. Springer New York, New York, NY. https://doi.org/10.1007/978-1-4613-0075-5_1
[7]
Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., Redhook, NY, USA, 18518--18529. https://proceedings.neurips.cc/paper/2020/file/d77c703536718b95308130ff2e5cf9ee-Paper.pdf
[8]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[9]
Carl Eckart and Gale Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika, Vol. 1, 3 (1936), 211--218.
[10]
Open Neural Network Exchange. 2019. ONNX. https://onnx.ai/. [Online; accessed 1-December-2020].
[11]
Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net.
[12]
Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 2232--2241. https://proceedings.mlr.press/v97/ghorbani19b.html
[13]
W. D. Gray and D. Boehm-Davis. 2000. Milliseconds matter: an introduction to microstrategies and to their use in describing and predicting interactive behavior. Journal of experimental psychology. Applied, Vol. 6 4 (2000), 322--35.
[14]
Todd J. Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance Semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Beijing, China) (PODS '07). Association for Computing Machinery, New York, NY, USA, 31--40. https://doi.org/10.1145/1265530.1265535
[15]
Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. 2018. Gradient Descent Happens in a Tiny Subspace. CoRR, Vol. abs/1812.04754 (2018). arxiv: 1812.04754 http://arxiv.org/abs/1812.04754
[16]
Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. 2019. Data Cleansing for Models Trained with SGD. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper/2019/file/5f14615696649541a025d3d0f8e0447f-Paper.pdf
[17]
Jeffrey Heer and Michael Bostock. 2010. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA) (CHI '10). Association for Computing Machinery, New York, NY, USA, 203--212. https://doi.org/10.1145/1753326.1753357
[18]
Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., Vol. 5, 12 (Aug. 2012), 1700--1711. https://doi.org/10.14778/2367502.2367510
[19]
Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of Conjugate Gradients for Solving Linear Systems. J. Res. Nat. Bur. Standards, Vol. 49, 6 (1952).
[20]
Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. 2021. Approximate Data Deletion from Machine Learning Models. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 130), Arindam Banerjee and Kenji Fukumizu (Eds.). PMLR, 2008--2016. https://proceedings.mlr.press/v130/izzo21a.html
[21]
JAX. 2020. JAX reference documentation - JAX documentation. https://jax.readthedocs.io/en/latest/. [Online; accessed 1-December-2020].
[22]
Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1885--1894. https://proceedings.mlr.press/v70/koh17a.html
[23]
Pang Wei W. Koh, Kai-Siang Ang, Hubert Teo, and Percy S. Liang. 2019. On the Accuracy of Influence Functions for Measuring Group Effects. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper/2019/file/a78482ce76496fcf49085f2190e675b4-Paper.pdf
[24]
Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. (2009).
[25]
Cornelius Lanczos. 1950. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Government Press Office, Los Angeles, CA.
[26]
Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
[27]
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, New York, NY, USA, 13--24. https://doi.org/10.1109/ICDE51399.2021.00009
[28]
Z. Liu and J. Heer. 2014. The Effects of Interactive Latency on Exploratory Visual Analysis. IEEE Transactions on Visualization and Computer Graphics, Vol. 20 (2014), 2122--2131.
[29]
Z. Liu and J. Stasko. 2010. Mental Models, Visual Reasoning and Interaction in Information Visualization: A Top-down Perspective. IEEE Transactions on Visualization and Computer Graphics, Vol. 16 (2010), 999--1008.
[30]
Google LLC. 2019. Introduction to BigQuery ML. https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. [Online; accessed 10-October-2019].
[31]
Alexandra Meliou and Dan Suciu. 2012. Tiresias: The Database Oracle for How-to Queries. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 337--348. https://doi.org/10.1145/2213836.2213875
[32]
Zhengjie Miao, Qitian Zeng, Boris Glavic, and Sudeepa Roy. 2019. Going Beyond Provenance: Explaining Query Answers with Pattern-Based Counterbalances. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 485--502. https://doi.org/10.1145/3299869.3300066
[33]
OpenML. 2020. OpenML Supervised Classification on adult. https://www.openml.org/t/7592. [Online; accessed 1-December-2020].
[34]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 1135--1144. https://doi.org/10.1145/2939672.2939778
[35]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating Large-Scale Data Quality Verification. Proc. VLDB Endow., Vol. 11, 12 (Aug. 2018), 1781--1794. https://doi.org/10.14778/3229863.3229867
[36]
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18--21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 1631--1642. https://aclanthology.org/D13-1170/
[37]
SQLFlow. 2019. SQLFlow: Bridging Data and AI. https://sqlflow.org. [Online; accessed 1-December-2020].
[38]
Justin Talbot, V. Setlur, and A. Anand. 2014. Four Experiments on the Perception of Bar Charts. IEEE Transactions on Visualization and Computer Graphics, Vol. 20 (2014), 2152--2160.
[39]
Tensorflow. 2020. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla. [Online; accessed 1-December-2020].
[40]
Jason Teoh, Muhammad Ali Gulzar, and Miryung Kim. 2020. Influence-Based Provenance for Dataflow Applications with Taint Propagation. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC '20). Association for Computing Machinery, New York, NY, USA, 372--386. https://doi.org/10.1145/3419111.3421292
[41]
Aad W. van der Vaart. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
[42]
Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. 2015. Data X-Ray: A Diagnostic Tool for Data Errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). Association for Computing Machinery, New York, NY, USA, 1231--1245. https://doi.org/10.1145/2723372.2750549
[43]
Eugene Wu and Samuel Madden. 2013. Scorpion: Explaining Away Outliers in Aggregate Queries. Proc. VLDB Endow., Vol. 6, 8 (June 2013), 553--564. https://doi.org/10.14778/2536354.2536356
[44]
Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020b. Complaint-Driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1317--1334. https://doi.org/10.1145/3318464.3389696
[45]
Yinjun Wu, Edgar Dobriban, and Susan Davidson. 2020a. DeltaGrad: Rapid retraining of machine learning models. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daumé III and Aarti Singh (Eds.). PMLR, 10355--10366. https://proceedings.mlr.press/v119/wu20b.html
[46]
Yinjun Wu, Val Tannen, and Susan B. Davidson. 2020c. PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 447--462. https://doi.org/10.1145/3318464.3380571
[47]
Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR, Vol. abs/1708.07747 (2017). arxiv: 1708.07747 http://arxiv.org/abs/1708.07747
[48]
Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. 2018. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc., Redhook, NY, USA. https://proceedings.neurips.cc/paper/2018/file/cf05968255451bdefe3c5bc64d550517-Paper.pdf
[49]
Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. CoRR, Vol. abs/1605.07146 (2016). arxiv: 1605.07146 http://arxiv.org/abs/1605.07146
[50]
Xuezhou Zhang, Xiaojin Zhu, and Stephen J. Wright. 2018. Training Set Debugging Using Trusted Items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2--7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 4482--4489. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16155
[51]
Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2021. A Comprehensive Survey on Transfer Learning. Proc. IEEE, Vol. 109, 1 (2021), 43--76. https://doi.org/10.1109/JPROC.2020.3004555

Cited By

  • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering, Vol. 31, 2 (2024). https://doi.org/10.1007/s10515-024-00453-w
  • (2022) How I stopped worrying about training data bugs and started complaining. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning (DEEM '22), 1--5. https://doi.org/10.1145/3533028.3533305

Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022, 2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. data cleaning
  2. data provenance
  3. machine learning debugging

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
