Research Article | Public Access
DOI: 10.1145/3514221.3517849

Complaint-Driven Training Data Debugging at Interactive Speeds

Published: 11 June 2022

Abstract

Modern databases support queries that perform model inference (inference queries). Although powerful and widely used, inference queries are susceptible to incorrect results if the model is biased due to training data errors. Recent prior work, Rain, proposed complaint-driven data debugging, which uses user-specified errors in the output of inference queries (complaints) to rank the erroneous training examples most likely to have caused the complaint. This helps users interpret results and debug training sets. Rain combined influence analysis from the ML literature with relaxed query provenance polynomials from the DB literature to approximate the derivative of a complaint w.r.t. each training example. Although effective, its runtime is O(|T|d), where |T| and d are the training set size and model size, due to its reliance on the model's second-order derivatives (the Hessian). On a Wide Residual Network (WRN) model with 1.5 million parameters, it takes over a minute to debug a single complaint. We observe that most complaint-debugging costs are independent of the complaint, and that modern models are overparameterized. In response, Rain++ uses precomputation techniques, based on non-trivial insights unique to data debugging, to reduce debugging latencies to a constant independent of model size. We also develop optimizations for the case where the queried database is known a priori, and for standing queries over streaming databases. Combining these optimizations, Rain++ achieves interactive debugging latencies (~1ms) on models with millions of parameters.
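To make the Hessian-based cost structure concrete, the following is a minimal, illustrative sketch of influence-function ranking in the style of Koh and Liang [22], on a toy logistic-regression model. It is not Rain or Rain++ code; the data, function names, and the plain gradient-descent trainer are all assumptions for illustration. It shows why each complaint requires a Hessian-inverse-vector product (the O(|T|d) bottleneck the abstract refers to), and how the resulting scores rank suspicious training examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n training examples, d model parameters.
d, n = 5, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)
y[0] = 1.0 - y[0]  # inject one label error into the training set

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, l2=1e-2, steps=500, lr=0.5):
    """Train by plain gradient descent on L2-regularized log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w

w = fit(X, y)

def grad_loss(x, yi, w):
    """Gradient of the log loss at one example (regularizer omitted)."""
    return (sigmoid(x @ w) - yi) * x

# Hessian of the regularized training loss: the d x d object whose
# inverse-vector products dominate per-complaint cost at scale.
p = sigmoid(X @ w)
H = (X.T * (p * (1 - p))) @ X / n + 1e-2 * np.eye(d)

# A "complaint": the user asserts this test point's correct label.
x_test = rng.normal(size=d)
y_test = float(x_test @ w_true > 0)

# One Hessian solve per complaint: v = H^{-1} grad(complaint loss).
v = np.linalg.solve(H, grad_loss(x_test, y_test, w))

# Influence of up-weighting each training example on the complaint's
# loss (Koh & Liang's sign convention); one dot product per example.
scores = np.array([-grad_loss(X[i], y[i], w) @ v for i in range(n)])

# Largest scores flag examples whose removal is predicted to most
# reduce the complaint's loss, i.e., the likeliest culprits.
ranking = np.argsort(-scores)
print("Most suspicious training example:", ranking[0])
```

Note how the solve against H is independent of which training example is scored: precomputing (or approximating) such complaint-independent factors is the kind of opportunity the abstract says Rain++ exploits to reach constant per-complaint latency.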

References

[1]
Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Atul Shenoy, Asvin Ananthanarayan, John Sheu, Erik Meijer, Xi Wu, Jeff Naughton, Peter Bailis, and Matei Zaharia. 2018. DIFF: A Relational Interface for Large-Scale Data Explanation. Proc. VLDB Endow., Vol. 12, 4 (Dec. 2018), 419--432. https://doi.org/10.14778/3297753.3297761
[2]
Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for Aggregate Queries. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Athens, Greece) (PODS '11). Association for Computing Machinery, New York, NY, USA, 153--164. https://doi.org/10.1145/1989284.1989302
[3]
Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox Recommender System Using a Euclidean Transformation for Inner-Product Spaces. In Proceedings of the 8th ACM Conference on Recommender Systems (Foster City, Silicon Valley, California, USA) (RecSys '14). Association for Computing Machinery, New York, NY, USA, 257--264. https://doi.org/10.1145/2645710.2645741
[4]
Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2019. Data Validation for Machine Learning. https://mlsys.org/Conferences/2019/doc/2019/167.pdf
[5]
W. Cleveland and R. McGill. 1984. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. J. Amer. Statist. Assoc., Vol. 79 (1984), 531--554.
[6]
George Corliss, Christèle Faure, Andreas Griewank, Laurent Hascoët, and Uwe Naumann (Eds.). 2002. Differentiation Methods for Industrial Strength Problems. Springer New York, New York, NY. https://doi.org/10.1007/978-1-4613-0075-5_1
[7]
Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., Redhook, NY, USA, 18518--18529. https://proceedings.neurips.cc/paper/2020/file/d77c703536718b95308130ff2e5cf9ee-Paper.pdf
[8]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[9]
Carl Eckart and Gale Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika, Vol. 1, 3 (1936), 211--218.
[10]
Open Neural Network Exchange. 2019. ONNX. https://onnx.ai/. [Online; accessed 1-December-2020].
[11]
Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net.
[12]
Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 2232--2241. https://proceedings.mlr.press/v97/ghorbani19b.html
[13]
W. D. Gray and D. Boehm-Davis. 2000. Milliseconds matter: an introduction to microstrategies and to their use in describing and predicting interactive behavior. Journal of experimental psychology. Applied, Vol. 6 4 (2000), 322--35.
[14]
Todd J. Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance Semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Beijing, China) (PODS '07). Association for Computing Machinery, New York, NY, USA, 31--40. https://doi.org/10.1145/1265530.1265535
[15]
Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. 2018. Gradient Descent Happens in a Tiny Subspace. CoRR, Vol. abs/1812.04754 (2018). arxiv: 1812.04754 http://arxiv.org/abs/1812.04754
[16]
Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. 2019. Data Cleansing for Models Trained with SGD. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper/2019/file/5f14615696649541a025d3d0f8e0447f-Paper.pdf
[17]
Jeffrey Heer and Michael Bostock. 2010. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA) (CHI '10). Association for Computing Machinery, New York, NY, USA, 203--212. https://doi.org/10.1145/1753326.1753357
[18]
Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow., Vol. 5, 12 (Aug. 2012), 1700--1711. https://doi.org/10.14778/2367502.2367510
[19]
Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of Conjugate Gradients for Solving Linear Systems. J. Res. Nat. Bur. Standards, Vol. 49, 6 (1952).
[20]
Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. 2021. Approximate Data Deletion from Machine Learning Models. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 130), Arindam Banerjee and Kenji Fukumizu (Eds.). PMLR, 2008--2016. https://proceedings.mlr.press/v130/izzo21a.html
[21]
JAX. 2020. JAX reference documentation - JAX documentation. https://jax.readthedocs.io/en/latest/. [Online; accessed 1-December-2020].
[22]
Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1885--1894. https://proceedings.mlr.press/v70/koh17a.html
[23]
Pang Wei W. Koh, Kai-Siang Ang, Hubert Teo, and Percy S. Liang. 2019. On the Accuracy of Influence Functions for Measuring Group Effects. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Red Hook, NY, USA. https://proceedings.neurips.cc/paper/2019/file/a78482ce76496fcf49085f2190e675b4-Paper.pdf
[24]
Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. (2009).
[25]
Cornelius Lanczos. 1950. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Government Press Office, Los Angeles, CA.
[26]
Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
[27]
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, New York, NY, USA, 13--24. https://doi.org/10.1109/ICDE51399.2021.00009
[28]
Z. Liu and J. Heer. 2014. The Effects of Interactive Latency on Exploratory Visual Analysis. IEEE Transactions on Visualization and Computer Graphics, Vol. 20 (2014), 2122--2131.
[29]
Z. Liu and J. Stasko. 2010. Mental Models, Visual Reasoning and Interaction in Information Visualization: A Top-down Perspective. IEEE Transactions on Visualization and Computer Graphics, Vol. 16 (2010), 999--1008.
[30]
Google LLC. 2019. Introduction to BigQuery ML. https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. [Online; accessed 10-October-2019].
[31]
Alexandra Meliou and Dan Suciu. 2012. Tiresias: The Database Oracle for How-to Queries. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 337--348. https://doi.org/10.1145/2213836.2213875
[32]
Zhengjie Miao, Qitian Zeng, Boris Glavic, and Sudeepa Roy. 2019. Going Beyond Provenance: Explaining Query Answers with Pattern-Based Counterbalances. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 485--502. https://doi.org/10.1145/3299869.3300066
[33]
OpenML. 2020. OpenML Supervised Classification on adult. https://www.openml.org/t/7592. [Online; accessed 1-December-2020].
[34]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 1135--1144. https://doi.org/10.1145/2939672.2939778
[35]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating Large-Scale Data Quality Verification. Proc. VLDB Endow., Vol. 11, 12 (Aug. 2018), 1781--1794. https://doi.org/10.14778/3229863.3229867
[36]
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18--21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 1631--1642. https://aclanthology.org/D13-1170/
[37]
SQLFlow. 2019. SQLFlow: Bridging Data and AI. https://sqlflow.org. [Online; accessed 1-December-2020].
[38]
Justin Talbot, V. Setlur, and A. Anand. 2014. Four Experiments on the Perception of Bar Charts. IEEE Transactions on Visualization and Computer Graphics, Vol. 20 (2014), 2152--2160.
[39]
Tensorflow. 2020. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla. [Online; accessed 1-December-2020].
[40]
Jason Teoh, Muhammad Ali Gulzar, and Miryung Kim. 2020. Influence-Based Provenance for Dataflow Applications with Taint Propagation. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC '20). Association for Computing Machinery, New York, NY, USA, 372--386. https://doi.org/10.1145/3419111.3421292
[41]
Aad W. van der Vaart. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
[42]
Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. 2015. Data X-Ray: A Diagnostic Tool for Data Errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). Association for Computing Machinery, New York, NY, USA, 1231--1245. https://doi.org/10.1145/2723372.2750549
[43]
Eugene Wu and Samuel Madden. 2013. Scorpion: Explaining Away Outliers in Aggregate Queries. Proc. VLDB Endow., Vol. 6, 8 (June 2013), 553--564. https://doi.org/10.14778/2536354.2536356
[44]
Weiyuan Wu, Lampros Flokas, Eugene Wu, and Jiannan Wang. 2020b. Complaint-Driven Training Data Debugging for Query 2.0. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1317--1334. https://doi.org/10.1145/3318464.3389696
[45]
Yinjun Wu, Edgar Dobriban, and Susan Davidson. 2020a. DeltaGrad: Rapid retraining of machine learning models. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daumé III and Aarti Singh (Eds.). PMLR, 10355--10366. https://proceedings.mlr.press/v119/wu20b.html
[46]
Yinjun Wu, Val Tannen, and Susan B. Davidson. 2020c. PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 447--462. https://doi.org/10.1145/3318464.3380571
[47]
Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR, Vol. abs/1708.07747 (2017). arxiv: 1708.07747 http://arxiv.org/abs/1708.07747
[48]
Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. 2018. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc., Redhook, NY, USA. https://proceedings.neurips.cc/paper/2018/file/cf05968255451bdefe3c5bc64d550517-Paper.pdf
[49]
Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. CoRR, Vol. abs/1605.07146 (2016). arxiv: 1605.07146 http://arxiv.org/abs/1605.07146
[50]
Xuezhou Zhang, Xiaojin Zhu, and Stephen J. Wright. 2018. Training Set Debugging Using Trusted Items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2--7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 4482--4489. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16155
[51]
Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2021. A Comprehensive Survey on Transfer Learning. Proc. IEEE, Vol. 109, 1 (2021), 43--76. https://doi.org/10.1109/JPROC.2020.3004555

Cited By

  • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering, Vol. 31, 2 (2024). https://doi.org/10.1007/s10515-024-00453-w
  • (2022) How I stopped worrying about training data bugs and started complaining. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning (DEEM '22), 1--5. https://doi.org/10.1145/3533028.3533305

Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022, 2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. data cleaning
  2. data provenance
  3. machine learning debugging

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
