DOI: 10.1145/3447548.3467455
Research article

MULTIVERSE: Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Analysis Approaches

Published: 14 August 2021

Abstract

Data analyses are built on a series of "decision points", including data filtering, feature operationalization and selection, model specification, and parametric assumptions. "Multiverse Analysis" research has shown that failing to explore these decisions can lead to non-robust conclusions that hinge on highly sensitive decision points. Importantly, even when a myopic analysis is technically correct, an analyst's focus on one set of decision points precludes exploring alternate formulations that may produce very different results. Prior work has also shown that analysts' exploration is often limited by their training, domain, and personal experience. However, supporting analysts in exploring alternative approaches is challenging and typically requires expert feedback that is costly and hard to scale.
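To make the notion of a decision point concrete, here is a minimal illustrative sketch (ours, not taken from the paper): the same preprocessing step written two defensible ways, differing only in how missing values are handled. An analyst who commits to one path never observes the other's results.

```python
# Hypothetical decision point: two reasonable ways to handle missing ages.
import pandas as pd

df = pd.DataFrame({"age": [22, None, 35, None, 41],
                   "survived": [1, 0, 1, 1, 0]})

# Path A: drop rows with missing values (shrinks the sample).
path_a = df.dropna(subset=["age"])

# Path B (alternative): impute the median age (keeps all rows).
path_b = df.assign(age=df["age"].fillna(df["age"].median()))

print(len(path_a), len(path_b))  # 3 vs. 5 rows -> different downstream estimates
```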
Here, we formulate the tasks of identifying decision points and suggesting alternative analysis approaches as a classification task and a sequence-to-sequence prediction task, respectively. We leverage public collective data analysis knowledge, in the form of code submissions to the popular data science platform Kaggle, to build the first predictive model that supports Multiverse Analysis. Specifically, we mine this code repository for 70k small differences between 40k submissions, and demonstrate that these differences often highlight key decision points and alternative approaches in their respective analyses. We leverage information on relationships within libraries through neural graph representation learning in a multitask learning framework. We demonstrate that our model, MULTIVERSE, correctly predicts decision points with up to 0.81 ROC AUC and alternative code snippets with up to 50.3% GLEU, and that it performs favorably compared to a suite of baselines and ablations. We show that when our model has perfect information about the location of decision points (e.g., provided by the analyst), its performance increases significantly, from 50.3% to 73.4% GLEU. Finally, we show through a human evaluation that real data analysts find alternatives provided by MULTIVERSE to be more reasonable, acceptable, and syntactically correct than alternatives from comparable baselines, including other transformer-based seq2seq models.
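As a rough sketch of the mining step described above (our simplification, not the paper's pipeline, which is based on the Myers diff algorithm), localized differences between two submissions can be surfaced with Python's standard difflib; short "replace" spans approximate the small differences that tend to mark decision points and their alternatives:

```python
# Illustrative sketch: surface small, localized differences between two
# code submissions. difflib's Ratcliff-Obershelp matcher stands in for
# the Myers diff used in the paper.
import difflib

def small_diffs(code_a: str, code_b: str, max_span: int = 3):
    """Yield (original, alternative) line spans that differ only slightly."""
    a, b = code_a.splitlines(), code_b.splitlines()
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag == "replace" and max(i2 - i1, j2 - j1) <= max_span:
            yield "\n".join(a[i1:i2]), "\n".join(b[j1:j2])

orig = "model = LogisticRegression()\nmodel.fit(X, y)"
alt = "model = RandomForestClassifier(n_estimators=100)\nmodel.fit(X, y)"
for before, after in small_diffs(orig, alt):
    print(before, "->", after)  # surfaces a model-specification decision point
```

GLEU, the generation metric reported above, scores n-gram overlap between a predicted snippet and a reference; NLTK ships a sentence-level implementation, shown here on hypothetical tokenized snippets:

```python
from nltk.translate.gleu_score import sentence_gleu

reference = "df = df . dropna ( )".split()     # ground-truth alternative
hypothesis = "df = df . fillna ( 0 )".split()  # model prediction
print(f"GLEU: {sentence_gleu([reference], hypothesis):.3f}")  # value in [0, 1]
```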

Supplementary Material

Presentation video (MP4): multiverse_mining_collective_data_science-mike_a_merrill-ge_zhang-38957981-nkg7.mp4


Cited By

• (2022) CORAL: COde RepresentAtion learning with weakly-supervised transformers for analyzing data analysis. EPJ Data Science 11(1). DOI: 10.1140/epjds/s13688-022-00327-9. Published online: 18 March 2022.


      Published In

      KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
      August 2021
      4259 pages
      ISBN:9781450383325
      DOI:10.1145/3447548
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 August 2021


      Author Tags

      1. code representation learning
      2. garden of forking paths
      3. metascience
      4. multiverse analysis
      5. robust data science
      6. seq2seq

      Qualifiers

      • Research-article

      Funding Sources

      • NSF

      Conference

      KDD '21

      Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)



Article Metrics

• Downloads (last 12 months): 37
• Downloads (last 6 weeks): 2

Download counts reflect usage up to 05 March 2025.

