DOI: 10.1145/3447548.3467455
Research article

MULTIVERSE: Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Analysis Approaches

Published: 14 August 2021

Abstract

Data analyses are built on a series of "decision points", including data filtering, feature operationalization and selection, model specification, and parametric assumptions. "Multiverse Analysis" research has shown that failing to explore these decisions can lead to non-robust conclusions that hinge on highly sensitive decision points. Importantly, even when a myopic analysis is technically correct, an analyst's focus on one set of decision points precludes exploring alternate formulations that may produce very different results. Prior work has also shown that analysts' exploration is often limited by their training, domain, and personal experience. However, supporting analysts in exploring alternative approaches is challenging and typically requires expert feedback that is costly and hard to scale.
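To make the notion of a decision point concrete, here is a minimal illustrative sketch (ours, not taken from the paper): the same preprocessing step written two defensible ways, differing only in how missing values are handled. An analyst who commits to one path never observes the other's results.

```python
# Hypothetical decision point: two reasonable ways to handle missing ages.
import pandas as pd

df = pd.DataFrame({"age": [22, None, 35, None, 41],
                   "survived": [1, 0, 1, 1, 0]})

# Path A: drop rows with missing values (shrinks the sample).
path_a = df.dropna(subset=["age"])

# Path B (alternative): impute the median age (keeps all rows).
path_b = df.assign(age=df["age"].fillna(df["age"].median()))

print(len(path_a), len(path_b))  # 3 vs. 5 rows -> different downstream estimates
```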
Here, we formulate the tasks of identifying decision points and suggesting alternative analysis approaches as a classification task and a sequence-to-sequence prediction task, respectively. We leverage public collective data analysis knowledge, in the form of code submissions to the popular data science platform Kaggle, to build the first predictive model that supports Multiverse Analysis. Specifically, we mine this code repository for 70k small differences between 40k submissions, and demonstrate that these differences often highlight key decision points and alternative approaches in their respective analyses. We leverage information on relationships within libraries through neural graph representation learning in a multitask learning framework. We demonstrate that our model, MULTIVERSE, correctly predicts decision points with up to 0.81 ROC AUC and alternative code snippets with up to 50.3% GLEU, and that it performs favorably compared to a suite of baselines and ablations. We show that when our model has perfect information about the location of decision points (e.g., provided by the analyst), its performance increases significantly, from 50.3% to 73.4% GLEU. Finally, we show through a human evaluation that real data analysts find alternatives provided by MULTIVERSE to be more reasonable, acceptable, and syntactically correct than alternatives from comparable baselines, including other transformer-based seq2seq models.
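As a rough sketch of the mining step described above (our simplification, not the paper's pipeline, which is based on the Myers diff algorithm), localized differences between two submissions can be surfaced with Python's standard difflib; short "replace" spans approximate the small differences that tend to mark decision points and their alternatives:

```python
# Illustrative sketch: surface small, localized differences between two
# code submissions. difflib's Ratcliff-Obershelp matcher stands in for
# the Myers diff used in the paper.
import difflib

def small_diffs(code_a: str, code_b: str, max_span: int = 3):
    """Yield (original, alternative) line spans that differ only slightly."""
    a, b = code_a.splitlines(), code_b.splitlines()
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag == "replace" and max(i2 - i1, j2 - j1) <= max_span:
            yield "\n".join(a[i1:i2]), "\n".join(b[j1:j2])

orig = "model = LogisticRegression()\nmodel.fit(X, y)"
alt = "model = RandomForestClassifier(n_estimators=100)\nmodel.fit(X, y)"
for before, after in small_diffs(orig, alt):
    print(before, "->", after)  # surfaces a model-specification decision point
```

GLEU, the generation metric reported above, scores n-gram overlap between a predicted snippet and a reference; NLTK ships a sentence-level implementation, shown here on hypothetical tokenized snippets:

```python
from nltk.translate.gleu_score import sentence_gleu

reference = "df = df . dropna ( )".split()     # ground-truth alternative
hypothesis = "df = df . fillna ( 0 )".split()  # model prediction
print(f"GLEU: {sentence_gleu([reference], hypothesis):.3f}")  # value in [0, 1]
```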

Supplementary Material

Presentation video (MP4): multiverse_mining_collective_data_science-mike_a_merrill-ge_zhang-38957981-nkg7.mp4


Cited By

• (2022) CORAL: COde RepresentAtion learning with weakly-supervised transformers for analyzing data analysis. EPJ Data Science 11(1). DOI: 10.1140/epjds/s13688-022-00327-9. Published online: 18 March 2022.


      Published In

      KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
      August 2021
      4259 pages
      ISBN:9781450383325
      DOI:10.1145/3447548
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 August 2021


      Author Tags

      1. code representation learning
      2. garden of forking paths
      3. metascience
      4. multiverse analysis
      5. robust data science
      6. seq2seq

      Qualifiers

      • Research-article

      Funding Sources

      • NSF

      Conference

      KDD '21

      Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)



Article Metrics

• Downloads (last 12 months): 37
• Downloads (last 6 weeks): 2

Download counts reflect usage up to 05 March 2025.

