Open access
Author
Date
2021
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Improving developer productivity is an important but very difficult task that researchers from both academia and industry have been trying to solve for decades. It has become even more challenging given the enormous scale at which today’s software is produced. There is, however, an upside to this scale: the increased availability of code creates an exciting opportunity to learn from these large datasets.
The goal of this work is to leverage these datasets and to create programming tools that accomplish tasks that were previously difficult or practically infeasible. We address this problem both at the foundational level, by developing new techniques that learn over existing code and synthesize new programs, and at the application level, by creating software tools based on these models.
First, we address the core task of learning probabilistic models of code that achieve state-of-the-art precision and are applicable across a variety of programming languages. For this, we developed a novel probabilistic model, we identified the right program representation to be compiled into that model, and we designed suitable learning and inference algorithms. The key novelty of our approach is that our probabilistic model is parametrized by a learned program, rather than a set of non-interpretable weights, as typically done in machine learning.
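The idea of a probabilistic model parametrized by a learned, interpretable program (rather than opaque weights) can be illustrated with a minimal sketch. This is a hypothetical toy, not the thesis's actual model: here the "context program" merely selects the previous token, whereas a learned program could walk the AST to choose a far richer conditioning context.

```python
from collections import Counter, defaultdict

def context_program(tokens, i):
    """Interpretable program selecting the conditioning context: here,
    simply the previous token; a richer program could walk the AST."""
    return tuple(tokens[max(0, i - 1):i])

class LearnedCodeModel:
    """Toy probabilistic model of code: P(token | context), where the
    context is computed by an interpretable program, not learned weights."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, token_sequences):
        # Estimate conditional token probabilities by counting over a corpus.
        for tokens in token_sequences:
            for i, tok in enumerate(tokens):
                self.counts[context_program(tokens, i)][tok] += 1

    def predict(self, tokens, i):
        """Most likely token at position i given the programmatic context."""
        ctx = context_program(tokens, i)
        if not self.counts[ctx]:
            return None
        return self.counts[ctx].most_common(1)[0][0]

model = LearnedCodeModel()
model.train([
    ["for", "i", "in", "range", "(", "n", ")"],
    ["for", "j", "in", "range", "(", "m", ")"],
])
print(model.predict(["for", "k", "in"], 3))  # prints: range
```

Because the conditioning context is produced by an explicit program, the model's predictions can be inspected and debugged, which is precisely what a parametrization by non-interpretable weights does not allow.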
Next, we address the problem of learning models of code that are not only accurate but also robust. This is a critical issue, as existing models have been shown to be highly non-robust: a small input modification (e.g., a code refactoring) can cause the model to consistently produce the wrong result, hindering the tools' adoption in practice and posing a potential security risk. This is a highly non-trivial task with several key challenges: learning the parts of the program relevant for the prediction without conditioning on the entire program, allowing the model to over-approximate the result when uncertain, and developing models that learn compositional rules. In our work, we solve this problem from two perspectives: first, from the programming languages angle, we learn interpretable rules of a static analyzer; second, from the machine learning perspective, we learn a robust deep learning model that infers type annotations for dynamically typed languages.
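The robustness requirement, including over-approximation under uncertainty, can be sketched with a hypothetical check: keep a prediction only if it is stable under semantics-preserving refactorings, and abstain otherwise. The refactoring and the brittle "type predictor" below are toy stand-ins invented for illustration, not the thesis's techniques.

```python
def rename_variable(code, old, new):
    # Toy semantics-preserving refactoring: whole-token variable rename.
    return " ".join(new if t == old else t for t in code.split())

def robust_predict(model, code, refactorings):
    """Return the prediction only if it is stable under all given
    refactorings; otherwise over-approximate by abstaining."""
    preds = {model(code)} | {model(r(code)) for r in refactorings}
    return preds.pop() if len(preds) == 1 else "ABSTAIN"

def brittle_model(code):
    # Toy predictor that keys on a variable name, so a rename flips it.
    return "int" if "count" in code.split() else "str"

code = "count = count + 1"
refactorings = [lambda c: rename_variable(c, "count", "n")]
print(robust_predict(brittle_model, code, refactorings))  # prints: ABSTAIN
```

Abstaining on unstable inputs trades coverage for soundness: the model no longer "consistently produces the wrong result" under an adversarial refactoring, which is the failure mode described above.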
Finally, we develop two tools, InferUI and FastSMT, that automate the tedious and inefficient task of writing programs in two different application domains: relational layouts for the Android platform and strategies for quickly solving SMT formulas. Both tools significantly improve upon programs written manually by domain experts: they prevent common layout errors and achieve a two-orders-of-magnitude speed-up over the Z3 solver, respectively. To make these tools practical, we combine program synthesis and machine learning. The synthesis component allows us to generate programs from a single input-output example, while the machine learning component enables the synthesis to scale and generalize to real-world programs and formulas.
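Synthesis from a single input-output example can be sketched as enumerative search over a small program space, with a ranking function standing in for the learned component that makes real synthesizers scale. This is a generic illustration under invented assumptions (a two-operator expression grammar and a smallest-constant ranker), not a sketch of InferUI or FastSMT themselves.

```python
from itertools import product

# Toy program space: expressions of the form op(x, c) for a constant c.
OPS = {"add": lambda x, c: x + c, "mul": lambda x, c: x * c}

def synthesize(example_in, example_out, max_const=10):
    """Enumerate candidate programs consistent with one I/O example,
    then pick one via a ranking heuristic (a stand-in for a learned model)."""
    candidates = []
    for (name, op), c in product(OPS.items(), range(1, max_const)):
        if op(example_in, c) == example_out:
            candidates.append((name, c))
    # A learned ranker would order candidates; here we prefer small constants.
    return min(candidates, key=lambda nc: nc[1]) if candidates else None

print(synthesize(3, 6))  # prints: ('mul', 2)
```

A single example usually leaves the program under-determined (here both `add 3` and `mul 2` fit), which is exactly why the ranking component matters: it selects the candidate most likely to generalize.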
Permanent link
https://doi.org/10.3929/ethz-b-000498126
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Brockschmidt, Marc
Examiner: Sutton, Charles
Examiner: Yahav, Eran
Examiner: Vechev, Martin
Publisher
ETH Zurich
Organisational unit
03948 - Vechev, Martin / Vechev, Martin
Funding
680358 - Learning from Big Code: Probabilistic Models, Analysis and Synthesis (EC)
Related publications and datasets