1 Introduction

Probabilistic programming languages (PPLs) enable users who are not experts in statistics to cleanly and concisely express probabilistic models [10, 11, 20]. They offer users simple abstractions and easy access to sophisticated statistical inference algorithms [4, 16, 18, 22] for analyzing their models.

However, writing a PPL model by hand is still challenging for non-statisticians and non-programmers. First, understanding data is difficult. Reviewing large amounts of data to develop a mental model is time-consuming, and humans are prone to misinterpretations and biases. Second, translating insights to a precise statistical model of the data is difficult. To write probabilistic models that reflect their insights, users must first learn some probability theory, understand the subtleties of various probability distributions, and express the details of how different variables in a model should depend on each other.

For these reasons, we believe PPL models should be synthesized from datasets automatically. PPL models offer an interesting point in the modeling design space. Expressing models in PPLs does not make them more expressive or more accurate, but it does give users access to powerful abstractions. They can easily ask how likely an event is in a model, performing complicated inference tasks with a single line of code. They can turn a generative model into a classifier or a predictor in under a minute. They can hypothesize alternative worlds or insert interventions and observe how those edits change outcomes. The PPL synthesis in this paper is not aimed at producing models that exceed the accuracy of state-of-the-art ML on important problems, but we believe a PL-centric approach does put usable, powerful models in the hands of non-experts.

To date, we know of one tool, PSketch [23], that synthesizes PPL programs. PSketch takes as input a PPL sketch and a dataset. A sketch, in this case, is a nearly complete PPL program, with some holes. Once a user expresses which variables may affect each hole, PSketch synthesizes expressions to fill the holes. While synthesizing partial PPL programs is already a tremendous step forward, the sketch writing process still requires users to carefully inspect the data, write most of the program structure, and specify causal dependencies. Ultimately, the user still writes a piece of code that is quite close to a complete model.

We introduce DaPPer (Data-Guided Probabilistic Program Synthesizer), a tool that synthesizes full PPL models from relational datasets. Our system decomposes the PPL synthesis problem into three stages. In the first stage, we generate a graph of dependencies between variables using one of three techniques: including all possible dependencies, analyzing the correlation between variables, or applying network deconvolution [8]. We use the dependency graph to write a program sketch that restricts the program structure. Second, we fill the holes in our sketch using a data-guided stochastic synthesis approach built on top of simulated annealing. At each iteration of our search, we mutate the candidate program and use the input dataset to tune some program parameters. We follow PSketch in computing the candidate’s score—its likelihood given the dataset—using Mixtures of Gaussian distributions. Finally, after we obtain an accurate program from the prior stage, we use a redundancy reduction algorithm to make the output program more readable while maintaining its accuracy.

We have evaluated our synthesizer on a suite of 14 benchmarks, a mix of existing PPL models and models designed to stress our tool. Each benchmark in the suite has 10,000 rows of training data and 10,000 rows of test data, both generated from the same probabilistic model; thus, each benchmark is also associated with a ground truth PPL program to which we can compare synthesized programs. In our experiments, our synthesizer produced accurate models in 21 s on average. To test whether our approach works on real data, we also used DaPPer to synthesize a model of airline delay data. Leveraging our target PPL’s built-in inference functionality, we used this model to predict flight delays.

This paper makes the following contributions:

  • We present a tool for synthesizing PPL models from data. To our knowledge, this is the first synthesizer that generates full PPL models.

  • We introduce a data-guided stochastic technique for generating candidate programs. Data-guidance improves synthesis time by two orders of magnitude compared to a data-blind approach.

  • We compare three techniques for generating dependency graphs from data.

  • We present an algorithm for improving program readability after synthesis while maintaining accuracy. We can reduce the size of a synthesized program by up to 8x with less than a 1% penalty in accuracy.

2 Probabilistic Programs

Probabilistic programming languages are standard languages with two additional constructs: (i) random variables whose values are drawn from probability distributions, and (ii) mechanisms for conditioning variable values on the observed values of other variables [12]. Although these constructs form a common backbone, other language features vary greatly from PPL to PPL [10, 11, 20]. Probabilistic programs offer a natural way to represent probabilistic models. For example, the classic burglary model can be expressed with the PPL program in Fig. 1.

Fig. 1. The classic burglary model in BLOG (a PPL).

The output of a probabilistic program is not a value but a probability distribution over all possible values. While a deterministic program produces the same value for a given variable during every execution, a probabilistic program may produce different values. The value of each variable is drawn from one or more probability distributions, as defined by the programmer. We can obtain an approximation of the distribution of a variable by running the program many times. For example, if we run the burglary program in Fig. 1 many times, we observe that Burglary has the value true in approximately 0.001 of the executions. Alarm becomes likely only if Burglary or Earthquake is true, and both are rare, so running the program also reveals that Alarm is usually false.

Programmers can add observations in PPLs to obtain posterior probability distributions conditioned on the observations. For example, if we run our sample program with an observation statement obs JohnCalls = true, the program rejects executions in which JohnCalls is false, and we observe that in many runs Burglary is also true. For a thorough introduction to PPLs, we recommend [12].
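To make this sampling intuition concrete, the following Python sketch mimics the burglary model and approximates the posterior by rejection sampling. It is an illustration, not the BLOG implementation; the conditional probability values are the standard textbook numbers for this example and are assumptions here, as are the function names.

    import random

    def run_burglary_model():
        # Priors and conditional probabilities: standard textbook values (assumed).
        burglary = random.random() < 0.001
        earthquake = random.random() < 0.002
        if burglary and earthquake:
            p_alarm = 0.95
        elif burglary:
            p_alarm = 0.94
        elif earthquake:
            p_alarm = 0.29
        else:
            p_alarm = 0.001
        alarm = random.random() < p_alarm
        john_calls = random.random() < (0.90 if alarm else 0.05)
        return burglary, john_calls

    # Rejection sampling: keep only executions consistent with obs JohnCalls = true.
    kept = [b for b, j in (run_burglary_model() for _ in range(200_000)) if j]
    print("P(Burglary | JohnCalls = true) is approximately", sum(kept) / len(kept))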

3 System Overview

We will explain DaPPer with a working example, a model of how a student’s tiredness and skill level affect performance on a test. The inputs to our tool are a relational dataset (e.g. Table 1) and a hypothesis about the direction of causal links between variables. Each column in a dataset is treated as a variable in the output program, and each row represents an independent run of the program. A causation hypothesis is an ordering of the dataset column identifiers; for our running example, the hypothesis might be \(\texttt {tired} \rightarrow \texttt {skillLevel} \rightarrow \texttt {testPerformance}\). The order specifies the direction of dependencies but does not restrict which variables are connected. From the given order, we can conclude that if tired and testPerformance are related, tired affects testPerformance, rather than testPerformance affecting tired. While this means that our tool still demands some insights from users, they only have to use their knowledge of the world to guess in which direction causality may flow; that is, our tool does not ask ‘is there a relationship between tiredness and test performance?’ but it asks ‘if there is a relationship between the two, does tiredness affect test performance, or does test performance affect tiredness?’ In Fig. 2, we show the ground truth program that produced the data in Table 1.

DaPPer generates PPL programs by decomposing the synthesis task into three subtasks: (i) dependency graph generation, (ii) data-guided stochastic synthesis, and (iii) redundancy reduction. In this section, we briefly discuss their roles, and how they interoperate, as illustrated in Fig. 3.

Table 1. Dataset with the tiredness, skill level, and test performance of several students.
Fig. 2. Running example: a program that expresses a model of how a student’s tiredness and skill level affect test performance.

Fig. 3. A system diagram illustrating the workflow of the synthesizer.

3.1 Dependency Graph Generation

The synthesizer’s first task is to determine whether any given random variable—any given dataset column—depends on any other random variables. This problem corresponds to the model selection problem in Bayesian networks. In our context, this is the dependency graph generation problem because a directed graph of which variables affect which other variables defines a program structure.

Fig. 4. The dependency graphs generated for the dataset sampled in Table 1.

Fig. 5. The program skeletons generated from the dependency graphs in Fig. 4. We use ?? to represent a part of the program that we have not yet synthesized.

We explore three techniques for generating dependency graphs. First, a Complete approach produces the largest possible dependency graph that the user’s causation hypothesis permits. Second, a Simple Correlation approach adds edges greedily in order of correlation, from highest to lowest, excluding edges for which there already exists a path. Third, we use an existing Network Deconvolution algorithm [8]. Given a dataset like the one in Table 1 and the causation hypothesis represented by the order of the columns (tired \(\rightarrow \) skillLevel \(\rightarrow \) testPerformance), the three techniques produce the dependency graphs depicted in Fig. 4.

Given a dependency graph, we generate a sketch. For each variable x depending on \(\{x_1, x_2, \dots \}\), we define x as a nested conditional expression with holes, one condition per parent variable. In each condition (\(x_i\) ?? ??), the first ?? is a comparison operator and the second ?? is an expression. Figure 5 shows the sketches generated from the dependency graphs in Fig. 4.
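As a rough illustration (not DaPPer’s internal representation), the Python sketch below expands a dependency graph into holed definitions of this shape; variable types and the per-value case splits visible in Fig. 5 are omitted.

    def sketch_for(var, parents):
        """Render one variable definition as a nested conditional with ?? holes."""
        body = "??"
        for p in reversed(parents):  # the innermost branch tests the last parent
            body = f"if ({p} ?? ??) then ?? else ({body})"
        return f"random {var} ~ {body};"

    # Dependency graph in the style of Fig. 4(a): variable -> parents.
    graph = {"tired": [], "skillLevel": ["tired"],
             "testPerformance": ["tired", "skillLevel"]}
    print("\n".join(sketch_for(v, ps) for v, ps in graph.items()))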

3.2 Data-Guided Stochastic Synthesis

The second task is to complete the holes ?? in the program sketch generated by the previous step. We use simulated annealing (SA) to complete this task.

Fig. 6. One mutation of the program in Fig. 5(c); highlighting marks the changes. (Color figure online)

In each SA iteration, DaPPer creates a new program candidate from a current candidate by (i) mutating an expression, then (ii) deterministically updating all parameters in the program associated with the mutated expression, using knowledge about the data. Consider the candidate program in Fig. 6(a), a completion of the Fig. 5(c) program sketch. To generate a new candidate, our tool randomly mutates one condition, changing the RHS of a comparison from 15 to 16; this produces the sketch in Fig. 6(b). With the condition changed, the parameters for the distributions are out of date. DaPPer identifies all rows of the input data in which skillLevel is less than 16, then uses those rows to select parameters for the distributions in the true branch, and the remaining rows for the false branch, producing the new program in Fig. 6(c).

To evaluate programs, DaPPer uses a custom likelihood estimation approach.

3.3 Redundancy Reduction

Sometimes the most accurate model that results from the synthesis step is a model that distinguishes between many cases, even more than the ground truth. Such models can be very large, often unreadable. Because our goal is to produce readable programs from which users can extract high-level insights, this is undesirable for our purposes.

To improve the readability of our outputs, we developed a redundancy reduction algorithm that collapses similar branches. Because this is applied as a final processing stage to an already synthesized program, the reduction process is very fast, and users can easily and quickly tune the amount of reduction to their needs, based on the output at hand.

4 Language

DaPPer generates programs from the grammar shown in Fig. 7. This grammar represents a subset of the BLOG language [20], centered on the features necessary for declaring random variables. While BLOG has many interesting features that set it apart from other PPLs, such as open-universe semantics, our synthesized programs do not make use of these. DaPPer needs only the features that allow it to introduce random variables drawn from distributions and describe how they depend on other random variables.

Many PPLs can express such programs, so many PPLs would be reasonable target languages. We chose to synthesize programs in the BLOG language, but with small changes to our code generator, DaPPer could easily target others.

Fig. 7. The subset of the BLOG language used by our synthesizer.

5 Generating Dependency Graphs

We use dependency graphs to generate a skeleton program structure, a program with variable declarations and partial conditional expressions in their definitions, but without the conditions or bodies. To generate these dependency graphs, we have tested three approaches, which we describe here.

5.1 Complete

The Complete approach constructs a dependency graph based on the assumption that each variable depends on all other variables that precede it in the user’s causation hypothesis. Note that this approach does not use the input dataset. If the user provides the hypothesis \(A \rightarrow B \rightarrow C \rightarrow D\), the Complete approach produces a graph in which A depends on no other variables, B depends on \(\{A\}\), C depends on \(\{A, B\}\), and D depends on \(\{A, B, C\}\).
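This construction is mechanical; a minimal Python transcription (the dictionary encoding of the graph is illustrative) is:

    def complete_graph(hypothesis):
        """Each variable depends on every variable earlier in the causation hypothesis."""
        return {v: list(hypothesis[:i]) for i, v in enumerate(hypothesis)}

    print(complete_graph(["A", "B", "C", "D"]))
    # {'A': [], 'B': ['A'], 'C': ['A', 'B'], 'D': ['A', 'B', 'C']}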

If the causation hypothesis is correct, this approach is always sufficient to express the ground truth. It breaks the outcomes into the largest number of cases and thus theoretically allows the greatest customization of the program to the input data. However, it may also introduce redundancy, distinguishing between cases even when the distinction does not affect the outcome.

5.2 Correlation Heuristic

The correlation heuristic approach uses information from the input dataset as well as the causation hypothesis. It calculates the correlation for every pair of columns in the dataset. The pairs are sorted according to the effect size of the correlation. We iterate through the sorted list of pairs, checking for each pair (A, B) whether there is already a path in the dependency graph between A and B. If yes, we do nothing. If no, we add an edge between A and B; the direction of the edge is determined by the positions of A and B in the causation hypothesis ordering. If we reach a point in the list of pairs where the correlation effect size or statistical significance is very low, we stop adding edges.
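The sketch below captures the greedy loop, assuming a pandas DataFrame, Pearson correlation as the effect-size measure, and illustrative cutoff values; DaPPer’s actual measures and thresholds may differ (see Appendix A for columns with incompatible types).

    from itertools import combinations
    import networkx as nx
    from scipy import stats

    def correlation_graph(df, hypothesis, min_effect=0.05, max_p=0.05):
        """Add edges greedily in decreasing order of correlation strength."""
        order = {v: i for i, v in enumerate(hypothesis)}
        pairs = []
        for a, b in combinations(hypothesis, 2):
            r, p = stats.pearsonr(df[a], df[b])
            pairs.append((abs(r), p, a, b))
        g = nx.DiGraph()
        g.add_nodes_from(hypothesis)
        for effect, p, a, b in sorted(pairs, reverse=True):
            if effect < min_effect or p > max_p:
                break  # the remaining pairs are weaker still
            src, dst = (a, b) if order[a] < order[b] else (b, a)
            if not nx.has_path(g, src, dst):  # skip pairs already connected by a path
                g.add_edge(src, dst)
        return g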

See Appendix A for a summary of how we produce correlation measures for columns with incompatible types.

5.3 Network Deconvolution

Our final approach uses the network deconvolution algorithm, developed by Feizi et al. [8], a method for inferring which nodes in a network directly affect each other, based on an observed correlation matrix that reflects both direct and indirect effects. To build a PPL model, we can observe the correlation between each pair of columns, but should only condition a variable on the variables that directly affect it. Thus, a direct link from x to y in network deconvolution corresponds to an immediate dependence of y on x in a PPL model.

Network deconvolution takes as input a similarity matrix. We use a symmetric square matrix of correlations, reflecting the association of each column with each other column in the dataset. Each entry in the network deconvolution output matrix represents the likelihood that a column pair is connected by a direct edge. For each entry in the output matrix, we add an edge to the dependency graph if it is above a low threshold.
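A minimal sketch of this step, assuming a NumPy correlation matrix and an illustrative threshold, applies the core closed form of network deconvolution, \(G_{dir} = G_{obs}(I + G_{obs})^{-1}\), via an eigendecomposition; the eigenvalue-scaling step of the full algorithm [8] is omitted here.

    import numpy as np

    def deconvolved_edges(corr, threshold=0.1):
        """Return a boolean matrix marking column pairs judged to be directly linked.
        Edge direction is then taken from the causation hypothesis ordering."""
        sym = (corr + corr.T) / 2.0
        np.fill_diagonal(sym, 0.0)
        vals, vecs = np.linalg.eigh(sym)       # symmetric eigendecomposition
        direct_vals = vals / (1.0 + vals)      # assumes eigenvalues scaled away from -1
        direct = vecs @ np.diag(direct_vals) @ vecs.T
        return direct > threshold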

6 Data-Guided Stochastic Synthesis

DaPPer applies simulated annealing (SA) to synthesize PPL programs. We use an exponential cooling schedule and the standard Kirkpatrick acceptance probability function. To apply SA, we must generate new candidate programs that are ‘adjacent’ to existing programs. Our synthesizer decomposes the process of creating a new candidate program into two stages: make a random mutation of the current candidate program (Sect. 6.1), then tune all parameters in the affected subtree of the AST to best match the input dataset (Sect. 6.2). Once a new candidate has been produced, SA scores the candidate to decide whether to accept or reject it (Sect. 6.3).
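The overall loop is the standard one; the Python skeleton below is generic, with neighbor standing for the mutate-and-retune step of Sects. 6.1–6.2 and score for the likelihood estimate of Sect. 6.3. The iteration count, initial temperature, and cooling rate shown are illustrative, not DaPPer’s settings.

    import math, random

    def anneal(initial, neighbor, score, iters=20000, t0=1.0, cooling=0.9995):
        """Simulated annealing with exponential cooling and Kirkpatrick acceptance."""
        current, current_score = initial, score(initial)
        best, best_score = current, current_score
        temp = t0
        for _ in range(iters):
            candidate = neighbor(current)
            cand_score = score(candidate)
            delta = cand_score - current_score      # higher likelihood is better
            if delta >= 0 or random.random() < math.exp(delta / temp):
                current, current_score = candidate, cand_score
                if cand_score > best_score:
                    best, best_score = candidate, cand_score
            temp *= cooling                         # exponential cooling schedule
        return best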

6.1 Mutations

The first step in creating a new adjacent program is to randomly mutate the current program. We allow three classes of mutation, described below.

Conditions. Our synthesizer is permitted to synthesize conditions of a restricted form (see cond in Fig. 7). They must have a single identifier (fixed based on the dependency graph) on the LHS, and an expression on the RHS. Because we deterministically generate fixed RHSs for conditions with Boolean and categorical variables in the LHS (e.g., boolVar == true, boolVar == false), the mutation process may not manipulate those conditions. Instead, its primary role is to generate new RHSs for conditions associated with real-valued variables. To alter a RHS, the mutator may: (i) replace any constant or use of a real-valued variable with a new constant or real-valued variable, (ii) slightly adjust a current constant, (iii) add, remove, or change a numOp, (iv) change a cmpOp.

Branches. Less commonly, the mutator may add or remove a condition associated with a real-valued variable. The mutator may not alter the structure of the dependency graph, but it may add a branch to an existing case split or remove a branch if it has more than two.

Distribution Selection. Finally, the mutator may alter what type of distribution appears in the body of a conditional, for definitions of real-valued variables. For instance, it may change a Gaussian to a UniformReal distribution.
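For concreteness, here is a heavily simplified sketch of the condition mutations described above, acting on a condition encoded as (LHS variable, comparison operator, RHS constant). The perturbation sizes and the particular subset of mutations shown are illustrative only; DaPPer’s mutator also manipulates RHS expressions containing variables and numOps.

    import random

    CMP_OPS = ["<", ">", "<=", ">="]

    def mutate_condition(cond):
        """Randomly adjust the RHS constant, draw a fresh constant, or swap the cmpOp."""
        lhs, op, rhs = cond
        kind = random.choice(["adjust_const", "new_const", "change_cmp"])
        if kind == "adjust_const":
            rhs = rhs + random.gauss(0, abs(rhs) * 0.05 + 0.1)   # small nudge
        elif kind == "new_const":
            rhs = random.uniform(rhs - 10, rhs + 10)             # fresh constant
        else:
            op = random.choice([o for o in CMP_OPS if o != op])
        return (lhs, op, rhs)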

6.2 Data Guidance

Once we obtain the control flow for a new candidate program from the mutator, we tune the distribution parameters to fit the dataset. For each distribution node in the abstract syntax tree, we identify the path condition associated with the node. We convert the path condition into a filter on the input dataset. For instance, consider the control flow in Fig. 8(a). To produce the parameter for the first Boolean distribution node, we would produce the path condition \(Burglary \wedge Earthquake\). Using this as a filter over the input dataset would produce the rows highlighted in Fig. 8(b). Once we have identified the subset of the dataset consistent with a distribution’s path condition, we use the subset to calculate appropriate distribution parameters. For instance, if the distribution is a Gaussian, we calculate the mean and variance.
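Assuming the dataset is held as a pandas DataFrame with one column per variable, the parameter update for the burglary example of Fig. 8 amounts to the following sketch; the column names are assumptions about the dataset layout.

    def tune_alarm_leaf(df):
        """Filter by the path condition Burglary AND Earthquake, then set the Boolean
        distribution's parameter to the frequency of Alarm among the matching rows."""
        rows = df[df["Burglary"] & df["Earthquake"]]
        return rows["Alarm"].mean()

    def tune_gaussian_leaf(df, path_mask, column):
        """For a real-valued leaf, the same filter yields the Gaussian's mean and variance."""
        rows = df.loc[path_mask, column]
        return rows.mean(), rows.var()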

Fig. 8. Filtering a dataset for a path condition in the classic burglary model. (Color figure online)

Once DaPPer completes this process for all distributions whose path conditions are affected by the mutation, the candidate generation stage is complete.

6.3 Likelihood Estimation

To evaluate how well a program models a dataset, we must compute the likelihood \(\mathcal {L}(P | D)\) of a candidate program P given the input data D. Computing the exact likelihood [6] requires expensive integral computations, which makes scoring slow. Thus, we adopt the method used by PSketch [23] for approximating likelihood using Mixtures of Gaussian (MoG) distributions, which they show is three orders of magnitude faster than precise likelihood calculation [23].

The approach is to symbolically approximate every variable in a candidate program with MoG or Bernoulli distributions. We approximate real-valued expressions using a MoG, whose probability density function (PDF) is:

$$\begin{aligned} MoG(x ; n,\varvec{w},\varvec{\mu },\varvec{\sigma }) = \sum _{i=1}^{n} w_i \cdot g(x ; \mu _i, \sigma _i) \end{aligned}$$

where \(\varvec{w},\varvec{\mu },\) and \(\varvec{\sigma }\) are vectors of size n, representing the weight, mean, and standard deviation of each Gaussian distribution in the mixture. The function g is the PDF of a univariate Gaussian distribution. A Boolean expression is simply modeled as a Bernoulli distribution, \(Brn(x ; p)\). Therefore, we approximate each variable v in a program as a MoG with the PDF \(MoG_v(x) = MoG(x ; n_v,\varvec{w_v},\varvec{\mu _v},\varvec{\sigma _v})\) or a Bernoulli with the PDF \(Brn(x ; p_v)\).

The approximation of \(\mathcal {L}(P | D)\) is the product of the likelihood of all possible values of all variables from D given the program P:

$$\begin{aligned} \mathcal {L}(P | D) = \prod _{v \in P_{RV}} \prod _{x \in D[v]} MoG_v(x) \times \prod _{v \in P_{BV}} \prod _{x \in D[v]} Brn(x ; p_v) \end{aligned}$$

where \(P_{RV}\) is a set of real variables and \(P_{BV}\) is a set of Boolean variables in the program P that appear in data D. D[v] is a set of values of the variable v in D.
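A small Python sketch of this estimate follows, working in log space to avoid underflow in the long product; the dictionary encoding of variables and parameters is illustrative.

    import math

    def gaussian_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def mog_pdf(x, weights, mus, sigmas):
        """PDF of a Mixture of Gaussians, as defined above."""
        return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

    def log_likelihood(real_vars, bool_vars, data):
        """log L(P | D): real_vars maps a column to MoG parameters (w, mu, sigma),
        bool_vars maps a column to its Bernoulli parameter, data maps columns to values."""
        ll = 0.0
        for v, (w, mu, sigma) in real_vars.items():
            ll += sum(math.log(mog_pdf(x, w, mu, sigma)) for x in data[v])
        for v, p in bool_vars.items():
            ll += sum(math.log(p if x else 1 - p) for x in data[v])
        return ll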

In addition to the distributions supported by PSketch, we add support for categorical distributions and uniform distributions, which we describe here.

Categorical Distribution. A categorical distribution specifies probabilities for each value in a finite discrete set. The PDF of a categorical distribution is:

$$\begin{aligned} Ctg(x ; \{ (x_i \rightarrow p_i) \mid i \in \{1,\dots ,k\} \}) = p_i \quad \text {when } x = x_i \end{aligned}$$

We introduce reduction rules to symbolically evaluate expressions with categorical distributions, shown in Fig. 9. The first rule evaluates an if expression, which may contain categorical distributions. The second rule evaluates a case expression. Although a case expression can be desugared to a nested if expression, the PDF that results is often less precise. In particular, \( case( [(Ctg_v(x) == x_1, Y_1), (Ctg_v(x) == x_2, Y_2), ...]) \) can be desugared to \( ite(Ctg_v(x) == x_1, Y_1, ite(Ctg_v(x) == x_2, Y_2, ...)) \), where \(Ctg_v(x) = Ctg(x ; \{ (x_i \rightarrow p_i) | i \in \{1,...,k_v\} \})\). When we evaluate the former expression, we expect the resulting distribution to be the summation of \(Y_1\), \(Y_2\), ..., \(Y_{k-1}\), \(Y_k\) weighted by \(p_1\), \(p_2\), ..., \(p_{k-1}\), \(p_k\) respectively. However, if we evaluate the latter expression using the first rule, we will obtain the summation of \(Y_1\), \(Y_2\), ..., \(Y_{k-1}\), \(Y_k\) weighted by \(p_1\), \((1-p_1)p_2\), ..., \((\prod _{i=1}^{k-2} (1-p_i))p_{k-1}\), \(\prod _{i=1}^{k-1} (1-p_i)\) respectively. This is because the ite rule is designed for the scenario in which path conditions are independent. Therefore, we introduce the case rule to handle case expressions whose conditions are dependent and mutually exclusive in order to obtain better likelihood estimations. The remaining rules in Fig. 9 define how to evaluate \(\oplus \) and \(\otimes \) used in the ite and case rules.
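For a concrete (hypothetical) instance, take \(k = 3\) with \(p = (0.2, 0.3, 0.5)\). Since \((1-0.2)\cdot 0.3 = 0.24\) and \((1-0.2)(1-0.3) = 0.56\), the two evaluations weight the branch distributions differently:

$$\begin{aligned} \text {case rule: } 0.2\,Y_1 + 0.3\,Y_2 + 0.5\,Y_3 \qquad \text {desugared ite: } 0.2\,Y_1 + 0.24\,Y_2 + 0.56\,Y_3 \end{aligned}$$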

Fig. 9. Reduction rules to symbolically execute expressions that use categorical distributions.

Uniform Real Distribution. A uniform real distribution has constant probability for any real number between a lower bound a and an upper bound b. The PDF of a uniform real distribution is defined as:

$$\begin{aligned} Uniform(x ; a,b) = \begin{cases} \frac{1}{b-a} &{} \text {if } a \le x \le b \\ 0 &{} \text {otherwise} \end{cases} \end{aligned}$$

We approximate a uniform distribution as follows:

$$\begin{aligned}&[\![ Uniform(x ; a,b) ]\!] := MoG(x ; n,\varvec{w},\varvec{\mu },\varvec{\sigma }) \\&\text {where } w_i = \frac{1}{n},\quad \mu _i = a + \left( i+\tfrac{1}{2}\right) \cdot \frac{b-a}{n},\quad \sigma _i = \frac{b-a}{n} \end{aligned}$$

We use \(n = 32\). We can obtain better approximations by increasing n, but at the cost of slower evaluation.
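A direct Python transcription of this approximation is given below; the indexing runs \(i = 0,\dots ,n-1\) here so that the component means stay inside \([a, b]\), which is an assumption about how the formula above is indexed.

    def uniform_to_mog(a, b, n=32):
        """Approximate Uniform(a, b) by n equally weighted, evenly spaced Gaussians."""
        width = (b - a) / n
        weights = [1.0 / n] * n
        mus = [a + (i + 0.5) * width for i in range(n)]
        sigmas = [width] * n
        return weights, mus, sigmas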

7 Redundancy Reduction

To improve the readability of our outputs, we have developed a simple redundancy reduction algorithm that combines the similar branches of a given conditional expression. The key idea is to compare the parameters of branches’ descendant distributions. For a pair of branches, we align the two sets of descendant distributions according to their associated path condition suffixes. These aligned pairs are the distributions that would be combined if we collapsed the two branches into a single branch covering executions that satisfy either path condition. The decision to collapse is based both on the differences between the aligned distributions’ parameters and on how much data DaPPer used to tune them.

For a concrete example, recall that in the classic Burglary model, MaryCalls does not depend on JohnCalls. After stochastic synthesis, we might see the following snippet within the MaryCalls definition: if JohnCalls then Boolean(0.010) else Boolean(0.008). The parameters of the distributions in these two branches are very close. Further, we know both parameters were trained on a small number of rows. Given normal reduction parameters, this would lead our algorithm to collapse the two branches into something like Boolean(0.009) – essentially to conclude that MaryCalls does not depend on JohnCalls.

Our algorithm accepts two user-selected threshold parameters. By manipulating these parameters, users can explore different levels of readability without having to re-synthesize. Reduction is applied as a fast post-processing step, after the bulk of the synthesis is completed, so users can quickly and easily tune the amount of redundancy reduction to their needs. They can choose to edit and run analyses on a readable version, knowing how much accuracy they have sacrificed, or they can temporarily adjust reduction parameters to extract high-level insights about program structure without permanently sacrificing accuracy.
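A rough sketch of the collapse decision is shown below, under an assumed encoding in which each aligned pair carries its distribution parameters and the number of rows used to tune them; the real algorithm and its thresholds are given in Appendix B.

    def should_collapse(pairs, alpha, min_rows):
        """pairs: (params_a, rows_a, params_b, rows_b) for each aligned distribution pair.
        Refuse to merge only if some well-supported pair differs by more than alpha."""
        for params_a, rows_a, params_b, rows_b in pairs:
            gap = max(abs(x - y) for x, y in zip(params_a, params_b))
            if gap > alpha and min(rows_a, rows_b) >= min_rows:
                return False
        return True

    def merge(params_a, rows_a, params_b, rows_b):
        """Combine two aligned distributions, weighting parameters by row counts."""
        total = rows_a + rows_b
        return [(x * rows_a + y * rows_b) / total for x, y in zip(params_a, params_b)]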

See Appendix B for the full algorithm and details of the motivation.

8 Limitations

We see several limitations of the current synthesizer, all of which suggest interesting directions for future work.

Relational Input Data. Our approach handles only relational datasets, specifically relational datasets that treat each row as an independent run of the program. While there are many such datasets and our tool already allows us to model many interesting processes, our current technique does not apply for datasets that cannot be transformed into this format. Thus, we cannot take advantage of some of the BLOG language’s more interesting features (e.g., open-universe semantics, which allows programs in the language to represent uncertainty about whether variables exist, or how many variables there are of a given type). For our current synthesis model, we must know the number of variables.

Hidden Variables. Our synthesis model assumes there are no hidden variables; that is, no additional columns of data about the world are necessary to produce a correct output. Our decision to exclude hidden variables is one of the crucial differences from the PSketch [23] approach. While PSketch is targeted at programmers looking to write functions that include randomness, we want our tool to be accessible to scientists and social scientists modeling real world phenomena. For these purposes, we expect that hypothesizing the existence of a hidden variable—which may have no correspondence to any real world variable—would only confuse the user and make output models less intelligible and less useful to the target audience. We see the addition of optional hidden variable introduction as an interesting technical challenge and a good direction for future work. However, we would never make hidden variable hypotheses the default. Although we believe this is an improvement over PSketch from the perspective of our target user, this is a limitation from the perspective of a PSketch user.

Restricted Grammar. For this first foray into synthesis of full PPL programs, we selected a fairly restricted grammar. We examined many existing BLOG models and designed a grammar that would express the sorts of models people already like to write. However, we find it easy to imagine models we would like to obtain that cannot be expressed in our chosen subset of BLOG. Although this may be the most serious limitation, it is also the most easily addressed, since we can cleanly extend our technique by simply expanding the grammar and the set of allowable mutations. Because each increase in the grammar size also expands the search space, this will probably be more than a trivial extension. We expect it may offer a good setting for exploring the use of a probabilistic language model and weighted search for a program synthesis application.

9 Evaluation

We evaluated DaPPer on a suite of 14 benchmarks. Each benchmark consists of 10,000 rows of training data, 10,000 rows of test data, and the ground truth BLOG program used to generate both datasets. Some benchmarks were taken directly from the sample programs packaged with the BLOG language. Since these were quite simple, we also wrote new programs to test whether our tool can synthesize more complex models. The DaPPer source and all benchmark programs and associated datasets are available at github.com/schasins/PPL-synthesis. DaPPer synthesized accurate programs for all benchmarks, taking less than 21 s on average. We also used DaPPer to generate a model of a real flight delay dataset.

9.1 Dependency Graph Generation

We start with an exploration of how dependency graphs affect synthesis time and accuracy. Recall that the Complete approach produces the largest possible dependency graph. The Correlation approach produces graphs smaller than Complete graphs but usually larger than Network Deconvolution (ND) graphs.

Figure 10(a) shows how the choice of dependency graph affects synthesis time. Each bar represents the average time DaPPer took to synthesize a program whose score on the test dataset is within 10% of the ground truth program’s. Each synthesis task ran for a fixed number of SA iterations. The Fig. 10(a) timing numbers represent the first point during those iterations at which the current candidate program achieves the target likelihood score on the test (holdout) dataset. For all benchmarks except ‘students,’ all dependency graphs were sufficient to reach the target accuracy within the allotted SA iterations. The average times to reach the target accuracy using the Complete and Correlation approaches were 20.90 and 55.77 s, respectively. If we exclude ‘students,’ we can compare all three; Complete averaged 12.54 s, Correlation 44.40 s, and ND 15.02 s.

Fig. 10. Comparison of the three dependency graph generation approaches.

We observe a number of trends playing out in the timing numbers. First, Complete gains a small early time advantage by generating a dependency graph without examining the input data, while Correlation and ND both face the overhead of calculating correlations. This head start gave Complete the win for the first six benchmarks. Second, using the dependency graph closest to that of the ground truth confers a time advantage. If the ground truth dependency graph is dense, Complete is typically closest, so Complete finishes first (e.g., ‘tugwar-v1’). If the ground truth dependency graph is sparse, ND is usually closest, so ND finishes first (e.g., ‘tugwar-v3’). Finally, if any approach eliminates a necessary dependency, synthesis may fail to reach the target accuracy. Recall from Fig. 4 that ND dropped the direct dependence of testPerformance on tired. Thus, ND never reached the target accuracy for the ‘students’ benchmark.

Next, we evaluate whether DaPPer’s performance is within the acceptable range. We cannot make a direct comparison to the most similar tool, PSketch, because the PSketch task is substantially different. Instead, we include some PSketch performance numbers to give a sense of the acceptable timescale. PSketch synthesized partial PPL programs using small datasets (100–400 rows) in 146 s on average. Note that large datasets are desirable because they result in more accurate programs, but they make synthesis slower because the likelihood estimator must use all the data to calculate a score at each iteration. To synthesize part of the ‘burglary’ model, PSketch took 89 s, while DaPPer synthesized a full ‘burglary’ model in 0.17 s. Again, since the tasks are very different, these numbers do not indicate that DaPPer outperforms PSketch. However, we are satisfied with DaPPer’s synthesis times overall.

Figure 10(b) shows the likelihood scores of the final synthesized programs on the test datasets, normalized by the scores of the ground truth programs. Overall, the scores reached by the Complete, Correlation, and ND approaches, averaged across benchmarks, were 1.014, 1.024, and 1.051, respectively. As expected, larger dependency graphs typically allowed the synthesizer to reach better scores. Thus, Complete always produced likelihoods very close to those of the ground truth programs, with Correlation performing slightly worse, and ND worst of all. Still, even ND always produced likelihoods within 20% of the ground truth.

Given Complete’s dominance in both synthesis time and accuracy, we conclude that Complete is the best dependency graph approach of the three we tried. If we were to select an approach on a case-by-case basis, we would only switch away from the Complete strategy when faced with a dataset for which we strongly suspect the dependency graph is sparse, and even then only if faster synthesis time is more critical than accuracy. The dominance of the Complete approach drove us to develop our redundancy reduction algorithm, which allows us to recover small, readable programs from large ones.

9.2 Data-Guided vs. Data-Blind Stochastic Synthesis

One of the primary innovations of our tool is its use of input data not only to score candidate programs but also to generate them. We evaluate DaPPer against DaPPer-blind. DaPPer-blind is a simple data-blind variation on our tool. It is identical to DaPPer in every way except: (i) in addition to DaPPer mutations, it may mutate distribution parameters and (ii) it does not run data-guided parameter adjustment after mutations.

Recall that after each mutation, DaPPer identifies affected distributions and tunes their parameters to reflect the input data that corresponds to the new path condition. Thus, the data-guided approach has the advantage of always producing programs tuned to the input data. However, filtering the data and calculating the appropriate parameters does impose a time penalty. For this reason, DaPPer-blind can complete more mutations per unit of time than DaPPer. Thus, it is not immediately clear which approach will perform better.

Fig. 11. Score over time for data-blind vs. data-guided synthesis. Lower is better.

Our empirical evaluation reveals that the data-guided approach outperforms the data-blind approach. Figure 11 shows how the likelihood score changed over time for five runs of DaPPer and DaPPer-blind, for each benchmark. While 100% of the data-guided runs reached a likelihood within 10% of the ground truth’s likelihood, only 36% of the data-blind runs reached that same target level. For the data-guided runs, the average time to achieve the target likelihood was 20.9 s. For the 36% of data-blind runs that did reach the target likelihood, the average time was 151.6 s. Thus, using a data-guided program generation approach offers at least a 7x speedup compared to a data-blind approach.

We can acquire a better speedup estimate by including the benchmarks for which the data-blind approach never reached the target accuracy. For each benchmark, we identified the best (lowest) score that all runs managed to reach (i.e. the best score of the worst run). Reaching these scores took data-blind synthesis an average of 347.63 s and took data-guided synthesis an average of 0.54 s, indicating that data-guidance provides a 648.9x speedup.

9.3 Redundancy Reduction

To explore the tradeoff between accuracy and readability, we evaluated how much accuracy we lose by applying our redundancy reduction algorithm to our synthesized programs. While Network Deconvolution offers small, readable programs by default, the other techniques do not. In this section, we explore the effects of redundancy reduction on programs synthesized using Complete dependency graphs on a subset of the 14 benchmarks. We observe up to a 7.9x reduction in program size, with negligible decreases in accuracy.

Recall that we synthesize programs in which all AST leaf nodes are distribution nodes. For this reason, the number of distribution nodes is the most informative measure of program size and complexity.

As Fig. 12(a) reveals, the reduction process does not make program alterations that substantially alter the likelihood score. However, it does significantly reduce the size of the program, producing output programs that are much more readable. Figure 12(b) shows the effects on program size, depicting the ratio of the number of distribution nodes in the output to the number of distribution nodes required to express the ground truth. We see that as the reduction parameter \(\alpha \) increases, the synthesized programs ultimately converge to the ground truth range, but do so gradually enough that the user can explore a variety of different structures. As we see in Fig. 12(a), even when users set \(\alpha \) quite aggressively, the reduction algorithm does not tend to make merges that substantially alter the likelihood score. Most importantly, we think the benefits for readability and for extracting high-level insights are clear. For instance, in the case of the ‘healthiness’ model, reduction collapses a program with 127 distinct distribution nodes to a much more readable program with 16 distribution nodes.

Overall, we find that redundancy reduction allows us to benefit from the accuracy and fast synthesis times of the Complete approach without sacrificing readability and editability. For a more concrete illustration of the resultant readability, see Appendix C’s side-by-side comparison of a ground truth program and a synthesized program after redundancy reduction.

Fig. 12. Effects of redundancy reduction with varying \(\alpha \) parameter values.

9.4 Case Study: Airline Delay Dataset

Although testing on data for which we have a ground truth model is the best way to investigate whether DaPPer produces correct programs, we also want to be sure that DaPPer functions well on real data. To that end, we completed a case study using our tool to produce a probabilistic model of a popular airline delay dataset from the U.S. Department of Transportation [1]. We selected this dataset because it has already been thoroughly studied, explored, and visualized. Thus, although we lacked a ground truth PPL model for this dataset, we knew from past work that we should expect delays to vary according to days of the week [15] and to increase over the course of the day [35].

We ran DaPPer on a dataset with 447,013 rows. The output program indicates that delays vary by day, reflecting the findings of Hofmann et al. [15]. It also indicates that delays rise as the departure time (time of day) rises, reflecting the findings of Wicklin et al. [35]. Taking advantage of BLOG’s built-in inference algorithms, we used this model to predict flight delays on a holdout set of 10,000 dataset rows. On average, the model’s predictions were off by less than 15 min. While 15 min is substantial, it is worth noting that delays in the dataset range from \(-82\) to 1971 min. For comparison, a baseline predictor that always guessed the average flight delay had a root-mean-square error (RMSE) of 39.4, while the DaPPer predictor had an RMSE of 24.1.

10 Related Work

The body of research that addresses learning programs from data is far greater than we can cover, encompassing the entire fields of machine learning and program synthesis. Since we are interested in generating models that are both readable and probabilistic, we will limit our discussion to approaches that offer at least one of those characteristics.

10.1 Readable and Probabilistic

Of the related work, our goals are most closely aligned with the goals of PSketch [23]. The primary difference between our tool and PSketch is the target user. We want DaPPer to be accessible to a user who would not manually code even a partial PPL model. Naturally, this difference in target user comes with a number of technical differences. First and foremost, while PSketch requires the user to write a program sketch—including specifying which variables may affect each program hole—our tool requires no coding. This brings us to the second primary difference, which is that while PSketch can work over any dataset for which the user can write most of the model, our tool is targeted specifically at relational datasets in which each row represents an independent draw from the model. Third, our tool does not hypothesize the existence of hidden variables that do not appear in the input dataset. Fourth, PSketch is designed for small datasets (they tested on datasets up to size 400), while DaPPer is designed to handle datasets with hundreds of thousands of rows. To make this feasible, DaPPer uses data-guided mutations, while PSketch’s mutations are data-blind.

10.2 Readable and Deterministic

There has been a massive body of work in deterministic program synthesis and program induction, some of which may be useful in future iterations of our tool. Many synthesizers use off-the-shelf constraint solvers to search for candidate programs and verify their correctness [13, 32, 33]. We cannot directly apply these techniques to our problem since we do not have precise correctness constraints. Some recent work uses constraint solving to synthesize programs that optimize a cost function, with no precise correctness constraint [7]; unfortunately, this approach is only applicable to cost functions without floating-point computations, which makes it incompatible with likelihood estimation.

On the other hand, stochastic synthesis is a good fit for our problem. Our synthesizer is among the many that apply simulated annealing. Other stochastic synthesizers perform MCMC sampling [3, 29]. Some use symbolic regression or other forms of genetic programming [26, 30, 36, 37]. In the future, we may investigate how varying the search technique affects DaPPer’s performance.

Some tools use enumerative search. Previous work has shown that enumerative synthesizers outperform other synthesizers for some problems [2, 3, 5, 25, 34]. With custom pruning strategies, this may be another path to faster synthesis.

10.3 Unreadable and Probabilistic

The machine learning literature includes a rich body of work on learning Bayesian networks. Mainstream techniques fall into two categories: constraint-based and search-and-score. Constraint-based techniques focus on generating only the program structure. Search-and-score treats the problems of learning program structure and learning program parameters together.

Constraint-based techniques use statistical methods (e.g., chi-squared and mutual information) to identify relationships between variables in order to produce a network structure. In short, these techniques perform the same task as the first stage of our synthesizer. We have intentionally factored out the generation of the dependency graph from the rest of the synthesis process, which makes it easy to customize DaPPer with new structure learning approaches, including constraint-based Bayesian learning approaches. This is a direction we hope to explore in the future.

Search-and-score techniques produce not just the network structure but a complete Bayesian network. Often these techniques produce outputs that can be translated directly to PPL programs. This makes them seem like a natural fit with our goals. Unfortunately, existing Bayesian learning techniques cannot produce readable models in the presence of continuous variables.

There are many search-and-score techniques for learning Bayesian networks that can be applied to discrete variables [14, 19]. To extend them to continuous variables, one approach is standard discretization [17, 31, 38]. Where our approach first synthesizes a program structure, then searches over the space of conditionals, search-and-score first fixes the set of conditionals (via discretization), then searches over the space of program structures. This approach leads discretization-based tools to produce dense ASTs with high branching factors. An alternative technique uses mixtures of truncated exponentials (MTEs) to do discretization more flexibly [21, 28]. Despite the attempt to reduce discretization, this approach still produces models that use large sets of massive switch statements with a different exponential distribution at each of hundreds of leaf nodes. In short, discretized search-and-score methods produce models that are difficult to read, understand, and adapt. The output models are accurate, but they do not succinctly express high-level insights into data.

Aside from Bayesian networks, a new class of models called sum-product networks (SPNs) [9, 27] is both probabilistic and learnable. SPNs are not suitable for our use case because they are much less readable and editable even than machine-written Bayesian networks.

Although modeling multiple interacting variables is a more common goal, some work learns probabilistic models of individual distributions. We know of one tool designed to generate a fast sampler with outputs that mimic the distribution of a set of input numbers [24]. If it is fed samples from a Gaussian distribution, rather than learning that the input can be modeled by a Gaussian, it learns a fast sampling procedure that produces Gaussian-like data. This tool does not meet our needs because it can only model a single random variable, not interacting variables, but also because its outputs are difficult to read and interpret.

11 Conclusion

This paper offers an alternative way for users without statistics expertise to create probabilistic models of their data. DaPPer synthesizes models quickly and produces human-readable PPL programs that users can explore, expand, and adapt. We introduce data-guided program mutation, which allows PPL synthesis to scale to generating full programs. We hope this extends the class of users willing to venture into using probabilistic models. We believe offering users full PPL programs without asking them to write even a single line of code is an important step towards making PPLs more accessible.