1 Introduction

Many statistical learning applications make decisions under uncertainty. Probabilistic languages provide a natural way to model uncertainty by representing complex probability distributions as programs [8, 12, 19–21, 25, 32, 39, 46]. Exact probabilistic inference for programs with only discrete random variables is already a #P-hard computational problem [13]. Programs that have both discrete and continuous variables pose additional challenges, such as representing discrete and continuous components of the joint distribution, computing integrals, and managing a large number of terms in the joint distribution.

For these reasons, most existing probabilistic languages implement inference algorithms that calculate numerical approximations. The general approaches include sampling-based Monte Carlo methods [19–21, 32, 39, 40, 46] and projections to convenient probability distributions, such as variational inference [8, 34] or discretization [12, 31]. While these methods scale well, they typically come with no accuracy guarantees, since providing such guarantees is NP-hard [15].

At the same time, there has been a renewed research interest in symbolic inference methods due to their promise for computing more precise inference results. Existing symbolic inference works fall into different categories:

  • Approximate symbolic inference: Several analyses of graphical models approximate continuous distribution functions with a mixture of base functions, such as truncated exponentials or polynomials, which can be integrated more easily [11, 36, 43–45]. For instance, SVE [43] approximates distributions as piecewise low-rank polynomials.

  • Interactive symbolic inference: A user can write down the inference steps within modern computer algebra systems, such as Mathematica [3] and Maple [1]. These tools can help the user by automating parts of the integration and/or simplification of distribution expressions.

  • Exact symbolic inference: Bhat et al. [7] present a type-based analysis translating programs with mixed discrete/continuous variables into symbolic distribution expressions, but it does not simplify integral terms symbolically and instead computes them using a numerical integration library. Most recently, Hakaru [10, 38] optimizes probabilistic programs by translating a program into a distribution expression in a DSL within Maple’s expression language and simplifying this expression using Maple’s engine, before running (if necessary) the optimized program within an MCMC simulation.

While these works are promising steps, the practical effectiveness of exact symbolic inference in hybrid probabilistic models (with both discrete and continuous distributions) remains unknown, dictating the need for further investigation.

This Work. We present the PSI (Probabilistic Symbolic Inference) system, a comprehensive approach for automating exact probabilistic inference via program analysis. PSI’s analysis performs an end-to-end symbolic inference for probabilistic programs with discrete and/or continuous random variables. PSI analyzes a probabilistic program using a symbolic domain which captures the program’s probability distribution in a precise manner. PSI comes with its own symbolic optimization engine which generates compact expressions that represent joint probability density functions using various optimizations, including algebraic simplification and symbolic integration. The symbolic domain and optimizations are designed to strike a balance between the expressiveness of the probability density expressions and the efficiency of automatically computing integrals and generating compact densities.

Our symbolic analysis (Sect. 3) generalizes existing analyzers for exact inference on discrete programs (e.g., those that operate at the level of concrete states [12]). Our optimization engine (Sect. 4) can also automatically simplify many integrals in density expressions and thus directly improve the performance of works that generate unoptimized expressions, such as [7], without requiring the full complexity of a general computer algebra system, as in [10, 38]. As a result, PSI is able to compute precise and compact inference results even when the existing approaches fail (Sect. 5).

Contributions. Our main contributions are:

  • Symbolic inference for programs with continuous/discrete variables: A novel approach for fully symbolic probabilistic inference. The algorithm represents the posterior distribution within a symbolic domain.

  • Probabilistic inference system: PSI, an implementation of our algorithm together with optimizations that simplify the symbolic representation of the posterior distribution. PSI is available at http://www.psisolver.org.

  • Evaluation: The paper presents an experimental evaluation of PSI against state-of-the-art symbolic inference techniques – Hakaru, Maple, Mathematica, and SVE – on a corpus of 21 benchmarks selected from the literature. PSI produced correct and compact distribution expressions for more benchmarks than the alternatives. In addition, we compare PSI to state-of-the-art approximate inference engines, Infer.NET [34] and R2 [39], and show the benefits of exact inference.

Based on our results, we believe that PSI is the most effective exact symbolic inference engine to date and is useful for understanding the potential of exact symbolic inference for probabilistic programs.

2 Overview

Figure 1 presents the ClickGraph probabilistic program, adapted from a Fun language program from [34]. It describes an information retrieval model that calculates the posterior distribution of the similarity of two files, conditioned on the users’ access patterns to these files.

The program first specifies the prior distribution on the document similarity (line 2) and the recorded accesses to A and B for each user (lines 4–5). It then specifies a trial in which the variable sim is the similarity of the documents for an issued query (line 7). If the documents are similar, the probabilities of accessing them (p1 and p2) are the same, otherwise p1 and p2 are independent (lines 9–14). Finally, the variables clickA and clickB represent outcomes of the users accessing the documents (lines 16–17), and each trial produces a specific observation using the observe statements (lines 18–19). The return statement specifies that PSI should compute the posterior distribution of simAll.
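The following sketch, written in PSI’s source language, reconstructs the model from the description above; the concrete access data and some identifiers are placeholders and may differ from the program in Fig. 1.

def main() {
  simAll := Uniform(0, 1);            // prior on the document similarity
  clicksA := [1, 1, 0, 1, 0];         // recorded accesses to A (placeholder data)
  clicksB := [1, 1, 0, 0, 0];         // recorded accesses to B (placeholder data)
  for i in [0..5) {
    sim := Bernoulli(simAll);         // are the documents similar for this query?
    p1 := Uniform(0, 1);
    p2 := Uniform(0, 1);
    if sim {
      p2 = p1;                        // similar documents: same access probability
    }
    clickA := Bernoulli(p1);          // outcomes of the users accessing the documents
    clickB := Bernoulli(p2);
    observe(clickA == clicksA[i]);    // condition on the recorded accesses
    observe(clickB == clicksB[i]);
  }
  return simAll;
}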

2.1 Analysis

To compute the posterior distribution, PSI analyzes the probabilistic program via a symbolic domain that captures probability distributions, and applies optimizations to simplify the resulting expression after each analysis step.

Symbolic Analysis: For each statement, the analysis computes a symbolic expression that captures the program’s probability distribution at that point. The analysis operates forward, starting from the beginning of the function. As a pre-processing step, the analysis unrolls all loops and lowers the array elements into a sequence of scalars or inlined constants. The state of the analysis at each program point captures (1) the correct execution of the program as a map that relates live program variables \(x_1,\ldots , x_n\) to a symbolic expression e representing a probability density of the computation at this point in the program, and (2) erroneous executions (e.g., due to an assertion violation) represented by an aggregate error probability expression \(\bar{e}\).

Fig. 1. ClickGraph Example

The analysis of the first statement (line 2) identifies that the state consists of the variable simAll, which has the \({\small {\texttt {Uniform}}}(0,1)\) distribution. In general, for \(x:={\small {\texttt {Uniform}}}(a,b)\), the analysis generates the expression \([a\le x]\cdot [x\le b]/(b-a)\), which denotes the density of this distribution. The factors \([a\le x]\) and \([x\le b]\) are Iverson brackets, guard functions that equal 1 if their conditions are true, or equal 0 otherwise. Therefore, this density has a non-zero value only if \(x \in [a,b]\). In particular, for \(\small {\texttt {simAll}}\), the analysis generates \(e_{\small {\texttt {L.2}}}=[0\le \small {\texttt {simAll}}]\cdot [\small {\texttt {simAll}}\le 1]\).

Since the constant arrays are inlined, the analysis next processes the statement on line 7 (in the first loop iteration). The analysis adds the variable sim to the state and multiplies the expression \(e_{\small {\texttt {L.2}}}\) with the density function for the distribution Bernoulli(simAll):

$$\begin{aligned} e_{\texttt {L.7}}=[0\le \texttt {simAll}]\cdot [\texttt {simAll}\le 1]\cdot \left(\texttt {simAll}\cdot \delta (1-\texttt {sim})+(1-\texttt {simAll})\cdot \delta (\texttt {sim})\right) \end{aligned}$$

The expression \(e_{\texttt {L.7}}\) represents a generalized joint probability density over \(\small {\texttt {simAll}}\) and \(\small {\texttt {sim}}\). To encode discrete distributions like Bernoulli, the analysis uses Dirac deltas, \(\delta (e)\), which specify a distribution with point masses at the zeros of e.

Optimizations: After analyzing each statement, the analysis simplifies the generated distribution expression by applying equivalence-preserving optimizations:

  • basic algebraic manipulations (e.g., in the previous expression, an optimization can distribute multiplication over addition);

  • removal of factors with trivial or unsatisfiable guards (e.g., in this example the analysis checks whether a product \([0\le \small {\texttt {simAll}}]\cdot [\small {\texttt {simAll}} \le 1]\) is always equal to zero, and since it is not, leaves the expression unchanged);

  • symbolic integration of the distribution expressions; for instance, at the end of each loop iteration, the analysis expression \(e_\texttt {L.19}\) contains several loop-local variables: sim, p1, p2, clickA, and clickB. The analysis integrates out these local variables because they will not be referenced by the subsequent computation. It first removes the discrete variables sim, clickA, and clickB by exploiting the properties of Dirac deltas. For the continuous variables p1 and p2, it computes the antiderivative (indefinite integral) using PSI’s integration engine, finds the integration bounds, and evaluates the antiderivative on these bounds. After the analysis of the first loop iteration, this optimization reduces the size of the distribution expression from 22 to 6 summands.

Result of the Analysis: After analyzing the entire program, the analysis produces the final posterior probability density expression for the variable simAll, which we present graphically in Fig. 8. The analysis also computes that the final error probability \(\bar{e}_{L.21}\) is 0. This is the exact posterior distribution for this program.

2.2 Applications of PSI

PSI’s source language (with conditional and bounded loop statements) has the expressive power to represent arbitrary Bayesian networks, which encode many probabilistic models relevant in practice [22]. PSI’s analysis is analogous to the variable elimination algorithm for inference in graphical models. We anticipate that PSI can be successfully deployed in several contexts:

Probabilistic Inference: PSI allows a developer to specify several classes of queries. For joint posterior distributions, a user may return multiple variables in the return statement. The special operators \(\small {\texttt {FromMarginal}}(\small {\texttt {e}})\) and \(\small {\texttt {Expectation}}(\small {\texttt {e}})\) return the marginal distribution and the expectation of an expression e, respectively. A developer can also specify assertions using the assert(e) statement.
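As a small illustration, a coin-bias model could combine these query forms as follows (the exact syntax for returning several queries at once is assumed here for illustration):

def main() {
  bias := Uniform(0, 1);       // prior on the coin bias
  coin := Bernoulli(bias);
  observe(coin == 1);          // condition on an observed head
  assert(bias <= 1);           // always satisfied; contributes no error probability
  return (Expectation(bias), FromMarginal(bias));
}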

Testing and Debugging: The exact inference results produced by PSI can be used as reference versions for debugging and testing approximate inference engines. PSI can also be used to test existing computer algebra systems – using it, we found errors in Maple’s simplifier (see Sect. 5).

Sampling from Optimized Probabilistic Programs: Optimized distribution expressions generated by PSI’s symbolic optimizer can be used, in principle, for computing proposal distributions in MCMC simulations, as done by [7, 38].

Uncertainty Propagation Analysis: PSI’s analysis can serve as a basis for static analyses that propagate uncertainty through computations and determine error bars for the result. This provides a powerful alternative to existing analyses that are primarily sampling-based [9, 41], with at most limited support for simplifying algebraic identities that involve random variables [41].

Fig. 2. PSI’s source language syntax

3 Symbolic Inference

In this section we describe our core analysis: the procedure analyzes each statement in the program and produces a corresponding expression in our symbolic domain which captures probability distributions.

3.1 Source Language

Figure 2 presents the syntax of PSI’s source language. This is a simple imperative language that operates on real-valued scalar and array data. The language defines probabilistic assignments, which can assign a random value drawn from a distribution \(\mathrm {Dist}\), and \(\texttt {observe}\) statements, which allow constraining the values of probabilistic expressions. The language also supports the standard sequence, conditional statement, and bounded loop statement.

3.2 Symbolic Domain for Probability Distributions

Figure 3 presents the syntax of our symbolic domain. The domain can succinctly describe joint probability distributions with discrete and continuous components:

  • Basic terms include variables, numerical constants (such as \(\texttt {e}\) and \(\pi \)), logarithms and uninterpreted functions. These terms can form sums, products, or exponents. Division is handled using the rewrite \(a/b\rightarrow a\cdot b^{-1}\).

  • Dirac deltas represent distributions that have weight in low-dimensional sets (such as single points). In our analysis, they encode variable definitions and assignments, and linear combinations of Dirac deltas specify discrete distributions.

  • Iverson brackets represent functions that are 1 if the condition within the brackets is satisfied and 0 otherwise. In our analysis, they encode comparison operators and certain primitive probability distributions (e.g., \(\text {Uniform}\)).

  • Integrals and infinite sums are used during the analysis to represent marginalization of variables and \(\text {UniformInt}\) distributions respectively.

  • Gaussian antiderivative: \((\text{ d }/\text{ d }x)^{-1}[\text {e}^{-x^2}](e)\) is used to denote the function \(\int _{-\infty }^{e}\text{ d }x\, \text {e}^{-x^2}\), which cannot be expressed in terms of elementary functions (see the example below).
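For instance, such a term arises when a squared-exponential density is integrated over a half-line delimited by an Iverson bracket:
$$\begin{aligned} \int _{\mathbb R}\text {d}x\, [x\le c]\cdot \text {e}^{-x^2} = (\text{ d }/\text{ d }x)^{-1}[\text {e}^{-x^2}](c). \end{aligned}$$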

Fig. 3. Symbolic domain for probability distributions

We use the notation \(e\!\! \llbracket x_1,\ldots ,x_n\rrbracket \) to denote that a symbolic distribution expression e may contain free variables \(x_1,\ldots ,x_n\) that are bound by an outer operator (such as a sum or integral).

Our design of the symbolic domain aims to strike a balance between expressiveness – the kinds of distributions it can represent – and efficiency – the ability of the analysis to automatically integrate functions and find simple equivalent expressions. In particular, our symbolic domain enables us to define most discrete and continuous distributions from the exponential family and other well-known primitive distributions, such as Student-t and Laplace (see Appendix A).

Primitive Distributions: For each primitive distribution \(\text {Dist}\), we define two mappings, \(\text {PDF}_{\text {Dist}}\) and \(\text {Conditions}_{\text {Dist}}\), which specify the probability density function and the valid parameter and input ranges, respectively. For instance, the Bernoulli distribution with a parameter \(e_p\) has \(\text {PDF}_{\text {Bern}}(x,e_p) = e_p\cdot \delta (1-x)+(1-e_p)\cdot \delta (x)\) and \(\text {Conditions}_{\text {Bern}}=[0\le e_p]\cdot [e_p\le 1]\). We present the encodings of several other primitive distributions in Appendix A. Additionally, PSI allows the developer to specify an arbitrary density function for the resulting distribution using the SampleFrom(sym_expr, ...) primitive, which takes as inputs a distribution expression and a set of its parameters.
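As another illustration (a sketch: the density follows the Uniform encoding from Sect. 2.1, while the exact parameter condition used by PSI is not spelled out in the text and \([a<b]\) is shown as a natural choice):
$$\begin{aligned} \text {PDF}_{\text {Uniform}}(x,a,b) = [a\le x]\cdot [x\le b]\cdot (b-a)^{-1}, \qquad \text {Conditions}_{\text {Uniform}} = [a<b]. \end{aligned}$$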

Program State: A symbolic program state \(\sigma \) denotes a probability distribution over the program variables with an additional error state:

$$\begin{aligned} \sigma \in \varSigma \,\,{:}{:=}&\lambda M.\, \mathbf{case }\ M\ \mathbf{of }\ (x_1,\ldots ,x_n) \Rightarrow e_1\!\! \llbracket x_1,\ldots ,x_n\rrbracket ,\ \bot \Rightarrow e_2 \end{aligned}$$
(1)

In a regular execution, the state is represented with the variables \(x_1, \ldots , x_n\) and the posterior distribution expression \(e_1\). We represent the error state as a symbol \(\bot \) and the expression for the probability of error \(e_2\). Conceptually, the map \(\sigma \) associates a probability density with each concrete program state \(M\), which is either a tuple of values of program variables or the error state.
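For example, after the analysis of line 2 of the ClickGraph program (Sect. 2.1), the state contains the single variable simAll and has error probability 0:
$$\begin{aligned} \sigma _{\texttt {L.2}} = \lambda M.\,\mathbf{case }\ M\ \mathbf{of }\ (\texttt {simAll}) \Rightarrow [0\le \texttt {simAll}]\cdot [\texttt {simAll}\le 1],\ \bot \Rightarrow 0. \end{aligned}$$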

3.3 Analysis of Expressions

Figure 4 presents the analysis of expressions. The function \(A_e\) converts each expression of the source language to a transformer \(t\in \varSigma \rightarrow \varSigma \times E\) on the symbolic representation. The transformer returns both a new state (\(\sigma \in \varSigma \)) and a result of expression evaluation (\(e\in E\)), thus capturing side effects (e.g., sampling values from probability distributions or exhibiting errors such as division by zero).

Operations: The first five rules transform source language variables to distribution expression variables (including operators via the helper function SymbolicOp). The rules are standard, with Boolean constants true and false encoded as numbers 1 and 0, respectively. The rules compose the side effects of the operands. The division rule additionally uses the \(\text {Assert}\) helper function to add the guard \([e_2\ne 0]\) to the distribution expression and aggregate the probability of \(e_2=0\) to the overall error probability.
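As a sketch of this behavior, consider analyzing the expression \(1/x\) in a state whose regular part maps \((x)\) to a density \(e\llbracket x\rrbracket \) and whose error probability is \(\bar{e}\). The evaluation yields the result expression \(x^{-1}\) and the updated state
$$\begin{aligned} (x) \Rightarrow [x\ne 0]\cdot e\llbracket x\rrbracket ,\qquad \bot \Rightarrow \bar{e} + \text {MarginalizeAll}([x=0]\cdot e\llbracket x\rrbracket ), \end{aligned}$$
where \(\text {MarginalizeAll}\) (defined below) integrates out all free variables of the error term.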

Fig. 4. Symbolic analysis of expressions

Distribution Sampling: The expression \(\text {Dist}(\mathrm {se}_1,\ldots ,\mathrm {se}_n)\) accepts distribution parameters \(\mathrm {se}_1,\ldots ,\mathrm {se}_n\), which can be arbitrary expressions. For a primitive distribution \(\text {Dist}\), the analysis obtains expressions from the mappings \(\text {PDF}_{\text {Dist}}\) and \(\text {Conditions}_{\text {Dist}}\) (Sect. 3.2).

The rule first analyzes all of the distribution’s parameters (which can represent random quantities). To iterate over the parameters, the rule uses the helper function \(A^*_e\), which is defined inductively: it analyzes each parameter expression in turn, threading the intermediate symbolic states through the analysis and collecting the resulting symbolic expressions.

To ensure that distribution parameters have the correct values, the rule invokes a helper function \(\text {Assert}\), which adds guards from the \(\text {Conditions}_{\text {Dist}}\). Finally, the rule declares a fresh temporary variable \(\tau \) (specified by a predicate \(\text {FreshVar}\)), which is then distributed according to the density function \(\text {PDF}_{\text {Dist}}\), using the helper function \(\text {Distribute}\). In the definitions of \(\text {Assert}\) and \(\text {Distribute}\), we specified the states in their expanded forms (Eq. 1).

Marginalization: Marginalization aggregates the probability by summing up over the variables in an expression (e.g., local variables at the end of scope or variables in an error expression). To marginalize all variables, we define the function

$$\begin{aligned} \text {MarginalizeAll}(e\!\! \llbracket x_1,\ldots ,x_n\rrbracket )&:=\int _{\mathbb R}\text {d}x_1\cdots \int _{\mathbb R}\text {d}x_n e\!\! \llbracket x_1,\ldots ,x_n\rrbracket . \end{aligned}$$

The function KeepOnly performs selective marginalization. It takes as input the variables \(x'_1\ldots ,x'_m\) to keep and the input state \(\sigma \), and marginalizes out the remaining variables in \(\sigma \)’s distribution expressions:

$$\begin{aligned}&\text {KeepOnly}(x'_1,...,x'_m)(\lambda M.\,\mathbf{case }\,M\,\mathbf{of }\,(x_1,...,x_n)\Rightarrow e_1\!\! \llbracket x_1,...,x_n\rrbracket , \bot \Rightarrow e_2)=\\&\qquad \mathbf{let }\,\{x''_1,...,x''_l\}=\{x_1,...,x_n\}\setminus \{x'_1,...,x'_m\}\\&\qquad \!\!\mathbf{in }\, \lambda M.\,\mathbf{case }\,M\,\mathbf{of }\, (x'_1,...,x'_m)\Rightarrow \int _{\mathbb R}\text {d}x''_1\cdots \int _{\mathbb R}\text {d}x''_l\, e_1\!\! \llbracket x_1,...,x_n\rrbracket , \bot \Rightarrow e_2 \end{aligned}$$
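For instance, applying KeepOnly to a two-variable state and keeping only x integrates out y and leaves the error expression unchanged:
$$\begin{aligned}&\text {KeepOnly}(x)(\lambda M.\,\mathbf{case }\,M\,\mathbf{of }\,(x,y)\Rightarrow e\llbracket x,y\rrbracket , \bot \Rightarrow \bar{e})\\&\qquad = \lambda M.\,\mathbf{case }\,M\,\mathbf{of }\,(x)\Rightarrow \int _{\mathbb R}\text {d}y\, e\llbracket x,y\rrbracket ,\ \bot \Rightarrow \bar{e}. \end{aligned}$$
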
Fig. 5. Symbolic analysis of statements

3.4 Analysis of Statements

Figure 5 presents the definition of the function \(A_s\): it analyzes each statement and produces a transformer of states: \(\varSigma \rightarrow \varSigma \). The initial analysis state is defined as \(\sigma _0 = \lambda M.\,\mathbf{case }\,M\,\mathbf{of }\ \vec {x}\Rightarrow \varphi (\vec {x}),\ \bot \Rightarrow 0\). Here, the function F under analysis has parameters \(\vec {x}=(x_1,\ldots ,x_n)\), and \(\varphi \) is an uninterpreted function representing the joint probability density of \(\vec {x}\). If F has no parameters, we replace \(\varphi ()\) with 1.

Definitions: The statement \(x:=\mathrm {se}\) declares a new variable x and distributes it as a point mass centered at \(e\) (the symbolic expression corresponding to \(\mathrm {se}\)), i.e., the analysis binds x by multiplying the joint probability density by \(\delta (x-e)\).
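For example, analyzing the definition \(\texttt {x := y + 1}\) in a state whose regular part maps \((y)\) to \(e\llbracket y\rrbracket \) extends the state with the variable x:
$$\begin{aligned} (y,x) \Rightarrow e\llbracket y\rrbracket \cdot \delta (x-(y+1)). \end{aligned}$$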

Assignments: Analysis of assignments to existing variables (\(x=\mathrm {se}\)) consistently renames this variable and introduces a new variable with the previous name. The substitution \(\mathrm {se}\!\! \llbracket \tau /x\rrbracket \) renames x to \(\tau \) in the source expression \(\mathrm {se}\), since the variable being assigned may itself occur in \(\mathrm {se}\). The function \(\text {Rename}(x,\tau )\) alpha-renames all occurrences of the variable x to \(\tau \) in an existing state (\(\sigma \)) to avoid capture (Fig. 6). It is necessary to rename x in \(\mathrm {se}\) separately, because \(\mathrm {se}\) is a source program expression, while \(\text {Rename}\) renames variables in the analysis state.

Observations: Observations are handled by a call to the helper function \(\text {Observe}\) (Fig. 6), which conditions the probability distribution on the given expression being true. We do not renormalize the distribution after an observation, but only once, before reporting the final result (Sect. 3.5). Therefore, observations do not immediately change the error part of the distribution.
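As a sketch of the effect (assuming, consistently with the symbolic domain, that \(\text {Observe}\) multiplies the density by an Iverson bracket for the observed condition): analyzing \(\texttt {observe(x <= 3)}\) in a state with density \(e\llbracket x\rrbracket \) yields the unnormalized density
$$\begin{aligned} (x) \Rightarrow [x\le 3]\cdot e\llbracket x\rrbracket , \end{aligned}$$
with the error expression left unchanged.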

Fig. 6. Analysis of statements - helper functions

Conditionals: The analysis of conditionals first analyzes the condition, and then creates two copies of the resulting state \(\sigma _0\). In one of the copies, the condition is then observed to be true, and in the other copy, the condition is observed to be false. Analysis of the ‘then’ and ‘else’ statements \(s_1\) and \(s_2\) in the corresponding states yields \(\sigma _1\) and \(\sigma _2\). Finally, \(\sigma _1\) and \(\sigma _2\) are joined together by marginalizing all locally scoped variables, including temporaries created during the analysis of the condition, and then adding the distribution and the error terms (Join; Fig. 6). We subtract the error probability in the original state to avoid counting it twice.

3.5 Final Result and Renormalization

We obtain the final result by applying the state transformer obtained from analysis of the function body to the initial state and renormalizing it. We define the renormalization function as:

$$\begin{aligned}&\text {Renormalize}(\lambda M.\,\mathbf{case }\,M\,\mathbf{of }\, (x_1,\ldots ,x_n)\Rightarrow e_1\!\! \llbracket x_1,\ldots ,x_n\rrbracket , \bot \Rightarrow e_2):=\\&\qquad \mathbf{let }\, e_Z=\text {MarginalizeAll}(e_1)+e_2\\&\qquad ~\!\!\mathbf{in }\, \lambda M.\,\mathbf{case }\,M\,\mathbf{of }\, (x_1,\ldots ,x_n)\Rightarrow [e_Z\ne 0]\cdot e_1\!\! \llbracket x_1,\ldots ,x_n\rrbracket \cdot e_Z^{-1},\\&\ \ \qquad \qquad \qquad \qquad \quad \qquad \qquad \quad \;\,\bot \Rightarrow [e_Z\ne 0]\cdot e_2\cdot e_Z^{-1}+[e_Z=0] \end{aligned}$$

The function obtains a normalization expression \(e_Z\), such that the renormalized distribution expression of the resulting state integrates to 1. This way, PSI computes a normalized joint probability distribution for the function results that depends symbolically on the initial joint distribution of the function’s arguments.
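As a small instance of this definition: if the analysis of a one-variable program produces \(e_1 = [0\le x]\cdot [x\le 2]\) and error probability \(e_2 = 0\), then \(e_Z = \text {MarginalizeAll}(e_1) + 0 = 2\), and the renormalized state is
$$\begin{aligned} (x) \Rightarrow [0\le x]\cdot [x\le 2]\cdot 2^{-1},\qquad \bot \Rightarrow 0. \end{aligned}$$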

3.6 Discussion

Loop Analysis: PSI analyzes loops like for i in [0..N){...} (as mentioned in Sect. 2.1) by unrolling the loop body a constant N number of times. This approach also extends to loops where the number of iterations N is a random program variable. If N can be bounded from above by a constant \(N_{max}\), a developer can encode the loop as sketched below.

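The following is a sketch of one such encoding (the original figure is not reproduced here; the distribution of N and the constant bound are placeholders):

N := UniformInt(0, 10);    // random iteration count, bounded above by the constant 10
for i in [0..10) {
  if i < N {
    // ... original loop body ...
  }
}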

To handle for-loops with unbounded random variables and general while-loops, a developer can select \(N_{max}\) such that the probability of error (i.e., the probability that the loop runs for more than \(N_{max}\) iterations) is small enough. We anticipate that this approach can be readily automated. Related techniques [18, 42] employ similar approximations.

Function Call Analysis: PSI can analyze multiple functions, generating for each function \(f(x_1,...,x_n)\) the density expression of its m outputs, parameterized by the unknown distribution of the function’s n inputs. The distribution of the function inputs is represented by an uninterpreted function \(\varphi (x_1,...,x_n)\) which appears as a subterm in the output density expression.

The rule for the analysis of function calls (see Appendix B) first creates temporary variables \(a_1,\ldots ,a_n\) for each argument of f, and variables \(r_1,\ldots ,r_m\) for each result returned by f. The variables \(a_1,\ldots ,a_n\) are then initialized with the actual parameters \(e_1,\ldots ,e_n\) by multiplying the density of the caller by \(\prod _i \delta (a_i - e_i)\). The result variables in f’s density expression are renamed to match \(r_1,\ldots ,r_m\), and the uninterpreted function \(\varphi \) within f’s density expression is replaced with the new density of the caller (avoiding variable capture).

Formal Argument: A standard approach to prove that the translation from the source language to the target domain (in our case, the symbolic domain) is correct is to show that the transformation preserves semantics, as in [17]. This requires a specification of semantics for both the source language and the symbolic domain language. Below, we outline how one might approach such a formal proof using denotational semantics that map programs and distribution expressions to measure transformers.

Denotational semantics for the source language is easy to define by extending [30], but defining the measure semantics for distribution expressions is more challenging. Defining measure semantics for most expression terms in the symbolic domain is simple (e.g., the measure corresponding to a sum of terms is the sum of the measures of the terms). However, the semantics of expressions containing Dirac deltas is less immediate, since there is no general pointwise product when Dirac delta factors have overlapping sets of free variables.

To assign semantics to a product expression with Dirac delta factors, we therefore (purely formally) integrate the expression against the indicator function of the measured set and simplify it using Dirac delta identities until no Dirac deltas are left. The resulting term can then be easily interpreted as a measure. A formal proof will also need to show that this is a well-formed definition, i.e., that all ways of eliminating Dirac deltas lead to the same measure. Once the semantics for distribution expressions has been defined, the correctness proof proceeds as a straightforward induction over the source language production rules. We consider a complete formalized proof to be an interesting future work item.

4 Symbolic Optimizations

After each step of the analysis from Sect. 3, PSI’s symbolic engine simplifies the joint posterior distribution expressions. The algorithm of this optimization engine is a fixed point computation, which applies various symbolic transformations. We selected these transformations for their ability to optimize expressions that typically arise when analyzing probabilistic programs; they have proved effective on practical programs (as we discuss in Sect. 5). We next describe the three main groups of transformations.

4.1 Algebraic Optimizations

These optimizations implement basic algebraic identities. Some examples include removing zero-valued terms in addition expressions, removing one-valued terms in multiplication expressions, distributing exponents over products, or condensing equivalent summands and factors.

4.2 Guard Simplifications

For each term in an expression with multiple Iverson brackets and/or Dirac deltas, these optimizations analyze the constraints in the bracket factors and delta factors using sound but incomplete heuristics. PSI can then (1) remove the whole term if the constraints are inconsistent and therefore the term is always zero, (2) remove a factor if it is always satisfied, e.g., if both sides of an inequality are constants, or (3) remove a bracket factor if it is implied by other factors.
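For example, these heuristics enable rewrites such as
$$[x\le 0]\cdot [1\le x]\cdot e \;\rightarrow \; 0 \qquad \text {and}\qquad [0\le x]\cdot [x\le 1]\cdot [x\le 2] \;\rightarrow \; [0\le x]\cdot [x\le 1],$$
removing a term with inconsistent constraints in the first case and a bracket implied by the other factors in the second.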

Guard Linearization: Guard linearization analyzes complex Iverson brackets and Dirac deltas with the goal of rewriting expressions so that all contained constraints (expressions inside Iverson brackets and Dirac deltas) depend linearly on a specified variable x. It handles constraints that are easily recognizable as compositions of quadratic polynomials, products in which only one factor depends on x, and integer and fractional powers (including, in particular, multiplicative inverses).

One aspect that requires special care is that the integral of a Dirac delta along x depends on the partial derivative of its argument in the direction of x. For example, we have \(\delta (2x)=\frac{1}{2}\delta (x)\), and in general we have \(\delta (f(x))=\sum _i \frac{\delta (x-x_i)}{\left| f'(x_i)\right| }\text { for }f(x_i)=0\), whenever \(f'(x_i)\ne 0\). We ensure the last constraint by performing a case split on \(f'(x)=0\), and substituting the solutions for x into the delta expression in the “equals” case. For example, for \(0<y\), \(\delta (y-x^2)\) is linearized to \(\frac{1}{2\sqrt{y}}\left(\delta (x-\sqrt{y})+\delta (x+\sqrt{y})\right)\); the degenerate case \(y=0\), in which \(f'\) vanishes at the root, is handled by the case split.

We present the details of the guard linearization algorithm in Appendix C.

4.3 Symbolic Integration

These optimizations replace the integration terms with equivalent terms that do not contain integration symbols. If the integrated term is a sum, the symbolic engine integrates each summand separately and pulls all successfully integrated summands out of the integral. If the integrated term is a product, the symbolic engine pulls out all factors that do not contain the integration variable before performing the integration.

Integration of Terms with Deltas: The integration engine first attempts to eliminate the integration variable with a factor that is a Dirac delta, by applying the rule \(f(e) = \int _{\mathbb R}\text{ d }x f(x)\cdot \delta (x-e).\) The engine can often transform deltas that depend on the integration variable x in more complicated ways into equivalent expressions only containing x-dependent deltas of the above form, using guard linearization. This transformation is applied when evaluating the integral.
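For example, applying this rule to a delta that is already linear in the integration variable gives
$$\begin{aligned} \int _{\mathbb R}\text {d}x\, [0\le x]\cdot \text {e}^{-x}\cdot \delta (x-y) = [0\le y]\cdot \text {e}^{-y}. \end{aligned}$$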

Integration of Continuous Terms: The symbolic engine integrates continuous terms (without Dirac deltas) in several steps. First, it multiplies out all terms that contain the integration variable and groups all Iverson bracket factors into a single term. Second, it computes the lower and upper bounds of integration by analyzing the Iverson bracket term. If necessary, it first rewrites the term into an equivalent term within which all Iverson brackets specify the constraints on the integration variable in a direct fashion, using guard linearization. This is necessary because, in general, a single condition inside a bracket might not be equivalent to a single lower or upper bound for the integration variable. The integration bounds are then computed as the minimum of all upper bounds and the maximum of all lower bounds, where minimum and maximum are again encoded using Iverson brackets. Third, the symbolic engine applies a number of standard rules for obtaining antiderivatives, including integration of power terms, natural logarithms, and exponential functions, and integration by parts. We present the details of PSI’s integration rules in Appendix C.
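For example, when the Iverson brackets specify one lower and one upper bound on the integration variable, the engine computes
$$\begin{aligned} \int _{\mathbb R}\text {d}x\, [a\le x]\cdot [x\le b] = [a\le b]\cdot (b-a), \end{aligned}$$
where the bracket \([a\le b]\) accounts for the case of an empty integration range.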

5 Evaluation

This section presents an experimental evaluation of PSI and its effectiveness compared to the state-of-the-art symbolic and approximate inference techniques.

Implementation: We implemented PSI in the D programming language. PSI can produce resulting query expressions in several formats, including Matlab, Maple, and Mathematica. Our system and additional documentation, including the Appendix, are available at: http://www.psisolver.org.

Benchmarks: We selected two sets of benchmarks distributed with existing inference engines. Specifically, we used examples from R2 [39] and Fun programs from Infer.NET 2.5 [34]. We describe these benchmarks in Appendix D. We use the data sets and queries provided with the original computations. Out of 21 benchmarks, 10 have bounded loops. The loop sizes are usually equal to the sizes of the data sets (up to 784 data points in DigitRecognition). Since several benchmarks have data sets that are too large for any of the symbolic tools to successfully analyze, we report the results with truncated data sets.

Table 1. Comparison of Exact and Interactive Symbolic Inference Approaches.

5.1 Comparison with Exact Symbolic Inference Engines

Experimental Setup: For comparison with Mathematica 2015 and Maple 2015, we instruct PSI to skip symbolic integration and automatically generate distribution expressions in the formats of the two tools. We run Mathematica’s Simplify() and Maple’s simplify() commands. For Hakaru [10, 38] (commit e61cc72009b5cae1dee33bee26daa53c0599f0bc), we implemented the benchmarks as Hakaru terms in Maple, using the API exposed by the NewSLO.mpl simplifier (as recommended by the Hakaru developers). For each benchmark, we set a timeout of 10 minutes and manually compared the results of the tools.

Results: Table 1 presents the results of symbolic inference. For each benchmark, we present the types of variables it has and whether it has zero-probability observations. We also report the size of the data set provided by the benchmark (if applicable) and the size of the subset we used. For each tool we report the observed inference result. We mark a result as fully simplified if it does not have any integrals remaining and has a small number of remaining terms. We separately mark results that have some integral terms remaining and results that are only partially simplified. We mark a result as not normalized if a tool does not fully simplify the normalization constant. We also mark whether the execution of a tool crashed (\(\times \)), timed out (t/o), or produced an incorrect result (\(\times \times \)). For five benchmarks, the automatic conversion of PSI’s expressions could not produce Mathematica and Maple expressions because of the complexity of the benchmarks; we mark those entries as ‘–’. Hakaru’s simplifier does not handle zero-probability observations and expectation queries, and therefore we have not encoded these benchmarks (also marked as ‘–’).

PSI: PSI was able to fully symbolically evaluate many of the benchmark programs and generate compact symbolic distributions. Running PSI took less than a second for most benchmarks. The most time consuming benchmark was DigitRecognition, which PSI analyzed in 37 s. For two benchmarks, PSI was not able to remove all integral terms, although it simplified and removed many intermediate integrals.

Mathematica and Maple: For several benchmarks, both Mathematica and Maple did not produce a result before the timeout, or returned a non-simplified expression as the result. This indicates that the distribution expressions of these benchmarks are too complex, causing general computer algebra systems to navigate a huge search space. However, we note that these results are obtained for a mechanized translation of programs with the specific encoding we described in this paper. It is possible that a human-driven interactive inference with an alternative encoding may result in more simplified distribution expressions.

Maple crashed for AddFun/max and AddFun/sum. We identified that the crashes were caused by an infinite recursion and subsequent stack overflow during simplification. Four benchmarks – Coins, Evidence/model1, Evidence/model2, and Grass – produce results that are different from those produced by the other tools. For instance, Maple simplifies the density function of Coins to 0 (which is incorrect). We attribute this incorrectness to the way Maple integrates Dirac deltas and how it defines Heaviside functions (by default, they are undefined at input 0, but a user can provide a different setting [2]). In our evaluation, none of the alternative settings could yield the correct results. We reported these bugs to the Maple developers. These examples indicate that users should be cautious when using general computer algebra systems to analyze probabilistic programs.

Hakaru: For the ClinicalTrial1 benchmark, Hakaru produced a result that differs from PSI’s. To obtain a reference, we ran R2’s simulation to compute an approximate result and found that it is substantially closer to PSI’s. For the DigitRecognition benchmark, Hakaru overflowed Maple’s stack limit. Hakaru does not simplify the AddFun/max benchmark, but unlike Maple (which it uses), it does not crash.

Performance: Summed over all examples (except BayesPointMachine) for which Hakaru produced correct but possibly unsimplified results, PSI and Hakaru ran for about the same total time (8.7 s and 8.8 s, respectively). BayesPointMachine is an outlier, for which Hakaru requires 41.9 s, while PSI finds a solution in 1.24 s. Mathematica and Maple are 10–300 times slower than PSI. We present the detailed time measurements for each benchmark in Appendix E.

Fig. 7. Tracking.query2: PSI (solid; exact) and SVE (dashed).

Fig. 8. ClickGraph: PSI (solid; exact) and Infer.NET (dashed).

Fig. 9. AddFun/max: PSI (solid; exact) and Infer.NET (dashed).

5.2 Comparison with Approximate Symbolic Inference Engine

Experimental Setup: We compared PSI with SVE [43] by running posterior distribution queries on the models from the SVE distribution (commit f4cea111f7d489933b36a43c753710bd14ef9f7f). We included the models tracking (with 7 provided posterior distribution queries) and radar (with 5 posterior distribution queries). We excluded the competition model because SVE crashes on it. We did not evaluate SVE on the R2 and Infer.NET benchmarks, as SVE does not support some distributions (e.g., Beta or Gamma) and lacks support for Dirac deltas, which significantly limits its ability to represent assignment statements.

Results: PSI fully optimized the posterior distributions for all seven queries of the tracking model. PSI fully optimized one query from the radar benchmark and timed out on the remaining queries. Figure 7 presents the posterior density functions (PDFs) for one of the tracking queries. SVE’s polynomial approximation yields a less precise shape of the distribution compared to PSI.

5.3 Comparison with Approximate Numeric Inference Engines

Experimental Setup: We also compared the precision and performance of PSI’s exact inference with the approximate inference engines Infer.NET [34] and R2 [39] for a subset of their benchmarks. Specifically, we compared PSI to Infer.NET on ClickGraph, ClinicalTrial, AddFun/max, AddFun/sum, and MurderMystery and compared PSI to R2 on BurglarAlarm, CoinBias, Grass, NoisyOR, and TwoCoins. We executed both approximate engines with their default parameters.

Results: Infer.NET produces less precise approximate distributions for ClickGraph and AddFun/max (Figs. 8 and 9): its approximate inference is imprecise in representing the tails of the distributions, although the means of the two distributions are similar (e.g., they differ by \(0.7\,\%\) for both benchmarks). PSI and Infer.NET produced identical distributions for the remaining benchmarks. Because of its efficient variational inference algorithms, Infer.NET computed results 5–200 times faster than PSI. The precision loss of R2 on BurglarAlarm is \(20\,\%\) (R2’s output burglary probability is 0.0036, compared to the exact probability 0.00299). For the other benchmarks, the difference between the results of PSI and R2 is less than \(3\,\%\). The run times of PSI and R2 were similar, e.g., PSI was two times faster on TwoCoins, and R2 was two times faster on NoisyOR. We present details of the comparison in Appendix F.

The examples in Figs. 7, 8, and 9 illustrate that the choice of inference method depends on the context in which the inference results are used. While inferences about expectations in machine learning applications may often tolerate imprecision in return for faster or more scalable computation, many uses of probabilistic inference in domains such as security, privacy, and reliability engineering need to reason about a richer set of queries, while requiring correct and precise inference. We believe that the PSI system presented in this paper is particularly suited for such settings and is an important step forward in making automated exact inference feasible.

6 Related Work

This section discusses related work in symbolic inference and probabilistic program analysis.

6.1 Symbolic Inference

Graphical Models: Early research in the machine learning community focused on symbolic inference in Bayesian networks with discrete distributions [44] and combinations of discrete and linearly-dependent Gaussian distributions [11]. For more complex hybrid models, researchers proposed projecting distributions to mixtures of base functions, which can be easily integrated, such as truncated exponentials [36] and piecewise polynomials [43, 45]. In contrast to these approximate approaches, PSI’s algorithm performs exact symbolic integration.

Probabilistic Programs: Claret et al. [12] present a data flow analysis for symbolically computing exact posterior distributions for programs with discrete variables. This analysis operates on the program’s concrete states, while efficiently storing the states using algebraic decision diagrams (ADDs).

Bhat et al. [6] present a type system for programs with continuous probability distributions. This approach is extended in [7] to programs with discrete and continuous variables (but only discrete observations). Like PSI, the density compiler from [7] computes posterior distribution expressions, but instead of symbolically simplifying and removing integrals, it generates a C program that performs numerical integration (which may, in general, be expensive to run).

The Hakaru probabilistic language [10, 38] runs inference tasks by combining symbolic and sampling-based methods. To optimize MCMC sampling for probabilistic programs, Hakaru’s symbolic optimizer (1) translates the programs to probability density expressions in Maple’s language, (2) calls an extended version of Maple’s simplifier, (3) uses these results to generate an optimized Hakaru program, and, if necessary, (4) calls an MCMC sampler with the optimized program. While Maple’s expression language is more expressive than PSI’s, it also creates a more complex search space for expression optimizations. PSI further reduces the search space by optimizing expressions after each analysis step, while in Hakaru’s workflow, the distribution expression is optimized only after translating the whole program.

6.2 Probabilistic Program Analysis

Verification: Researchers presented various static analyses that verify probabilistic properties of programs, including safety, liveness, and/or expectation queries. These verification techniques have been based on abstract interpretation [14, 16, 33, 35], axiomatic reasoning [5, 29, 37], model checking [26], and symbolic execution [18, 42]. Many of the existing approaches compute exact probabilities of failure only for discrete distributions or make approximations when analyzing computations with both discrete and continuous distributions.

Researchers have also formalized fragments of probability theory inside general-purpose theorem provers, including reasoning about discrete [4, 28] and continuous distributions [17, 24]. The focus of these works is on human-guided interactive verification of (possibly recursive) programs. In contrast, PSI performs fully automated inference of hybrid discrete and continuous distributions for programs with bounded loops.

Transformation: R2 [39] transforms probabilistic programs by moving observe statements next to the sampling statement of the corresponding variable to improve performance of MCMC samplers. Gretz et al. [23] generalize this transformation to move observations arbitrarily through a program. Probabilistic program slicing [27] removes statements that are not necessary for computing a user-provided query. These transformations simplify program structure, while preserving semantics. In comparison, PSI directly transforms and simplifies the probability distribution that underlies a probabilistic program.

7 Conclusion

We presented PSI, an approach for end-to-end exact symbolic analysis of probabilistic programs with discrete and continuous variables. PSI’s symbolic nature provides the necessary flexibility to answer various queries for non-trivial probabilistic programs. More precise and reliable probabilistic inference has the potential to improve the quality of the results in various application domains and help developers when testing and debugging their probabilistic models and inference algorithms. With its rich symbolic domain and optimization engine, we believe that PSI is a useful tool for studying the design of precise and scalable probabilistic inference based on symbolic reasoning.