1 Introduction

Model checking of continuous-time Markov chains (CTMCs) against continuous stochastic logic (CSL) formulae [1, 27] has numerous applications across science and engineering. In biochemistry, there is an interest in analysing hypotheses (formulated using CSL) about reaction networks that can be adequately modelled as CTMCs governed by the Chemical Master Equation [9, 22, 28]. In engineering disciplines, CTMCs are used to study various reliability and performance aspects of computer networks [5], communication [19] and security protocols [29].

Traditionally, stochastic model checking techniques assume that model parameters – namely, the transition rate constants – are known a priori. This is often not the case, and one has to consider ranges of parameter values instead, for example, when the parameters result from imprecise measurements, or when designers are interested in finding parameter values such that the model fulfils a given specification. Such problems can be effectively formulated in the framework of parameter synthesis for CTMCs [10, 12, 24]: given a time-bounded CSL formula and a model whose transition rates are functions of the parameters, find parameter values such that the satisfaction probability of the formula meets a given threshold, or is maximised or minimised. In [10, 12] we developed synthesis algorithms that yield answers that are precise up to an arbitrarily small tolerance. The algorithms combine the computation of probability bounds with the refinement and sampling of the parameter space.

The complexity of the synthesis algorithms depends mainly on the size of the underlying model and on the number of parameter regions to analyse in order to achieve the desired precision. However, existing techniques do not scale with the model size and the dimensionality of the parameter space. For instance, as reported in [12], the synthesis of two parameters for a model with 5.1 K states requires the analysis of 5 K parameter regions and takes around 3.6 h.

In recent years, many-core graphics processing units (GPUs) have been utilised as general-purpose, high-performance processing resources in computationally intensive scientific applications. In light of this development, we redesign the synthesis algorithms using matrix-vector operations so as to ensure effective data-parallel processing and acceleration of the synthesis procedures on many-core architectures. The novelty of our approach is a two-level parallelisation scheme that distributes the workload for the processing of the state space and the parameter space, in order to optimally utilise the computational power of the GPU. The state space parallelisation builds on a sparse-matrix encoding of the underlying parametric CTMC. The parameter space parallelisation exploits the fact that our synthesis algorithms require the analysis of a large number of parameter regions during the parameter space refinement.

In this paper we present our new publicly available tool PRISM-PSY that implements the data-parallel algorithms together with a number of optimisations of the sequential algorithms, and employs the front-end of the probabilistic model checker PRISM [26]. We systematically evaluate the performance of PRISM-PSY and demonstrate the usefulness of our precise parameter synthesis methods on several case studies, including survivability analysis of the Google File System [2, 16]. Our experiments show that the data-parallel synthesis achieves on a single GPU up to a 31-fold speedup with respect to the optimised sequential implementation, and that our algorithms provide good scalability with respect to the size of the model and the number of parameter regions to analyse. As a result, PRISM-PSY enables the application of precise parameter synthesis methods to more complex problems, i.e. larger models and higher-dimensional parameter spaces.

The main contributions of this paper can be summarised as follows: (1) improvement of the sequential algorithms of [10, 12], leading in some cases to more than 10-fold speedup; (2) formulation of a backward variant of the parametric transient analysis of [10] using matrix-vector operations, which enables data-parallel implementation; (3) combination of the state space and parameter space parallelisation in order to fully utilise the available computational power; (4) development of the PRISM-PSY tool that enables precise parameter synthesis on many-core architectures; and (5) systematic experimental evaluation of the tool.

Related Work. The parameter synthesis problem for CTMCs and bounded reachability specifications was first introduced in [24], where the authors resort to the analysis of the polynomial function describing how the reachability probability depends on the parameter values. Due to the high degree of the polynomials (determined by the number of uniformisation steps), only an approximate solution is obtained through the discretisation of the parameter space.

The function describing how the satisfaction probability of a linear time-bounded formula depends on the parameter values can be approximated through statistical methods. A technique based on Gaussian Process regression is presented in [6] and implemented in the U-check tool [7]. In contrast to our approach, statistical methods cannot provide guaranteed precision, and thus are not suitable for safety-critical applications.

Parameter synthesis has also been studied for discrete-time Markovian models and unbounded temporal properties [15, 23]. The synthesis algorithms are based on constructing a rational function describing the satisfaction probability by performing state elimination. This approach is implemented in the tool PROPhESY [18] that supports incremental parameter synthesis using SMT techniques, but is not suitable for time-bounded specifications and CTMCs.

Our tool builds on methods for the efficient GPU parallelisation of matrix-vector multiplication [4] and probabilistic model checking [8, 33]. In our previous work [3], we showed how the algorithms for LTL model checking can be redesigned in order to accelerate verification on GPUs.

2 Precise Parameter Synthesis

In this section we summarise the parameter synthesis problem for CTMCs and time-bounded CSL properties originally introduced in [12]. We also describe the sequential synthesis algorithms of [10, 12] and the improvements implemented in the PRISM-PSY tool, which provide the foundation for the new data-parallel algorithms (Sect. 3) and the baseline for evaluating the parallelisation speedup.

2.1 Problem Formulation

Parametric continuous-time Markov chains (pCTMCs) [24] extend the notion of CTMCs by allowing transition rates to depend on parameters. We consider pCTMCs with a finite set of states S and a finite set K of parameters ranging over closed real intervals, i.e., \([k^{\bot },k^{\top }] \subseteq \mathbb {R}\) for \(k \in K\). These induce a rectangular parameter space \(\mathcal {P} = \times _{k \in K}\, [k^{\bot },k^{\top }]\). Subsets of \(\mathcal {P}\) are referred to as parameter regions or subspaces. Given a pCTMC and a parameter space \(\mathcal {P}\), we denote with \(\mathcal {C}_{\mathcal {P}}\) the set \( \{ \mathcal {C}_p \ | \ p \in \mathcal {P} \}\), where \(\mathcal {C}_p\) is the instantiated CTMC obtained by replacing the parameters in the parametric rate matrix \(\mathbf {R}\) with the valuation p.

In the current implementation of the tool, we support only linear rate functions of the following two forms: for any \(s,s' \in S\), \(\mathbf {R}(s, s') = \sum _{k \in K}k \cdot a_{k,s,s'}\) (parametric rate) or \(\mathbf {R}(s, s') = b_{s,s'}\) (constant rate) where \({a_{k,s,s'}, b_{s,s'} \in \mathbb {R}_{\ge 0}}\). Such rate functions are sufficient to describe a wide range of systems, from biological to computer systems, as we will show in Sect. 4.
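
For example, under mass-action kinetics (as in the epidemic model of Sect. 4.2), the reaction \(S + I \xrightarrow {k_i} I + I\) induces, in a state s with populations S and I, a transition with rate \(\mathbf {R}(s,s') = k_i \cdot (S \cdot I)\), i.e. \(a_{k_i,s,s'} = S \cdot I\), which is of the parametric form above.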

To specify properties over pCTMCs, we employ the time-bounded fragment of Continuous Stochastic Logic (CSL) [1] extended with time-bounded reward operators [27]. The current version of the tool considers only unnested formulae given by the following syntax: a state formula is \(\varPhi \,{:}{:}= P_{\sim r} [\phi ] \mid R_{\sim r}[C^{\le t}]\), and a path formula is \({\phi \,{:}{:}= \varPsi \ U^{I} \ \varPsi }\), where \(\varPsi \,{:}{:}= \mathsf {true} \mid a \mid \lnot \varPsi \mid \varPsi \wedge \varPsi \), a is an atomic proposition evaluated over states, \(\sim \ \in \left\{ <, \le , \ge , > \right\} \), r is a probability (\(r \in [0,1]\)) or reward (\(r \in \mathbb {R}_{\ge 0}\)) threshold, \(t\in \mathbb {R}_{\ge 0}\) is a time bound, and I is a time interval of \(\mathbb {R}_{\ge 0}\). The future operator, \(F^{I}\), can be derived as \(F^{I}\ \varPsi \equiv \mathsf {true} \ U^{I}\ \varPsi \). Let \(\vDash \) denote the satisfaction relation. Intuitively, a state \(s \vDash \mathsf {P}_{\sim r}[\phi ]\) iff the probability of the set of paths starting in s and satisfying \(\phi \) meets \(\sim r\). A path \(\omega = s_0t_0s_1t_1\ldots \) satisfies \(\varPsi _1\, \mathsf {U}^{I}\, \varPsi _2\) iff there exists a time \(t \in I\) such that \(\omega @ t \vDash \varPsi _2\) and \(\forall t' \in [0,t).\ \omega @ t' \vDash \varPsi _1\), where \(\omega @t\) denotes the state of \(\omega \) at time t. A state \(s \vDash \mathsf {R}_{\sim r}[\mathsf {C}^{\le t}]\) iff the expected reward cumulated along paths starting in s up to t time units satisfies \(\sim r\). We remark that the synthesis algorithms can be adapted to support the full fragment of time-bounded CSL, including nested formulae, as shown in [10].
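
For instance, the GFS requirement analysed in Sect. 4.1, namely that service level 3 is reached within 60 time units with probability at least 0.5, is expressed as \(P_{\ge 0.5}[F^{[0,60]} \ \mathsf {SL_3}]\).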

The satisfaction function captures how the satisfaction probability of a given property relates to the parameters and initial state. Let \( \phi \) be a CSL path formula, \(\mathcal {C}_{\mathcal {P}}\) be a pCTMC over a space \(\mathcal {P}\) and \(s \in S\). We denote with \(\varLambda _{\phi }: \mathcal {P}\! \rightarrow \! S \!\rightarrow \! [0,1]\) the satisfaction function such that \(\varLambda _{\phi }(p)(s)\) is the probability of the set of paths (from state s) satisfying \(\phi \) in \(\mathcal {C}_{p}\). The satisfaction function for reward formulae can be defined analogously and is omitted to simplify the presentation.

We consider two parameter synthesis problems: the threshold synthesis problem that, given a threshold \(\sim r\) and a CSL path formula \(\phi \), asks for the parameter region where the probability of \(\phi \) meets \(\sim r\); and the max synthesis problem that determines the parameter region where the probability of the input formula attains its maximum, together with probability bounds approximating that maximum. Solutions to the threshold synthesis problem admit parameter points left undecided, while, in the max synthesis problem, the actual set of maximising parameters is contained in the synthesised region. The min synthesis problem is defined and solved in a symmetric way to the max case.

For \(\mathcal {C}_{\mathcal {P}}\), \(\phi \), an initial state \(s_0\), a threshold \(\sim r\) and a volume tolerance \(\varepsilon > 0\), the threshold synthesis problem is to find a partition \(\{\mathcal {T},\mathcal {U},\mathcal {F}\}\) of \(\mathcal {P}\) such that: \(\forall p \in \mathcal {T}: \varLambda _\phi (p)(s_0) \sim r\); \(\forall p \in \mathcal {F}: \varLambda _\phi (p)(s_0) \not \sim r\); and \(\text {vol}(\mathcal {U})/\text {vol}(\mathcal {P}) \le \varepsilon \), where \(\mathcal {U}\) is the undecided subspace and \(\text {vol}(A)=\int _{A} 1 \, d\mu \) is the volume of A.

For \(\mathcal {C}_{\mathcal {P}}\), \(\phi \), \(s_0\), and a probability tolerance \(\epsilon > 0\), the max synthesis problem is to find a partition \(\{\mathcal {T}, \mathcal {F}\}\) of \(\mathcal {P}\) and probability bounds \(\varLambda _\phi ^{\bot }\) and \(\varLambda _\phi ^{\top }\) such that: \(\forall p \in \mathcal {T}: \varLambda _\phi ^{\bot } \le \varLambda _\phi (p)(s_0) \le \varLambda _\phi ^{\top }\); \(\exists p \in \mathcal {T}: \forall p'\in \mathcal {F}: \varLambda _\phi (p)(s_0) > \varLambda _\phi (p')(s_0)\); and \({\varLambda _\phi ^{\top } - \varLambda _\phi ^{\bot } \le \epsilon }\).

Figure 1 depicts an example of the threshold and max synthesis problems. On the left, the satisfaction function describes the probability of the property (y-axis) as a function of the parameter \(k_1\) (x-axis). In the centre plot, we highlight the parameter regions for which the threshold \(\ge r\) is met (\(\mathcal {T}\), green), is not met (\(\mathcal {F}\), red) and is undecided (\(\mathcal {U}\), yellow). On the right, the solution to the max synthesis problem is the region (\(\mathcal {T}\), green) containing all the maximising parameters and whose probability bounds meet the input tolerance \(\epsilon \).

Fig. 1. Left: Example of a satisfaction function. Centre: Solution of the threshold synthesis problem for \(\ge r\). Right: Solution of the max synthesis problem (Color figure online).

2.2 Solution of the Synthesis Problems

The key ingredient for solving the aforementioned synthesis problems is a procedure that takes a pCTMC \(\mathcal {C}_{\mathcal {P}}\) and a CSL path formula \(\phi \), and provides safe under- and over-approximations of the minimal and maximal probability that \(\mathcal {C}_{\mathcal {P}}\) satisfies \(\phi \): for all \(s\in S\), it computes bounds \(\varLambda _{\min }(s)\) and \(\varLambda _{\max }(s)\) such that \(\varLambda _{\min }(s) \le \inf _{p \in \mathcal {P}} \varLambda _{\phi }(p)(s) \text { and } \varLambda _{\max }(s) \ge \sup _{p \in \mathcal {P}} \varLambda _{\phi }(p)(s)\). The procedure builds on a parametric transient analysis that computes safe bounds for the parametric transient probabilities in the discrete-time process derived from the CTMC. This discretisation is obtained through standard uniformisation and the Fox and Glynn algorithm [21], which is used to derive the required number of discrete steps (also called uniformisation steps or iterations) for a given time bound. See [10, 27] for more details.

We now summarise the algorithms for threshold and max synthesis based on the partitioning and iterative refinement of the parameter space [12]. Assume a threshold synthesis problem for a path formula \(\phi \) with threshold \(\ge r\). At each step, the algorithm refines the undecided parameter subspace \(\mathcal {U}\), starting from \(\mathcal {U}=\mathcal {P}\): it generates a partition \(\mathcal {D}\) of \(\mathcal {U}\) and, for each \(\mathcal {R}\in \mathcal {D}\), computes the safe probability bounds \(\varLambda _{\min }^{\mathcal {R}}\) and \(\varLambda _{\max }^{\mathcal {R}}\) of the corresponding pCTMC \(\mathcal {C}_{\mathcal {R}}\). If \(\varLambda _{\min }^{\mathcal {R}} \ge r\), then the satisfaction of the threshold is guaranteed for the whole region \(\mathcal {R}\), which is hence added to \(\mathcal {T}\). Otherwise, the algorithm tests whether \(\mathcal {R}\) can be added to \(\mathcal {F}\) by checking if \(\varLambda _{\max }^{\mathcal {R}} < r\). If \(\mathcal {R}\) is neither in \(\mathcal {T}\) nor in \(\mathcal {F}\), it forms an undecided subspace that is added to the set \(\mathcal {U}\). If the volume tolerance \(\varepsilon \) is not met, the algorithm proceeds to the next iteration, where \(\mathcal {U}\) is further refined. The refinement procedure guarantees termination since the over-approximation \([\varLambda _{\min }^{\mathcal {R}}, \varLambda _{\max }^{\mathcal {R}}]\) can be made arbitrarily precise by reducing the volume of \(\mathcal {R}\) [13].
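
As an illustration, below is a minimal, self-contained sketch of this loop for a single parameter and threshold \(\ge r\). The function toy_bounds is a stand-in for the parametric transient analysis of Sect. 3, all names are illustrative rather than the tool's API, and, unlike the improved strategy described later in this section, the sketch bisects every undecided region instead of only the largest ones.

```c
/* Minimal sketch of the threshold refinement loop for ">= r" over one
 * parameter.  bounds(R, &lo, &hi) must return safe under-/over-
 * approximations of the satisfaction probability over the interval R. */
#include <stdio.h>

#define MAX_REGIONS 4096

typedef struct { double a, b; } Region;                 /* interval [a, b] */
typedef void (*BoundsFn)(Region, double *lo, double *hi);

static void threshold_synthesis(Region P, double r, double eps, BoundsFn bounds) {
    Region undecided[MAX_REGIONS];
    undecided[0] = P;
    int n = 1;
    double volP = P.b - P.a, volU = volP;
    while (volU / volP > eps && 2 * n <= MAX_REGIONS) {
        Region next[MAX_REGIONS];
        int m = 0;
        volU = 0.0;
        for (int i = 0; i < n; i++) {                   /* refine each region */
            double mid = 0.5 * (undecided[i].a + undecided[i].b);
            Region halves[2] = { { undecided[i].a, mid }, { mid, undecided[i].b } };
            for (int h = 0; h < 2; h++) {
                double lo, hi;
                bounds(halves[h], &lo, &hi);
                if (lo >= r)                            /* whole region satisfies */
                    printf("T: [%.4f, %.4f]\n", halves[h].a, halves[h].b);
                else if (hi < r)                        /* whole region violates  */
                    printf("F: [%.4f, %.4f]\n", halves[h].a, halves[h].b);
                else {                                  /* still undecided        */
                    next[m++] = halves[h];
                    volU += halves[h].b - halves[h].a;
                }
            }
        }
        for (int i = 0; i < m; i++) undecided[i] = next[i];
        n = m;
    }
}

/* Toy stand-in: pretend the satisfaction function is p -> p^2, which is
 * monotone, so exact bounds over [a, b] are [a^2, b^2]. */
static void toy_bounds(Region R, double *lo, double *hi) {
    *lo = R.a * R.a;
    *hi = R.b * R.b;
}

int main(void) {
    Region P = { 0.0, 1.0 };
    threshold_synthesis(P, 0.5, 0.05, toy_bounds);      /* epsilon = 5% */
    return 0;
}
```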

In the max synthesis case, the algorithm starts from \(\mathcal {T}=\mathcal {P}\) and iteratively refines \(\mathcal {T}\) until the tolerance \(\epsilon \) is met. Let \(\mathcal {D}\) be the partition of \(\mathcal {T}\) at a generic step. The algorithm rules out from \(\mathcal {T}\) subspaces that are guaranteed to be in \(\mathcal {F}\), by deriving an under-approximation M of the maximum satisfaction probability: for \(\mathcal {R}\in \mathcal {D}\), \(\varLambda _{\max }^{\mathcal {R}} < M\) implies that \(\mathcal {R}\) is in \(\mathcal {F}\). M is derived by sampling a set of parameter values from the region \(\mathcal {R}\) with the highest \(\varLambda _{\min }^{\mathcal {R}}\) and taking the maximum of the satisfaction function over these values.

Improvements on the Sequential Algorithms. The PRISM-PSY tool introduces several improvements on the prototype implementations used in [10, 12]. Here we present those having the most significant impact on performance.

(1) Backward computation of probability bounds. In [10, 12], the probability bounds \(\varLambda _{\min }\) and \(\varLambda _{\max }\) are computed using a forward variant of the parametric transient analysis, which requires a separate computation of the bounds for each initial state. In our tool, we implemented a more efficient solution, based on backward computation, that requires only a single computation for all states.

(2) Adaptive Fox-Glynn. While in previous implementations the number of uniformisation steps was fixed and obtained using the maximum exit rate (the sum of the outgoing rates of a state) over the whole parameter space, the adaptive Fox-Glynn technique computes the number of steps for each subregion separately, using the maximum exit rate of the inspected subregion (see the sketch after this list). For large parameter spaces, this technique can significantly decrease the overall number of uniformisation steps, improving performance by more than a factor of two.

(3) Refinement Strategies. The tool employs improved refinement algorithms that can decrease the total number of subregions to analyse. Specifically, for threshold synthesis, at each step only the undecided subregions with the largest volume are refined, while, for max synthesis, only the regions with either the lowest lower probability bound (\(\varLambda _\phi ^{\bot }\)) or the highest upper bound (\(\varLambda _\phi ^{\top }\)) are refined.
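
The adaptive technique of item (2) exploits the fact that all rate coefficients \(a_{k,s,s'}\) and \(b_{s,s'}\) are non-negative (cf. Sect. 2.1), so the maximal exit rate over a subregion \(\mathcal {R}\) is attained at its upper parameter bounds \(k^{\top }\). A minimal sketch of the per-region uniformisation rate, with the convention that absent coefficients are zero:

$$\begin{aligned} q_{\mathcal {R}} \;=\; \max _{s \in S} \sum _{s' \ne s} \Big ( b_{s,s'} + \sum _{k \in K} k^{\top } \cdot a_{k,s,s'} \Big ) \end{aligned}$$

Since the number of uniformisation steps returned by the Fox and Glynn algorithm grows roughly linearly with \(q_{\mathcal {R}} \cdot t\), subregions with smaller maximal exit rates require correspondingly fewer iterations.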

3 Data-Parallel Algorithms for Parameter Synthesis

In this section we first introduce the basic concepts of the target hardware architecture, i.e. modern general-purpose GPUs. We then formulate the backward variant of the parametric transient analysis using matrix-vector operations, and describe the sparse-matrix representation of pCTMCs. Finally, we present a two-level parallelisation of the synthesis algorithms. A detailed description of the data-parallel algorithms for parameter synthesis can be found in [30].

3.1 Computational Model for Modern GPUs

Typical GPUs consist of multiple Streaming Multiprocessors (SMs), with each SM following a single instruction multiple threads (SIMT) model. This approach establishes a hierarchy of threads prior to the actual computation. Within this hierarchy, threads are arranged into blocks that are assigned for parallel execution on SMs. The hardware groups threads into warps of 32, which form the basic scheduling unit and execute instructions in lock-step. If a sufficient number of threads is dispatched, each SM maintains a set of active warps to hide the memory access latency and maximise the utilisation of its functional units.

The SIMT approach supports code divergence within threads of the warp, but this usually causes significant performance degradation due to the serialisation of the execution. Another characteristic of GPUs that significantly affects their performance is the way in which simultaneous memory requests from multiple threads in a warp are handled. Requests exhibiting spatial locality are maximally coalesced. Simply stated, accesses to consecutive addresses are served by a single memory fetch as long as they are in the same memory segment.

A typical GPU program consists of a host code running on the CPU and a device code running on the GPU. The device code is structured into kernels that execute the same scalar sequential program in many independent data-parallel threads. The combination of an out-of-order CPU and a data-parallel GPU allows for heterogeneous computation.

3.2 Backward Computation of Probability Bounds

For a pCTMC \(\mathcal {C}_{\mathcal {R}}\) over a region \(\mathcal {R}\) and a target set \(A \subseteq S\), the parametric backward analysis computes a series of vectors \(\sigma _{i}^{\min }\) and \(\sigma _{i}^{\max }\) such that, for all \(s\in S\), \({\sigma _{i}^{\min }(s) \le \inf _{p \in \mathcal {R}} \sigma _{i,p}(s)}\) and \(\sigma _{i}^{\max }(s) \ge \sup _{p \in \mathcal {R}} \sigma _{i,p}(s)\), where \(\sigma _{i,p}(s)\) is the probability that, starting from state s, a state in A is reached after i discrete steps in \(\mathcal {C}_p\). From these vectors, the probability bounds \(\varLambda _{\min }(s)\) and \(\varLambda _{\max }(s)\) are computed in a similar way to non-parametric CTMCs [27].

We define a matrix-vector operator \({\diamond }\) that computes the vector \(\sigma _{i+1}^{\max }\) from \(\sigma _{i}^{\max }\) and the parametric rate matrix \(\mathbf {R}\) as \(\sigma _{i+1}^{\max }(s) = (\mathbf {R} \diamond \sigma _{i}^{\max })(s)\), where \(\sigma _{0}^{\max }(s) = 1\) if \(s \in A\) and 0 otherwise. An analogous operator can be defined for \(\sigma _{i+1}^{\min }\). Similarly to standard uniformisation, the definition of \({\diamond }\) exploits the uniformised matrix, which is, in our case, parametric. For each \(s \in S\), \(\sigma _{i+1}^{\max }(s)\) is obtained by maximising the probability in s stepwise, i.e. after the i-th step. Below, we expand the definition of the uniformised matrix using the uniformisation rate q, given by the maximal exit rate and the time bound [21, 27]:

$$\begin{aligned} \sigma _{i+1}^{\max }(s) = \max _{p \in \mathcal {R}} \left( \sum _{s' \in S \setminus \{s\}} \sigma _{i}^{\max }(s')\,\frac{\mathbf {R}_p(s, s')}{q} \;+\; \sigma _{i}^{\max }(s)\left( 1 - \sum _{s' \in S \setminus \{s\}} \frac{\mathbf {R}_p(s, s')}{q}\right) \right) \end{aligned}$$
(1)

where \(\mathbf {R}_p\) is the rate matrix instantiated with the parameter valuation p. The first term represents the probability of entering a state \(s'\ne s\) and, from there, reaching A in i steps; the second term is the probability of staying in s and, from there, reaching A in i steps. By expanding the parametric rate matrix \(\mathbf {R}\) in Eq. 1 we get:

$$\begin{aligned} \sigma _{i+1}^{\max }(s) \;=\; \sigma _{i}^{\max }(s) \;+ \sum _{s' \ne s,\ \mathbf {R}(s,s')\ \text {parametric}} \big (\sigma _{i}^{\max }(s') - \sigma _{i}^{\max }(s)\big )\, \frac{\sum _{k \in K} k^{\star } \cdot a_{k,s,s'}}{q} \end{aligned}$$
(2)
$$\begin{aligned} +\; \sum _{s' \ne s,\ \mathbf {R}(s,s')\ \text {constant}} \big (\sigma _{i}^{\max }(s') - \sigma _{i}^{\max }(s)\big )\, \frac{b_{s,s'}}{q} \end{aligned}$$
(3)

where \(k^{\star } = k^{\top }\) if \(\sigma _{i}^{\max }(s') > \sigma _{i}^{\max }(s)\), and \(k^{\star } = k^{\bot }\) otherwise. These equations allow us to compute the vector \(\sigma _{i+1}^{\max }\) using matrix-vector operations, as shown in the implementation of \({\diamond }\) in Algorithm 1. Note that Eq. 2 is used when the transition from s to \(s'\) has a parametric rate, while Eq. 3 is used when it has a constant rate.

An approximation error is introduced because \(\sigma _{i+1}^{\max }\) is computed by optimising \(\sigma _{i+1,p}\) locally, i.e. at each step and at each state, and the error accumulates at each uniformisation step. We examine this error and its convergence in [13]. The forward variant of the parametric transient analysis can also be formulated using a vector-matrix operator [10], but the resulting code has more complex control flow and higher branch divergence, which makes parallelisation less efficient.

3.3 Sparse-Matrix Representation of Parametric CTMCs

We introduce a sparse-matrix representation of parametric CTMCs that allows us to implement the operator \({\diamond }\) in such a way that the resulting program has a similar control flow and memory access pattern as the standard matrix-vector multiplication, for which efficient data-parallel implementations exist [4, 8, 33].

We represent the data in a compact format based on the compressed sparse row (CSR) matrix format. The CSR format stores only the non-zero values of the rate matrix \(\mathbf {R}\) using three arrays: non-zero values, their column indices, and row beginnings. The CSR format is also used in the PRISM tool as the fastest explicit representation for CTMCs [26].
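
As a small illustration (the matrix values are hypothetical), the CSR encoding of a \(3\times 3\) matrix with four non-zero entries is:

```c
/* Hypothetical 3x3 rate matrix and its CSR encoding:
 *     ( 0    1.5  0   )
 *     ( 2.0  0    0.5 )
 *     ( 0    3.0  0   )                                                   */
double val[]    = { 1.5, 2.0, 0.5, 3.0 };  /* non-zero values, row by row  */
int    col[]    = { 1,   0,   2,   1   };  /* column index of each value   */
int    rowBeg[] = { 0, 1, 3, 4 };  /* row i owns val[rowBeg[i]..rowBeg[i+1]-1] */
```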

To handle the non-parametric transitions separately and more efficiently, we decompose \(\mathbf {R}\) into a non-parametric matrix, stored in the CSR format, and a parametric matrix. To enable an efficient data-parallel implementation of the operator \({\diamond }\), for a region \(\mathcal {R}\) and for each parametric transition rate \(\mathbf {R}(s,s')\), two quantities are stored: \(r_{s,s'}^{\bot } = \sum _{k \in K}k^{\bot } \cdot a_{k,s,s'}\) and \(r_{s,s'}^{\top } = \sum _{k \in K}k^{\top } \cdot a_{k,s,s'}\). From Eq. 2, it is enough to test \({\sigma _{i}^{\max }(s') - \sigma _{i}^{\max }(s) > 0}\) to decide whether to use \(r_{s,s'}^{\top }\) or \(r_{s,s'}^{\bot }\) in the multiplication, as illustrated in Algorithm 1.
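
The following is a minimal sequential sketch of one application of \({\diamond }\) to the parametric matrix, realising Eq. 2 with this representation. The function and array names are illustrative rather than the tool's actual code; the constant-rate matrix (Eq. 3) would be handled by an analogous loop with a single value array.

```c
/* One backward step for the parametric part: sigmaNext = R <> sigma.
 * valTop[j] and valBot[j] hold r^top and r^bot for the j-th transition. */
void diamond_max(int nStates, const int *rowBeg, const int *col,
                 const double *valTop, const double *valBot,
                 double q, const double *sigma, double *sigmaNext) {
    for (int s = 0; s < nStates; s++) {
        double acc = sigma[s];
        for (int j = rowBeg[s]; j < rowBeg[s + 1]; j++) {
            double diff = sigma[col[j]] - sigma[s];
            /* Eq. 2: take k^top when the successor bound is higher,
             * otherwise k^bot (cf. the definition of k^*). */
            double r = (diff > 0.0) ? valTop[j] : valBot[j];
            acc += diff * r / q;
        }
        sigmaNext[s] = acc;
    }
}
```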

In the parallel version, we provide an additional implementation using data structures based on the ELLPACK (ELL) sparse matrix representation [4]. The advantage of ELL over CSR is that it provides a single-stride aligned access to the data arrays, meaning that memory accesses within a single warp are reasonably coalesced. ELL yields better performance than CSR for some problems.
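
For comparison, a sketch of the same step in the ELL layout we assume (the tool's padding details may differ); the code is sequential, but the column-major index arithmetic is what yields coalesced accesses on the GPU:

```c
/* ELL variant: rows padded to the maximum row length L; value and column
 * arrays stored column-major, so consecutive states read consecutive
 * addresses.  Padding entries carry value 0 and thus contribute nothing. */
void diamond_max_ell(int nStates, int L, const int *col,
                     const double *valTop, const double *valBot,
                     double q, const double *sigma, double *sigmaNext) {
    for (int s = 0; s < nStates; s++) {
        double acc = sigma[s];
        for (int j = 0; j < L; j++) {
            double diff = sigma[col[j * nStates + s]] - sigma[s];
            double r = (diff > 0.0) ? valTop[j * nStates + s]
                                    : valBot[j * nStates + s];
            acc += diff * r / q;
        }
        sigmaNext[s] = acc;
    }
}
```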

3.4 GPU Parallelisation

We implemented PRISM-PSY in Open Computing Language (OpenCL) [32]. In contrast to other programming frameworks, OpenCL supports multiple platforms and GPUs, and thus provides better portability. Moreover, its performance is comparable with that of specialised frameworks (e.g. CUDA [20]).

The synthesis algorithms are executed in a heterogeneous way. The sequential refinement procedure is executed on the CPU. For each parameter region \(\mathcal {R}\) to analyse, the CPU prepares a kernel that computes the probability bounds \(\varLambda _{\min }^{\mathcal {R}}\) and \(\varLambda _{\max }^{\mathcal {R}}\) on the GPU, based on the backward parametric transient analysis described above. Following [4, 8, 33], we implement a state space parallelisation, i.e. a single row of the rate matrix (corresponding to the processing of a single state) is mapped to a single computational element. Note that the models we consider typically have a balanced distribution of the state successors, and thus yield a balanced distribution of non-zero elements in the rows of the matrix. This ensures a good load balancing within the warps and blocks.

In the case of models with small numbers of states, this parallelisation is not able to efficiently utilise all computational elements, since some of them will be idle during the kernel execution. To overcome this potential performance degradation, we combine state space parallelisation with parameter space parallelisation that computes the probability bounds for multiple parameter regions in parallel. As demonstrated in the experimental evaluation (Sect. 4), this two-level parallelisation significantly improves performance on small models. In many cases, this solution can improve the runtime of large models too, because it allows the thread scheduler to better hide memory latency.

Since parallel kernel execution is not supported by many GPU devices, or may significantly decrease performance, we provide a way to perform, in a single compute kernel, multiple matrix-vector operations over multiple parameter regions. The solution exploits the fact that, in our case, the rate matrices for different regions have the same structure and differ only in the values of \(r_{s,s'}^{\top }\) and \(r_{s,s'}^{\bot }\). We extend the sparse-matrix representation of the pCTMC and store the values \(r_{s,s'}^{\bot }\) and \(r_{s,s'}^{\top }\), as well as \(\sigma _{i}^{\max }(s)\) and \(\sigma _{i+1}^{\max }(s)\), for all regions. This allows us to utilise \(m\cdot |S|\) computational elements for m parallel regions. Algorithm 1 illustrates the kernel for the two-level parallelisation using the CSR format. The vectors \(\texttt {matRowBeg}\) and \(\texttt {matCol}\), shared by all regions, keep the row beginnings and column indices, respectively. The vectors \(\texttt {matValTop}\) and \(\texttt {matValBot}\) keep the non-zero values \(r_{s,s'}^{\top }\) and \(r_{s,s'}^{\bot }\), respectively. We store only the current \(\sigma _{i}^{\max }\) and \(\sigma _{i+1}^{\max }\), using two vectors that are swapped between iterations.

Importantly, merging the computation for multiple regions requires modifying the adaptive Fox-Glynn technique to use the highest number of uniformisation steps among them. This means that the benefits of adaptive Fox-Glynn diminish as the number of subspaces processed in parallel grows.

Algorithm 1. The data-parallel kernel implementing the operator \({\diamond }\) over multiple parameter regions (CSR format).
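
A simplified OpenCL C sketch of such a kernel is given below. It assumes the CSR variant, one work-item per (region, state) pair and a uniformisation rate q shared by the merged regions (cf. the remark above); the names mirror the vectors described in the text, but the actual PRISM-PSY kernel differs in details.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* One backward step for m regions at once: work-item gid handles state
 * s = gid % nStates of region gid / nStates. */
__kernel void diamond_multi(const int nStates,
                            __global const int    *matRowBeg,
                            __global const int    *matCol,
                            __global const double *matValTop, /* r^top, per region */
                            __global const double *matValBot, /* r^bot, per region */
                            const double q,
                            __global const double *sigmaIn,
                            __global double       *sigmaOut)
{
    int gid    = get_global_id(0);        /* 0 .. m * nStates - 1       */
    int region = gid / nStates;
    int s      = gid - region * nStates;
    int off    = region * nStates;        /* this region's sigma block  */
    int nnz    = matRowBeg[nStates];      /* non-zeros per region       */

    double acc = sigmaIn[off + s];
    for (int j = matRowBeg[s]; j < matRowBeg[s + 1]; j++) {
        double diff = sigmaIn[off + matCol[j]] - sigmaIn[off + s];
        double r = (diff > 0.0) ? matValTop[region * nnz + j]
                                : matValBot[region * nnz + j];
        acc += diff * r / q;
    }
    sigmaOut[off + s] = acc;
}
```

The host then enqueues this kernel with a global work size of \(m\cdot |S|\) and swaps sigmaIn and sigmaOut between iterations.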

4 Experimental Evaluation

In this section we evaluate the performance of the data-parallel synthesis algorithms on case studies of biological and computer systems. We discuss how model features affect parallelisation and show that parameter synthesis can be meaningfully employed to analyse various requirements, ranging from quality of service to the reliability of synthetic biochemical networks.

All the experiments were run on a 4-core Linux workstation with an AMD Phenom™ II X4 940 processor @ 3 GHz, 8 GB DDR2 @ 1066 MHz RAM and an NVIDIA GeForce GTX 480 GPU with 1.5 GB of GPU memory. The GPU has 14 SMs, each with 32 cores and the capability to maintain up to 48 active warps. Therefore, the GPU can simultaneously maintain and schedule up to 21,504 active threads to maximise the utilisation of its computational elements.

In the following, Java denotes the optimised sequential implementation and \(\textsc {Csr}_n\) (\(\textsc {Ell}_n\)) the data-parallel implementation based on CSR (ELL) with n subregions processed in parallel. We report only an approximate number of final subregions, since it differs slightly across some experiments due to parallel processing. We also report results for the parameter space parallelisation only up to the point where the best performance is reached.

4.1 Google File System

We consider the performance evaluation case study of the replicated file system used in the Google search engine, known as the Google File System (GFS). The model was first introduced as a generalised stochastic Petri net (GSPN) [16] and then translated to a CTMC [2]. Previous work on the model focused on survivability analysis, i.e. the ability of the system to recover from disturbances or disasters, and considered all model parameters to be fixed. Here, we work with a pCTMC model and show how parameter synthesis can be used to examine survivability.

Fig. 2. Google File System from [2, 16]. Transitions can be immediate (grey) or timed (white).

Figure 2 illustrates the GSPN model of the GFS. Default values for stochastic rates can be found in [2, 16]. Files are divided into chunks of equal size. Each chunk exists in several copies, located in different chunk servers. There is one master server that is responsible for keeping the locations of the chunk copies, monitoring the chunk servers and replicating the chunks. The master can be: up and running (token at \(\mathsf {M\_up}\)); or failed (\(\mathsf {M1}\)), due to a software (\(\mathsf {M\_soft\_d}\)) or hardware (\(\mathsf {M\_hard\_d}\)) problem. The model reproduces the life-cycle of a single chunk: the numbers of available and lost copies are given by places \(\mathsf {R\_present}\) and \(\mathsf {R\_lost}\), respectively. Lost chunks are replicated through transition \(\mathsf {replicate}\). R is the maximum number of copies. We consider M chunk servers whose behaviour is analogous to that of the master. When a chunk server fails, a chunk copy is lost (\(\mathsf {destroy}\)) or not (\(\mathsf {keep}\)) depending on whether the server is storing the single chunk under consideration. We set \(M=60\) and \(R=3\), yielding a model with 21.6 K states and 145 K transitions.

We first formulate a threshold synthesis problem for the CSL formula \(\phi _{\mathsf {GFS_1}} = F^{[0,60]} \ \mathsf {SL_3}\), where \(\mathsf {SL_3}= (\mathsf {M\_up} = 1\) and \(\mathsf {R\_present} \ge 3)\) is the QoS requirement that the master is running and at least three chunk copies are available (service level 3). The initial state models a severe hardware disaster: all the servers are down due to hardware failures (\(\mathsf {C\_hard\_d} = M\) and \(\mathsf {M\_hard\_d} = 1\)) and all the chunk copies have been lost (\(\mathsf {R\_lost} = R\)). We are interested in synthesising the values of the parameter \(\mathsf {c\_hard\_re}\), that is, the rate at which chunk servers are repaired from hardware failure. Importantly, \(\mathsf {c\_hard\_re}\) can actually be controlled, e.g. by increasing the frequency of technical interventions. Figure 3(a) illustrates the synthesis results for \(\mathsf {c\_hard\_re} \in [0.5,2]\) and probability threshold \(\ge 0.5\). The property is met for any \(\mathsf {c\_hard\_re}\) above 1 and, in particular, \(\mathsf {SL_3}\) is reached with high probability for repair rates above 1.25.

Fig. 3. Synthesis results for the GFS model. Each box denotes a parameter region (width and depth) and its probability bounds (height). Colour code as in Fig. 1. (a) Threshold synthesis, property \(\phi _{\mathsf {GFS_1}}\), threshold \(\ge 0.5\) (dashed line) and volume tolerance \({\varepsilon = 0.01}\). (b) Max synthesis, property \(\phi _{\mathsf {GFS_2}}\) and probability tolerance \({\epsilon = 0.01}\). (c) Threshold synthesis, property \(\phi _{\mathsf {GFS_2}}\), threshold \(\ge 0.5\) (semi-transparent plane) and \({\varepsilon = 0.1}\). Parameter domains are \(\mathsf {c\_hard\_re} \in [0.5,2]\) (a,b,c) and \(\mathsf {c\_fail} \in [0.01, 1]\) (c). Numbers of final regions: 8 (a), 24 (b) and 136 (c).

Table 1. Performance of the GFS model: 21.6K states, 145K transitions, and \(\le \)47K iterations per subregion. Details of the synthesis problems are reported in Fig. 3.

We now evaluate a property requiring that \(\mathsf {SL_3}\) is reached strictly within the time interval [40, 60]: \(\phi _{\mathsf {GFS_2}} = \lnot \mathsf {SL_3}\ U^{[40,60]} \ \mathsf {SL_3}\). Although one generally seeks to reach the required QoS as soon as possible, this property is useful in scenarios such as planned downtime, where the service need not be up before the time scheduled for maintenance. In Fig. 3(b), we report the results of max synthesis for the parameter \(\mathsf {c\_hard\_re}\). The maximising parameters (indicated with a black arrow) are found in the region approximately given by \(\mathsf {c\_hard\_re} \in [1.2, 1.23]\), since for high repair rates \(\mathsf {SL_3}\) is reached too early.

In the last experiment, we introduce one additional parameter, \(\mathsf {c\_fail}\), i.e. the rate at which any failure (hardware or software) occurs in a chunk server. Since the GFS is designed to run on cheap commodity hardware, this rate can be controlled indirectly through the reliability of the machines used. We consider a threshold synthesis problem with property \(\phi _{\mathsf {GFS_2}}\) and threshold \(\ge 0.5\). The results in Fig. 3(c) show that, interestingly, the satisfaction probability is almost independent of the failure rate, except when \(\mathsf {c\_fail}\) approaches 1, where slightly higher repair rates are needed.

Table 1 reports the performance of the tool on the above experiments, namely, the speedup achieved by the data-parallel algorithms. Although the state space parallelisation utilises the GPU sufficiently (enough threads are dispatched), the parameter space parallelisation further improves performance, providing up to 24-fold and 16-fold speedup with respect to the sequential algorithm for threshold and max synthesis, respectively. The efficiency of the parameter space parallelisation depends on the effective usage of GPU resources, and thus the speedup does not scale linearly with the number of regions processed in parallel. In this case the adaptive Fox-Glynn technique does not bring any benefit, since the parameters we analyse do not affect the maximal exit rate.

4.2 Epidemic Model

We further consider the stochastic epidemic model we analysed in [12] using the prototype implementation, in order to evaluate the enhancements of the sequential implementation presented in this paper. It describes the epidemic dynamics of susceptible (S), infected (I) and recovered (R) individuals using the following biochemical reaction network with mass-action kinetics: \({S + I \xrightarrow {k_i} I + I}\) and \(I \xrightarrow {k_r} R\). With a total population of 100 individuals and initial state \(S = 95, I = 5\) and \(R = 0\), the model has 5.1K states and 10K transitions. We consider the same max synthesis problem as in [12]: parameter space \(\mathcal {P}_{\mathsf {SIR}} = k_i \times k_r \in [0.005, 0.3] \times [0.005, 0.2]\) and property \(\phi _{\mathsf {SIR}}(t_1,t_2) = (I > 0) \ U^{[t_1,t_2]} \ (I = 0)\), expressing that the infection lasts at least \(t_1\) time units but dies out before time \(t_2\). As shown in [12], for \(t_1=100\) and \(t_2=120\), the prototype implementation produced around 5K final parameter subspaces and required 3.6 h.

Table 2. Max synthesis for the epidemic model: parameter space \(\mathcal {P}_{\mathsf {SIR}}\) and probability tolerance \(\epsilon = 1\,\%\). \(\phi _{\mathsf {SIR}}(100,120)\): 5.1K states, 10K transitions, \(\le \)3.1K iterations per region and \(\sim \)3.1K final regions. \(\phi _{\mathsf {SIR}}(100,200)\): 20K states, 40K transitions, \(\le \)12K iterations per region and \(\sim \)826 final regions (depicted on the right).

Table 2 (left) lists the results obtained with PRISM-PSY on the same synthesis problem. The optimised sequential implementation is about 14-fold faster (918 s vs 3.6 h). This significant acceleration is explained by: the more sophisticated refinement strategy for max/min synthesis, which reduces the number of final regions to 3K (\(\sim \)2-fold speedup); the adaptive Fox-Glynn technique, which reduces the number of iterations (\(\sim \)2.5-fold speedup); and more efficient data structures that accelerate both the computation of the probability bounds and the refinement procedure (\(\sim \)3-fold speedup). Note that the actual benefits of these enhancements depend essentially on the structure of the model and the synthesis problem. As the epidemic model is relatively small, the state space parallelisation is not able to sufficiently utilise the GPU, and thus the \(\textsc {Csr}_1\) implementation provides only a 2.5-fold speedup. The parameter space parallelisation further improves the speedup to 5.6 (\(\textsc {Ell}_{128}\) implementation).

We now consider a more complicated variant of the problem, where we double the population size and extend the time horizon to \(t_2=200\). Results are presented in Table 2 (middle). In this case, the state space parallelisation sufficiently utilises the GPU and, for \(\textsc {Csr}_{1}\), we obtain a 10.3-fold speedup. On the other hand, the parameter space parallelisation reduces the benefits of the adaptive Fox-Glynn technique, and thus overall performance is improved only slightly. Table 2 (right) depicts the results of max synthesis for the larger variant.

4.3 Signalling in Prokaryotic Cells

This model was introduced in [14, 31] and describes a two-component signalling pathway in prokaryotic cells, with two signalling components in both phosphorylated and dephosphorylated forms: the histidine kinase H and Hp, and the response regulator R and Rp. In this case, parameter synthesis is computationally very demanding, since the model has 116K states and 954K transitions. We consider a threshold synthesis problem that requires a relatively small number of refinements, in order to demonstrate the benefits of the state space parallelisation. We synthesise the production and degradation rates (prodR and degrR) of R such that the input noise of the response regulators, defined as the quadratic deviation from the average population, is below 9 at least 80\(\%\) of the time. This can be formalised as the cumulative reward property \(\varPhi _{\mathsf {SIG}} = R_{\ge 0.8t} [C^{\le t}]\), where the reward of a state is 1 if it satisfies \((R+Rp-avg)^2 < 9\) and 0 otherwise. We consider \(t=10\), \(avg=30\) and parameter space \(\mathcal {P}_{\mathsf {SIG}} = prodR \times degrR \in [0.1,0.9] \times [0.005,0.02]\), which reflects the setting in [14].

Table 3. Threshold synthesis for the signalling model: property \(\varPhi _{\mathsf {SIG}}\), parameter space \(\mathcal {P}_{\mathsf {SIG}}\) and \(\varepsilon = 9\,\%\). The variant from [14]: 116K states, 954K transitions, \(\le \)19K iterations per region and \(\sim \)70 final regions (depicted on the right). The larger variant: 424K states, 3.6M transitions, \(\le \)24K iterations per region and \(\sim \)67 final regions.

As shown in Table 3, a speedup of up to 24.7 is obtained using \(\textsc {Ell}_{4}\), which further improves the GPU utilisation and the memory access pattern of the pure state space parallelisation. We also consider a larger variant of the model (about 3.6 times larger), for which we obtain an even better speedup (up to 31.1-fold), demonstrating good scalability of the data-parallel algorithms. Table 3 (right) depicts the synthesis results for the small variant, evidencing the non-monotonicity of the satisfaction function for the reward property.

4.4 Approximate Majority

The next model describes a chemical reaction network that computes the approximate majority – the asymptotically fastest way for a population to approximate a common decision [11]. We consider the network \(AM_{3,3}\,\#39\): \(A + B \xrightarrow {k_1 = 92.9} X + X\), \(A+X \xrightarrow {k_2 = 26.2} A + A\) and \( B+X \xrightarrow {k_3 = 23.3} B + B\), synthesised in [17] as the best network utilising only 3 species. The structure of the network was synthesised using an approach based on bounded model checking, and the kinetic parameters were estimated by Monte Carlo-based optimisation.

Fig. 4. Threshold synthesis for the approximate majority model and property \(\varPhi _{\mathsf {AM}}\).

As in [17], we consider small numbers of input molecules (\(A=10\), \({B=4}\) and \({X=0}\)), and thus the model has only 120 states and 273 transitions. This experiment, in contrast to the previous case studies, allows us to demonstrate the performance of our tool on very small models. We synthesised the parameters such that the probability of the correct decision being made after 100 time units is at least \(95\,\%\). The property is formalised as \(\varPhi _{\mathsf {AM}} = P_{\ge 0.95} [ F^{[100,100]} \ \varPsi _{correct}]\) and the parameter space is \(\mathcal {P}_{\mathsf {AM}} = k_1 \times k_2 \times k_3 \in [1,100]^3\).

Table 4. Threshold synthesis for the approximate majority: property \(\varPhi _{\mathsf {AM}} \), parameter space \(\mathcal {P}_{\mathsf {AM}}\) and \({\varepsilon = 10\,\%}\). 120 states, 273 transitions, \(\le \) 700K iterations per region and \({\sim }911\) final regions.

Due to the low number of states, the state space parallelisation utilises only a small portion of the computational elements of the GPU. Therefore, the GPU parallelisation using a small number of parallel parameter regions slows down the computation, as shown in Table 4. Increasing the number of parallel regions (up to 128) improves the GPU utilisation, and hence performance, yielding up to 5.7-fold speedup.

This experiment also demonstrates that, in contrast to Monte Carlo-based optimisation, precise parameter synthesis provides detailed information about the impact of the parameters on the probability of a correct decision, as shown in Fig. 4. Note that the backward transient analysis implemented in our tool computes the probability bounds for all the reachable states, in this case all the inputs satisfying \({A+B=14}\).

4.5 Workstation Cluster

Finally, we consider a model describing a cluster of workstations consisting of two sub-clusters with N workstations in each, connected in a star topology [25, 34]. Both sub-clusters have their own switch that connects the workstations in the sub-cluster with a central backbone. The cluster maintains the minimum quality of service if at least 75 % of the workstations are operational and connected. We assume that one can control the workstation inspection (\(ws\_check\)), repair (\(ws\_repair\)) and failure (\(ws\_fail\)) rates.

Table 5. Threshold synthesis for the cluster model: property \(\varPhi _{\mathsf {CLU}}\), parameter space \(\mathcal {P}_{\mathsf {CLU}} \) and \(\varepsilon = 10\,\%\). 86K states, 415K transitions, \(\le \)9K iterations per region and \({\sim }273\) final regions.

We synthesise the parameters such that the minimum quality of service is violated for at most 0.1% of the time. This is formalised as \(\varPhi _{\mathsf {CLU}} = R_{\le 0.1 \cdot t} [C^{\le t}]\), where a reward of 1 is associated with states in which the minimum quality of service is not provided. In this experiment, we use \(t=100\) and parameter space \({\mathcal {P}_{\mathsf {CLU}} = ws\_check \times ws\_repair \times ws\_fail \in [5,20]\times [0.5,5] \times [0.001,0.02]}\).

Table 5 presents the results for \({N = 48}\) (86K states and 415K transitions). In this model, the parameter space parallelisation considerably reduces the benefits of the adaptive Fox-Glynn technique, and thus the \(\textsc {Csr}_{1}\) implementation provides the best performance, leading to a 19-fold speedup.

4.6 Result Analysis

State space parallelisation improves the scalability of the computation with respect to the model size. The experiments demonstrate that the speedup compared to the sequential baseline tends to improve with the number of states (see Tables 2 and 3). On the other hand, for smaller models, an insufficient number of threads is dispatched, leading to performance degradation (see Table 4). In most cases, the ELL format moderately outperforms the CSR format, and it also works better with the parameter space parallelisation due to the more coalesced memory access pattern. Since the refinement procedure (running solely on the CPU) is more complicated for max/min synthesis, the overall speedup for these instances is lower than for threshold synthesis.

Parameter space parallelisation allows us to efficiently utilise the GPU even for small models. It scales well until the maximal number of active threads that can be dispatched is reached (see Table 4). In practice, performance usually improves even beyond this point, since the parameter space parallelisation can improve memory access locality. On the other hand, it mitigates the advantage of the adaptive Fox-Glynn technique, which can lead to performance degradation, as reported in Table 5. Importantly, PRISM-PSY can also be configured to perform this parallelisation using multi-core CPUs (not discussed here).

Our experiments clearly indicate that the tool is able to provide good scalability with respect to the number of computational elements. Since the two-level parallelisation can tune the GPU utilisation for various synthesis problems, we expect that the execution of the tool on new generations of GPUs with a larger number of cores will lead to a further improvement in acceleration.

5 Conclusion

We have introduced the tool PRISM-PSY, which performs precise parameter synthesis for CTMCs and time-bounded specifications. In order to overcome the high computational demands, we have developed data-parallel versions of the algorithms, allowing us to significantly accelerate synthesis on many-core GPUs. As a result, the tool provides up to a 31-fold speedup with respect to the optimised sequential implementation, and thus considerably extends the applicability of precise parameter synthesis. In future work we will extend the tool to support the full fragment of time-bounded CSL and multi-affine rate functions [13].