Property-Based Testing for Parameter Learning of Probabilistic Graphical Models

Saranti, Anna; Taraghi, Behnam; Ebner, Martin; Holzinger, Andreas

doi:10.1007/978-3-030-57321-8_28

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 12279))

Included in the following conference series:

International Cross-Domain Conference for Machine Learning and Knowledge Extraction

Abstract

Code quality is a requirement for successful and sustainable software development. The emergence of Artificial Intelligence and data-driven Machine Learning in current applications makes customized solutions for both data and code quality a requirement. The diversity and the stochastic nature of Machine Learning algorithms require different test methods, each of which is suitable for a particular method. Conventional unit tests in test-automation environments provide the common, well-studied approach to tackle code quality issues, but Machine Learning applications pose new challenges and have different requirements, mostly as far as the numerical computations are concerned. In this research work, a concrete use of property-based testing for quality assurance in the parameter learning algorithm of a probabilistic graphical model is described. The necessity and effectiveness of this method in comparison to unit tests are analyzed with concrete code examples for enhanced retraceability and interpretability, which makes the approach highly relevant for what is called explainable AI.

1 Introduction

Most Machine Learning (ML) approaches are stochastic, so most existing testing techniques are inadequate for ML code implementations. The ML community therefore uses numerical testing, metamorphic testing, mutation testing, coverage-guided fuzz testing, proof-based testing, and especially property-based testing to detect problems in ML code implementations as early as possible [1]. Because ML models are increasingly used for decision support, e.g. in the medical domain, there is an urgent need for quality assurance, particularly with a focus on domain-dependent properties. One such property is monotonicity, which specifies how the predictions of software learned by an ML model must behave. Interestingly, approaches for checking the monotonicity of the generated model, in particular of black-box models, are lacking [13].

The concept of property-based testing (PBT) relies on randomly generated test cases and is a very relevant extension of unit tests. Defining concrete test cases is a central task when developing unit tests. However, it is very time-consuming and the result is still often incomplete. Methods that automatically generate a variety of test cases from a single specification are therefore more effective and profitable. Test developers no longer have to define all possible edge cases; those are discovered automatically by the corresponding frameworks [9]. The task of the developer shifts from listing and programming many use cases to analyzing the constraints and the properties of the software under test, and letting the framework randomly generate values that fulfil the constraints and explore the relevant edge cases automatically.

Since Artificial Intelligence applications have become prevalent, the need for corresponding quality management tools is rising. Frameworks like ProbFuzz [4] are tailored to the needs of probabilistic programming systems and generate many different probability distributions. Several examples of machine learning programs involving neural networks are described in [3], and one of the most popular and useful methods, Markov Chain Monte Carlo (MCMC), has been exercised with property-based techniques [5]. An extensive study is provided by [15]. This research work focuses on a particular probabilistic graphical model with a defined structure, where the parameters need to be learned with the expectation-maximization algorithm. The testing of the implementation follows the paradigm of property-based testing.

2 Previous Work

Previous research work was based on data of the learning analytics application “1x1 trainer” (Note 1), developed by the Department Educational Technology of Graz University of Technology, Austria. Users answer 1-digit multiplication questions that are posed to them sequentially. Detailed information about the gathered data, student modelling and analysis can be found in [12, 14]. The student model that was designed there provided valuable insights; it is used in the forthcoming sections and its structure is depicted in Fig. 1. Each question has its own probabilistic graphical model; the structure of all models is basically the same.

In Bayesian parameter estimation, the posterior distribution is computed from the prior with the Bayes rule (Eq. 1):

$$\begin{aligned} \overbrace{P( \varTheta | \mathcal {D} )}^{posterior} = \frac{ \overbrace{P( \mathcal {D} | \varTheta )}^{likelihood} \overbrace{P( \varTheta )}^{prior} }{ \underbrace{P( \mathcal {D} )}_{marginal \; likelihood} } \end{aligned}$$

(1)

The parameters of the student model in Fig. 1 are the ones corresponding to the uniform (uninformative) prior. The data that were collected from the students through their interaction with the learning application are used to update the parameters, as described in the following sections. The joint distribution in Eq. 2 is the basis for the parameter updates derived there.

$$\begin{aligned} \begin{aligned} P( \mathbf {{Correctness}_q}, \mathbf {{Learning \; State}_q}, \mathbf {{Answers}_q} ) = \\ P( \mathbf {{Correctness}_q} ) \; P( \mathbf {{Learning \; State}_q} | \mathbf {{Correctness}_q} ) \\ P( \mathbf {{Answers}_q} | \mathbf {{Learning \; State}_q} ) \end{aligned} \end{aligned}$$

(2)

3 Learning the Model’s Parameters with Batch Expectation-Maximization (EM)

Bayesian parameter learning is applicable when all variables are visible [7]; in that case all components of Eq. 1 are computable. The model of the learning competence contains hidden variables, therefore the posterior of each random variable cannot be computed directly. In this case the method that is used is expectation-maximization (abbreviated as EM). The concepts of prior, likelihood and posterior that were described in Sect. 2 are used in the description of this method.

3.1 Notation

The entities that are necessary for the analytical solution for the computation of the posterior distributions of all model’s variables are the following:

N : Number of all samples in dataset

n : One sample of N

M : Number of $ \mathbf {{Correctness}_q} $ possible outcomes ($M=2$)

$\mu $ : Index of $ \mathbf {{Correctness}_q} $ outcome

$w_\mu $ : Parameters of $ \mathbf {{Correctness}_q} $

K : Number of $ \mathbf {{Learning \; State}_q} $ possible error types and correct outcome ($K=8$)

k : Index of $ \mathbf {{Learning \; State}_q} $ outcome (one error type out of $K-1$ or correct)

$\pi _{k \mid \mu }$ : Parameters of the $ \mathbf {{Learning \; State}_q} $ variable

Q : Number of $ \mathbf {{Answers}_q} $ random variables (90 in total)

q : Index of $ \mathbf {{Answers}_q} $ (one question of Q)

X : Number of all possible answers of each question (columns of conditional probability tables of $ \mathbf {{Answers}_{1x1}} $ to $ \mathbf {{Answers}_{10x9}} $)

x : one answer out of X

$x_n$ : the answer of the n-th sample

$\theta _{x \mid k}$ : The parameters of the $ \mathbf {{Answers}_q} $ random variable

$\varTheta $ : All current parameters of the model : (set of all $w_\mu $, $\pi _{k \mid \mu }$, $\theta _{x \mid k}$).

$\varTheta _{old}$ : All parameters of the previous EM iteration.

$\varvec{X}$ : The set of all visible variables. In this model, they are all $ \mathbf {{Answers}_q} $ variables.

$\varvec{Z}$ : The set of all latent variables or hidden causes. In this model, they are all $ \mathbf {{Correctness}_q} $ and $ \mathbf {{Learning \; State}_q} $ variables.

3.2 Expectation-Maximization (EM) Algorithm

The goal of the EM-algorithm is to find appropriate values for all parameters $\varTheta $. In general a better model will fit the data better, although it should not overfit. The latent variables $ \mathbf {{Correctness}_q} $ and $ \mathbf {{Learning \; State}_q} $ are not observed, so the direct maximization of the likelihood $P(\varvec{X} ; \varTheta )$ of the data according to this model is not possible; the observed data $\varvec{X}$ (not to be confused with the number $X$ of possible answers of each question) is incomplete. Each iteration of the EM-algorithm computes a different instantiation of the table CPDs (conditional probability distributions).

By using marginalization:

$$\begin{aligned} P(\varvec{X} ; \varTheta ) = \sum _{\varvec{Z}} P( \varvec{X}, \varvec{Z} ; \varTheta ) \end{aligned}$$

(3)

$$\begin{aligned} \text {ln} \, P(\varvec{X} ; \varTheta ) = \text {ln} \left\{ \sum _{\varvec{Z}} P( \varvec{X}, \varvec{Z} ; \varTheta ) \right\} \end{aligned}$$

(4)

If the complete data set $ \{ \varvec{X}, \varvec{Z} \} $ were known, then it would be straightforward to try to maximize the complete data log-likelihood. To avoid multiplication of very small floating point numbers that can lead to zero, one can equivalently maximize the log-likelihood function $\text {ln} \, P(\varvec{X} ; \varTheta )$.
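The numerical motivation for working with logarithms can be illustrated with a minimal Python sketch. The per-sample probability and the number of samples below are made-up illustration values, not data from the application.

```python
import math

# Hypothetical per-sample likelihoods; the values are illustrative only.
per_sample_probs = [1e-5] * 100

# Multiplying many small probabilities underflows in double precision.
product = 1.0
for p in per_sample_probs:
    product *= p

# Summing logarithms keeps the same information in a representable range.
log_likelihood = sum(math.log(p) for p in per_sample_probs)

print(product)         # 0.0, because 1e-500 is not representable
print(log_likelihood)  # roughly -1151.29
```

Since the logarithm is monotonically increasing, the parameters that maximize the log-likelihood also maximize the likelihood.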

The EM-Algorithm works iteratively and consists of four steps:

1. Initialization: choose initial parameters $\varTheta _0$ for the complete dataset $ \{ \varvec{X}, \varvec{Z} \} $ and set $\varTheta _{old} = \varTheta _0$.
2. E-Step: Computation of the posterior distribution $P( \varvec{Z} | \varvec{X} ; \varTheta _{old})$ of $\varvec{Z}$ given the visible variables and the previous parameters.
3. M-Step: Compute new $\varTheta $ parameters by maximizing (Eq. 6) the expected value of the complete-data log-likelihood under the posterior distribution over the latent variables $\varvec{Z}$ (Eq. 5):
$$\begin{aligned} \mathcal {Q}( \varTheta , \varTheta _{old} ) = \sum _{\varvec{Z}} P( \varvec{Z} | \varvec{X} ; \varTheta _{old}) \; \text {ln}\, P( \varvec{X}, \varvec{Z} ; \varTheta ) \end{aligned}$$
(5)

$$\begin{aligned} \varTheta = \underset{\varTheta }{\mathrm {argmax}} \; \mathcal {Q}( \varTheta , \varTheta _{old} ) \end{aligned}$$
(6)
4. Compute the incomplete data likelihood $ P(\varvec{X} ; \varTheta ) $ or equivalently the log-likelihood $ \text {ln} P(\varvec{X} ; \varTheta ) $ (Eq. 4). If the increase of the log-likelihood or the change of the $\varTheta $ parameters is not significant compared to the previous iteration, then stop. Otherwise, set $\varTheta _{old}$ to the values computed in the M-Step and return to the E-Step.

The EM-algorithm is a “meta-algorithm” since it contains an inference in the E-Step [11]. The iterative process is depicted in Fig. 2.

3.3 Analytical Solution of Expectation-Maximization (EM) for the Model of Learning Competence

The steps of the EM-algorithm are applied to the model of the Learning Competence for the derivation of the analytical solution for the update of the parameters. We apply those steps to the model of Learning Competence of each question q separately. The equations in this subsection omit the subscript q; they apply to the model of each question independently.

Eq. 7 expresses the joint probability distribution, from which Eq. 8 is derived:

$$\begin{aligned} P(\varvec{X}, \varvec{Z} ; \varTheta ) = \prod _{n} \prod _{\mu } \prod _{k} w_{\mu } \pi _{k \mid \mu } \theta _{x_n \mid k} \end{aligned}$$

(7)

$$\begin{aligned} \text {ln}\,P(\varvec{X} ; \varTheta ) = \sum _{n=1}^{N} \text {ln}\,\Bigg ( \sum _{\mu =1}^{M} \sum _{k=1}^{K} w_{\mu } \pi _{k \mid \mu } \theta _{x_n \mid k} \Bigg ) \end{aligned}$$

(8)

The expected value of the complete log-likelihood $\mathcal {Q}( \varTheta , \varTheta _{old} )$ is:

$$\begin{aligned} \begin{aligned} \mathcal {Q}( \varTheta , \varTheta _{old} ) = \mathbb {E}_{ P(\varvec{Z} | \varvec{X} ; \varTheta _{old}) } \, [ \, \text {ln}\,P(\varvec{X}, \varvec{Z} ; \varTheta ) \, ] = \\ \sum _{\varvec{Z}} P ( \varvec{Z} | \varvec{X} ; \varTheta _{old} ) \, \text {ln}\, P(\varvec{X}, \varvec{Z} ; \varTheta ) = \\ \sum _{n} \sum _{\mu } \sum _{k} \gamma (z_{\mu k}^{(n)}) \bigg ( \text {ln} w_{\mu } + \text {ln}\pi _{k \mid \mu } + \text {ln}\theta _{x_n \mid k} \bigg ) \end{aligned} \end{aligned}$$

(9)

The responsibility $ \gamma (z_{\mu k}^{(n)}) $ of the hidden error cause or correct outcome $k$ for the $n$-th sample, coupled with the probability of the question being answered correctly, can be computed by using the Bayes rule and the factorization of the joint probability distribution from Eq. 2:

$$\begin{aligned} \begin{aligned} \gamma (z_{\mu k}^{(n)}) = P ( z_{\mu k}^{(n)} | x^{(n)} ; \varTheta _{old} ) \propto P ( x^{(n)} | z_{\mu k}^{(n)} ; \varTheta _{old} ) P( z_{\mu k}^{(n)} ; \varTheta _{old} ) \end{aligned} \end{aligned}$$

(10)
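The E-step computation of Eq. 10, including the normalization over all $(\mu , k)$ pairs, can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the application: the function name `responsibilities`, the list-of-lists layout of the tables and the small toy parameter values are assumptions made for the example.

```python
def responsibilities(w, pi, theta, x):
    """Posterior gamma[mu][k] over (Correctness, Learning State) for answer x.

    w[mu]       : P(Correctness = mu)
    pi[mu][k]   : P(Learning State = k | Correctness = mu)
    theta[k][x] : P(Answer = x | Learning State = k)
    """
    # Unnormalized posterior: likelihood times prior, as in Eq. 10.
    unnorm = [[w[mu] * pi[mu][k] * theta[k][x] for k in range(len(theta))]
              for mu in range(len(w))]
    total = sum(sum(row) for row in unnorm)
    if total == 0.0:
        raise ValueError("answer has zero probability under the parameters")
    # Dividing by the total supplies the missing normalization factor.
    return [[v / total for v in row] for row in unnorm]


# Hypothetical toy model: 2 correctness outcomes, 3 learning states, 4 answers.
w = [0.5, 0.5]
pi = [[1.0, 0.0, 0.0],
      [0.0, 0.5, 0.5]]
theta = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 0.5, 0.5, 0.0],
         [0.0, 0.25, 0.25, 0.5]]

gamma = responsibilities(w, pi, theta, x=1)
print(gamma[1][1], gamma[1][2])  # two thirds and one third
```

In this toy setting only the two error states can explain the answer, so they share the whole posterior mass in proportion to their likelihoods.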

The values of all $\gamma (z_{\mu k}^{(n)})$ in Eq. 10 are provided up to a normalization factor. Since $\gamma (z_{\mu k}^{(n)})$ depends only on $\varTheta _{old}$, it can be considered a constant in the process of maximization of $\mathcal {Q}$. At the same time, the following constraints that reflect the conditional probability rules must be fulfilled:

$$\begin{aligned} \sum _{\mu =1}^{M} w_{\mu } = 1 \end{aligned}$$

(11)

$$\begin{aligned} \sum _{k=1}^{K} \pi _{k \mid \mu } = 1 \end{aligned}$$

(12)

$$\begin{aligned} \sum _{x=1}^{X} \theta _{x \mid k} = 1 \end{aligned}$$

(13)

The maximization of the expected complete log-likelihood $\mathcal {Q}( \varTheta , \varTheta _{old} )$ leads to the parameters of the model. The maximization process must also fulfil the constraints in Eqs. 11, 12 and 13, which can be achieved with Lagrange multipliers. The maximum of the following expression must be found:

$$\begin{aligned} \mathcal {Q}(\varTheta , \varTheta _{old}) + \lambda \bigg ( \sum _{\mu } w_{\mu } - 1 \bigg ) + \sum _{\mu } \lambda _{\mu } \bigg (\sum _k \pi _{k \mid \mu } - 1\bigg ) + \sum _{k} \lambda _{k} \bigg ( \sum _x \theta _{x \mid k} - 1 \bigg ) \end{aligned}$$

(14)

First, the update of the parameters of a particular $ \mathbf {{Correctness}_q} $ value $w_{m | q}$ is made from the $N'$ data samples $n'$ that have as question $q = q_{n'} \in [q_1 \cdots q_Q]$, with answer $m$ being either correct or wrong:

$$\begin{aligned} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) \frac{1}{w_{m | q}} + \lambda \mathop {=}\limits ^! 0 \quad \Vert \cdot w_{m | q} \end{aligned}$$

(15)

$$\begin{aligned} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')})+ \lambda w_{m | q} \mathop {=}\limits ^! 0 \quad \Vert \sum _{m=1}^M \end{aligned}$$

(16)

$$\begin{aligned} \sum _{m=1}^M \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')})+ \lambda \sum _{m=1}^M w_{m | q} \mathop {=}\limits ^! 0 \end{aligned}$$

(17)

$$\begin{aligned} \lambda = - \sum _{m=1}^M \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) = - N' \end{aligned}$$

(18)

because :

$$\begin{aligned} \sum _{m=1}^M \gamma (z_{\mu k}^{(n')}) = 1 \end{aligned}$$

(19)

$$\begin{aligned} w_{m | q} = \frac{ \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) }{N'} \end{aligned}$$

(20)

Secondly, the maximization with respect to a particular $\pi _l \in [\pi _1 \cdots \pi _K]$ is computed. The derivative must be set to 0, and all terms of Eq. 14 not related to $\pi _l$ vanish as constants. If the number of samples that are answered wrongly is $N'$, the following steps provide the analytical solution for the update rule for any $\pi _k$:

$$\begin{aligned} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) \frac{1}{\pi _l} + \lambda _{\mu } \mathop {=}\limits ^! 0 \quad \Vert \cdot \pi _l \end{aligned}$$

(21)

$$\begin{aligned} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) + \lambda _{\mu } \pi _l \mathop {=}\limits ^! 0 \quad \Vert \sum _{k=1}^K \end{aligned}$$

(22)

$$\begin{aligned} \sum _{k=1}^K \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) + \lambda _{\mu } \sum _{k=1}^K \pi _k \mathop {=}\limits ^! 0 \end{aligned}$$

(23)

$$\begin{aligned} \lambda _{\mu } = - \sum _{k=1}^K \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) \end{aligned}$$

(24)

$$\begin{aligned} \pi _k = \frac{ \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) }{\sum _{k=1}^K \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')})} \end{aligned}$$

(25)

Thirdly, the maximization with respect to $\theta _{x | q, k}$ is performed in a similar manner. The update of the parameters of a particular $\mathbf {{Answers}_q} $ value $\theta _{x | q, k}$ is made from the data samples $n'$ that have as question $q = q_{n'} \in [q_1 \cdots q_Q]$ and answer $x = x_{n'} \in [x_1 \cdots x_X]$:

$$\begin{aligned} \sum _{n'=1}^{N'} \frac{ \gamma (z_{\mu k}^{(n')}) }{ \theta _{x \mid k} } + \lambda _k \mathop {=}\limits ^! 0 \quad \Vert \cdot \theta _{x \mid k} \end{aligned}$$

(26)

$$\begin{aligned} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) + \lambda _k \theta _{x \mid k} \mathop {=}\limits ^! 0 \quad \Vert \sum _{x=1}^{X} \end{aligned}$$

(27)

$$\begin{aligned} \sum _{x=1}^{X} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) + \lambda _k \sum _{x=1}^{X} \theta _{x \mid k} \mathop {=}\limits ^! 0 \end{aligned}$$

(28)

$$\begin{aligned} \lambda _k = - \sum _{x=1}^{X} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) \end{aligned}$$

(29)

$$\begin{aligned} \theta _{x \mid k} = \frac{ \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) }{ \sum _{x=1}^{X} \sum _{n'=1}^{N'} \gamma (z_{\mu k}^{(n')}) } \end{aligned}$$

(30)

The steps of the EM-Algorithm for updating the parameters of this Bayesian Model are as follows:

1. Initialization of all parameters $\varTheta _0$. In this case that is the uniform prior.
2. E-Step: Computation of $\gamma (z_{\mu k}^{(n)})$ using Eq. 10.
3. M-Step: Compute new $\varTheta $ parameters $w_{\mu | q}$, $\pi _k$ and $\theta _{x \mid k}$ using Eqs. 20, 25 and 30.
4. Compute the likelihood $ P(\varvec{X} ; \varTheta ) $ or log-likelihood $ \text {ln} P(\varvec{X} ; \varTheta ) $:
$$\begin{aligned} \begin{aligned} P( \mathbf {{Correctness}_q}, \mathbf {{Answers}_q} ; \varTheta ) = \\ \frac{ P(\mathbf {{Correctness}_q}, \mathbf {{Learning State}_q}, \mathbf {{Answers}_q} ; \varTheta ) }{ P( \mathbf {{Learning State}_q} | \mathbf {{Correctness}_q}, \mathbf {{Answers}_q} ; \varTheta ) } \\ P( \mathbf {{Correctness}_q} ; \varTheta ) P( \mathbf {{Correctness}_q} | \mathbf {{Learning State}_q} ; \varTheta ) \end{aligned} \end{aligned}$$
(31)
If the likelihood or the parameter values have not converged, set the current $\varTheta $ to the values computed in the M-Step and return to the E-Step.

Figure 3 depicts the steps of the EM-algorithm updating procedure. The dataset used for training is called training set.

It is proven that the EM-algorithm never decreases the log-likelihood of the observed data $\varvec{X}$ from one iteration to the next [2]:

$$\begin{aligned} \text {ln} \, P(\varvec{X} ; \varTheta ) \ge \text {ln} \, P(\varvec{X} ; \varTheta _{old}) \end{aligned}$$

(32)
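The batch procedure and the monotonicity property of Eq. 32 can be sketched with a small, self-contained Python example. It is a toy under stated assumptions rather than the application's implementation: the model sizes, the random initialization, the synthetic answers and all function names are hypothetical, and the parameters are started from random positive values instead of the uniform prior so that the iterations visibly change them.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def log_likelihood(data, w, pi, theta):
    # Incomplete-data log-likelihood: the latent (mu, k) pairs are summed out.
    return sum(
        math.log(sum(w[m] * pi[m][k] * theta[k][x]
                     for m in range(len(w)) for k in range(len(theta))))
        for x in data)


def em_step(data, w, pi, theta):
    M, K, X = len(w), len(theta), len(theta[0])
    n_mk = [[0.0] * K for _ in range(M)]  # expected counts of (mu, k)
    n_kx = [[0.0] * X for _ in range(K)]  # expected counts of (k, x)
    for x in data:
        # E-step: normalized responsibilities of this sample.
        joint = [[w[m] * pi[m][k] * theta[k][x] for k in range(K)]
                 for m in range(M)]
        total = sum(map(sum, joint))
        for m in range(M):
            for k in range(K):
                g = joint[m][k] / total
                n_mk[m][k] += g
                n_kx[k][x] += g
    # M-step: each parameter is a normalized expected count.
    new_w = normalize([sum(row) for row in n_mk])
    new_pi = [normalize(row) for row in n_mk]
    new_theta = [normalize(row) for row in n_kx]
    return new_w, new_pi, new_theta


rng = random.Random(0)
M, K, X = 2, 3, 4  # toy sizes, much smaller than in the real model
w = normalize([rng.random() + 0.1 for _ in range(M)])
pi = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(M)]
theta = [normalize([rng.random() + 0.1 for _ in range(X)]) for _ in range(K)]
data = [rng.randrange(X) for _ in range(50)]  # synthetic observed answers

history = [log_likelihood(data, w, pi, theta)]
for _ in range(100):
    w, pi, theta = em_step(data, w, pi, theta)
    history.append(log_likelihood(data, w, pi, theta))
    if history[-1] - history[-2] < 1e-8:  # no significant increase: stop
        break
```

The list `history` holds the log-likelihood after each iteration; checking that it never decreases (up to floating-point tolerance) is exactly the kind of invariant that a property-based test can assert for arbitrary generated data.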

The procedure of updating the log-likelihood in this manner is shown to guarantee convergence to a stationary point, which can be a local minimum, local maximum or saddle point. Fortunately, by initializing the iterations from different starting $\varTheta _0$ and injecting small changes to the parameters, the local minima and saddle points can be avoided [7].

4 Fractional Updating

Since the learning application proposes questions continuously, it is important to update the beliefs about the learning competence of the student as soon as an answer is present. As new evidence is observed - in the form of answered questions - the model shifts the value of the parameters to reflect the fact that the belief about the learning competence of the user is changed.

With fractional updating [6], the initialization and updating of the parameters is made by means of Dirichlet pseudocounts. The starting pseudocount number is set to 1.0 to express a weak belief about the learning competence of the student. In this application, for each question-answer pair, only one probabilistic graphical model needs to be updated. The data sample only contains the value of the corresponding observed variable $ \mathbf {{Answers}_q}$. The update of the pseudocounts $\alpha $ is given by Eq. 33:

$$\begin{aligned} \alpha _{ijk}^{l+1} = \alpha _{ijk}^{l} + P(\varvec{X_i} = k, Parents_G(\varvec{X_i}) = j | \mathcal {D}) \end{aligned}$$

(33)

where the current joint probability of the updated variable and the value of its parents are used to update the value of the pseudocounts, which may no longer be an integer. $\mathcal {D}$ denotes the dataset of samples.
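For a single row of a conditional probability table, the update of Eq. 33 amounts to adding the posterior weights to the pseudocounts and renormalizing. The following Python sketch illustrates this; the function name `fractional_update` and the list representation of a row are assumptions made for the illustration.

```python
def fractional_update(pseudocounts, posterior):
    """Add posterior weights to Dirichlet pseudocounts (Eq. 33).

    Returns the updated pseudocounts and the probabilities they imply.
    """
    if len(pseudocounts) != len(posterior):
        raise ValueError("pseudocounts and posterior must have equal length")
    updated = [a + p for a, p in zip(pseudocounts, posterior)]
    total = sum(updated)
    return updated, [a / total for a in updated]


# Correctness row with outcomes [correct, wrong], both pseudocounts at 1.0.
# A wrong answer is observed, so the posterior puts all weight on "wrong".
counts, probs = fractional_update([1.0, 1.0], [0.0, 1.0])
print(counts)  # [1.0, 2.0]
print(probs)   # one third for "correct", two thirds for "wrong"
```

Because the posterior weights sum to one, every update increases the total pseudocount of the row by exactly one, which is why later observations move the probabilities less than early ones.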

The fractional updating procedure can be explained by an example where a student provides the wrong answer 42 to the question $8 \times 5$. The probabilities start with the following values (Tables 1 and 2):

Table 1. Probabilistic graphical model of the question $8 \times 5$ where all conditional probabilities (all rows of the conditional probability tables) are set uniformly.

The corresponding pseudocounts start with the following values:

Table 2. Probabilistic graphical model of the question $8 \times 5$ where all pseudocounts are set to value 1.

There are three error types that can cause the answer 42. The weights for each case, corresponding to the entity $P(\varvec{X_i} = k, Parents_G(\varvec{X_i}) = j | \mathcal {D})$ of the Eq. 33, are computed as follows:

$\frac{1}{2} \frac{1}{7} \frac{1}{12} / (\frac{1}{2} \frac{1}{7} \frac{1}{12} + \frac{1}{2} \frac{1}{7} \frac{1}{17} + \frac{1}{2} \frac{1}{7} \frac{1}{4}) = 0.2125$

$\frac{1}{2} \frac{1}{7} \frac{1}{17} / (\frac{1}{2} \frac{1}{7} \frac{1}{12} + \frac{1}{2} \frac{1}{7} \frac{1}{17} + \frac{1}{2} \frac{1}{7} \frac{1}{4}) = 0.15$

$\frac{1}{2} \frac{1}{7} \frac{1}{4} / (\frac{1}{2} \frac{1}{7} \frac{1}{12} + \frac{1}{2} \frac{1}{7} \frac{1}{17} + \frac{1}{2} \frac{1}{7} \frac{1}{4}) = 0.6375$

The values of the updated pseudocounts of the operand, intrusion, and consistency error types are presented in the following table:

The pseudocounts of the rest of the error types will remain at the value 1, so the total sum of pseudocounts that will be used as normalization value is: $ 4 \times 1 + 1.2125 + 1.15 + 1.6375 = 8.0$. The actual probabilities of the involved error types are listed in the table above; the rest have the value $1/8 = 0.125$. The pseudocounts of “wrong” are increased by $+1$, setting the probability of “wrong” to $\frac{2}{3}$ and “correct” to $\frac{1}{3}$. The values of the $ \mathbf {{Answers}_{8x5}} $ random variable are computed accordingly.
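The arithmetic of this example can be re-derived with exact fractions. The sketch below is only a check of the numbers; the variable names are illustrative, and the results agree with the values in the text up to rounding in the fourth decimal place.

```python
from fractions import Fraction

# Uniform starting probabilities taken from the example in the text.
p_wrong = Fraction(1, 2)
p_error_type = Fraction(1, 7)
# P(answer = 42 | error type) for the three candidate error types.
p_answer = [Fraction(1, 12), Fraction(1, 17), Fraction(1, 4)]

# Joint probability of each candidate with the observed answer, then
# normalization over the three candidates.
joint = [p_wrong * p_error_type * p for p in p_answer]
weights = [j / sum(joint) for j in joint]

# Seven error-type pseudocounts start at 1; three of them receive a weight.
pseudocounts = [Fraction(1)] * 4 + [1 + wgt for wgt in weights]
total = sum(pseudocounts)

print([float(wgt) for wgt in weights])  # [0.2125, 0.15, 0.6375]
print(float(total))                     # 8.0
print(float(1 / total))                 # 0.125
```

The common factor $\frac{1}{2} \frac{1}{7}$ cancels in the normalization, so the weights depend only on the three conditional probabilities of the answer.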

This comprises a full iteration of the online EM-algorithm. In the next iteration, the newly computed pseudocounts will play the role of the prior that needs updating. Drawbacks and extensions of the fractional updating algorithm are discussed in [6].

5 Evaluation of Parameter Learning

The evaluation of the parameter learning EM-algorithm is firstly made by computing the likelihood of a training set at each iteration. It is expected that the likelihood is increasing monotonically and converging with increasing number of iterations. As seen in Fig. 4, this also applies for the likelihood of the models as a whole.

If the EM-algorithm is repeated for many iterations, the values of the parameters will be adjusted too much to the training set, without being able to generalize to the properties of the test set; an unseen user’s learning competence must be also modelled sufficiently by the model. Figure 5 depicts the evolution of the likelihood of the test set, with respect to the number of EM-iterations of the training set. The likelihood of the test set decreases after 4 iterations; this is an indication of overfitting [2].

The expectation-maximization algorithm relies on the assumption that every possible outcome of every question is present in at least one sample of the dataset. This was not the case in this application; the sufficient statistics condition was not fulfilled [7, 8]. Nevertheless, it has been shown that the model’s performance is sufficient for practical purposes and that, as new data are gathered, this problem might be solved.

6 Testing

Software testing in Python is made with the use of the pytest framework (Note 2) [10]. The expectation-maximization update rules were tested with some examples and compared to numerically computed results. But since the number of possible cases that must be tried out is very large, property-based testing was used [9]. The hypothesis framework (Note 3) gives the ability to write tests in an abstract manner, where the concrete numerical values are generated automatically.

An extended description of property-based testing is out of the scope of this work, but the main idea is to describe the properties of the function under test. Generating several numerical examples by that means proved to be very effective. The fractional expectation-maximization update rule of Sect. 4 has the following properties:

1. The sum of each row of the conditional probability tables must equal 1.0 after the update.
2. Only one of the $ \mathbf {{Answers}_q} $ (namely the one that corresponds to the posed question), as well as the corresponding $ \mathbf {{Learning \; State}_q} $ and $\mathbf {{Correctness}_q}$, will change.
3. A sampled answer can belong to one or more error types. The conditional probability of this answer will be increased in the $ \mathbf {{Answers}_q} $ conditional probability table. Because the sum must remain 1.0, the conditional probabilities of the other answers will be decreased after the update. Similarly, the error types that could be responsible for this answer will have an increased conditional probability in the $ \mathbf {{Learning \; State}_q} $, whereas the remaining conditional probabilities will decrease. If the answer is wrong, then the belief in the student’s ability to answer a question correctly will fall, and the $\mathbf {{Correctness}_q}$ will change by the same means.

The code in Listing 1.1 is the definition of one pytest test-case which covers several valid scenarios at once. The test-case exercises the fractional expectation-maximization learner and updater algorithm. During each scenario, one of the questions is randomly sampled, as well as a valid corresponding answer between a minimum and a maximum value. The number of iterations of expectation-maximization also varies. Up to 10 tests are run at each execution (controlled by the parameter max_examples) and the execution can have unlimited time. The test’s fixture that defines all those values is passed as a parameter.

Part of the implemented properties in the test-case are listed in Listing 1.2. After the fractional updater is applied, the properties are checked one by one. The probability distributions at each row of the Conditional Probability Tables must sum up to one at all times. Depending on the answer, the “Correct” or “InCorrect” proportion must increase, as well as the corresponding error types that could generate it. The probabilities of the other error types that could not have generated this answer must decrease correspondingly, while the sum remains equal to 1. The likelihood of the training set data must increase, as described by Eq. 32.
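The style of such a test can be sketched as follows. This is not the code of the listings: it is a standard-library stand-in in which a seeded random generator plays the role of the Hypothesis strategies, and the function names, value ranges and the single-row update are hypothetical. The sketch checks the combined probability of the candidate error types, since that variant of the third property holds for arbitrary positive pseudocounts and likelihoods.

```python
import random


def fractional_update(pseudocounts, likelihoods):
    """One fractional update of a Learning State row (sketch of Eq. 33).

    likelihoods[i] is P(answer | error type i); zero means that the
    error type cannot produce the observed answer.
    """
    total = sum(pseudocounts)
    joint = [a / total * l for a, l in zip(pseudocounts, likelihoods)]
    evidence = sum(joint)
    return [a + j / evidence for a, j in zip(pseudocounts, joint)]


def to_probs(pseudocounts):
    total = sum(pseudocounts)
    return [a / total for a in pseudocounts]


def check_properties(pseudocounts, likelihoods):
    before = to_probs(pseudocounts)
    after = to_probs(fractional_update(pseudocounts, likelihoods))
    candidates = [i for i, l in enumerate(likelihoods) if l > 0]
    others = [i for i, l in enumerate(likelihoods) if l == 0]
    # Property 1: the row is still a probability distribution.
    assert abs(sum(after) - 1.0) < 1e-9
    # Property 2: error types that cannot explain the answer lose probability.
    assert all(after[i] < before[i] for i in others)
    # Property 3: the candidate error types do not lose mass as a group.
    gained = sum(after[i] for i in candidates)
    assert gained >= sum(before[i] for i in candidates) - 1e-9


def run_random_cases(max_examples=10, seed=0):
    rng = random.Random(seed)
    for _ in range(max_examples):
        k = rng.randint(2, 8)
        pseudocounts = [rng.uniform(0.5, 20.0) for _ in range(k)]
        likelihoods = [rng.choice([0.0, rng.uniform(0.01, 1.0)])
                       for _ in range(k)]
        # Guarantee that at least one error type can explain the answer.
        likelihoods[rng.randrange(k)] = rng.uniform(0.01, 1.0)
        check_properties(pseudocounts, likelihoods)
    return max_examples


run_random_cases(max_examples=10)
```

With Hypothesis, the hand-written driver would typically be replaced by decorating the property check with `@given(...)` and `@settings(max_examples=10, deadline=None)`, which additionally shrinks failing inputs to a minimal counterexample.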

7 Conclusion

Property-based testing is an effective method to ensure the code quality of parameter learning for probabilistic graphical models. Its necessity and effectiveness justified its use, and it was beneficial for the quality management of a learning-aware application. This work provides a concrete paradigm that can be used by similar applications that rely on probabilistic programming and on analytical solutions of their parameter updating rules.

Notes

1. https://schule.learninglab.tugraz.at/einmaleins/, last accessed 26 April 2020.
2. https://docs.pytest.org/, last accessed 10 March 2020.
3. https://hypothesis.readthedocs.io/en/latest/, last accessed 10 March 2020.

References

1. Braiek, H.B., Khomh, F.: On testing machine learning programs. J. Syst. Softw. 164, 110542 (2020). https://doi.org/10.1016/j.jss.2020.110542
2. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
3. Braiek, H.B., Khomh, F.: On testing machine learning programs. J. Syst. Softw. 164, 110542 (2020)
4. Dutta, S., Legunsen, O., Huang, Z., Misailovic, S.: Testing probabilistic programming systems. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 574–586 (2018)
5. Grosse, R.B., Duvenaud, D.K.: Testing MCMC code. arXiv preprint arXiv:1412.5218 (2014)
6. Jensen, F.V., Nielsen, T.D.: Bayesian Networks and Decision Graphs, 2nd edn. Springer, New York (2007)
7. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
8. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
9. Nilsson, R.: ScalaCheck: The Definitive Guide. Artima (2014)
10. Okken, B.: Python Testing with Pytest: Simple, Rapid, Effective, and Scalable. Pragmatic Bookshelf (2017)
11. Pfeffer, A.: Practical Probabilistic Programming. Manning Publications, Greenwich (2016)
12. Saranti, A., Taraghi, B., Ebner, M., Holzinger, A.: Insights into learning competence through probabilistic graphical models. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-MAKE 2019. LNCS, vol. 11713, pp. 250–271. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29726-8_16
13. Sharma, A., Wehrheim, H.: Testing monotonicity of machine learning models. arXiv preprint arXiv:2002.12278 (2020)
14. Taraghi, B., Saranti, A., Legenstein, R., Ebner, M.: Bayesian modelling of student misconceptions in the one-digit multiplication with probabilistic programming. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, pp. 449–453 (2016)
15. Zhang, J.M., Harman, M., Ma, L., Liu, Y.: Machine learning testing: survey, landscapes and horizons. arXiv preprint arXiv:1906.10742 (2019)


Author information

Authors and Affiliations

Medical University Graz, Auenbruggerplatz 2, 8036, Graz, Austria
Anna Saranti & Andreas Holzinger
xAI Lab, Alberta Machine Intelligence Institute, Edmonton, T6G 2H1, Canada
Andreas Holzinger
Department Educational Technology, Graz University of Technology, Münzgrabenstrasse 36/I, 8010, Graz, Austria
Behnam Taraghi & Martin Ebner

Corresponding author

Correspondence to Anna Saranti.

Editor information

Editors and Affiliations

Human-Centered AI Lab, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Graz, Austria
Andreas Holzinger
UAS St. Pölten, St. Pölten, Austria
Peter Kieseberg
Institute of Software Technology and Interactive Systems, Technical University of Vienna, Vienna, Austria
A Min Tjoa
SBA Research, Vienna, Austria
Edgar Weippl

Cite this paper

Saranti, A., Taraghi, B., Ebner, M., Holzinger, A. (2020). Property-Based Testing for Parameter Learning of Probabilistic Graphical Models. In: Holzinger, A., Kieseberg, P., Tjoa, A., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2020. Lecture Notes in Computer Science(), vol 12279. Springer, Cham. https://doi.org/10.1007/978-3-030-57321-8_28

DOI: https://doi.org/10.1007/978-3-030-57321-8_28
Published: 18 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57320-1
Online ISBN: 978-3-030-57321-8
eBook Packages: Computer Science (R0)
