Correctness guarantees are at the core of cyber-physical computing research. While prior research addressed correctness of timing behavior and correctness of program logic, this paper tackles the emerging topic of assessing correctness of input data. This topic is motivated by the desire to crowd-source sensing tasks, an act we henceforth call social sensing, in applications with humans in the loop. A key challenge in social sensing is that the reliability of sources is generally unknown, which makes it difficult to assess the correctness of collected observations. To address this challenge, we adopt a cyber-physical approach, where assessment of correctness of individual observations is aided by knowledge of physical constraints on sources and observed variables to compensate for the lack of information on source reliability. We cast the problem as one of maximum likelihood estimation. The goal is to jointly estimate both (i) the latent physical state of the observed environment, and (ii) the inferred reliability of individual sources such that they are maximally consistent with both provenance information (who reported what) and physical constraints. We also derive new analytic bounds that allow social sensing applications to accurately quantify the estimation error of source reliability for given confidence levels. We evaluate the framework through both a real-world social sensing application and extensive simulation studies. The results demonstrate significant gains in the estimation accuracy of the new algorithms and verify the correctness of the analytic bounds we derived.

In practice, we can run the algorithm until the change in the estimation parameters between consecutive iterations becomes insignificant.
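The stopping rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `run_until_converged`, the tolerance `tol`, the cap `max_iter`, and the choice of the largest absolute parameter change as the measure of "insignificant" are all assumptions made here.

```python
import numpy as np

def run_until_converged(update, theta0, tol=1e-6, max_iter=1000):
    """Iterate theta <- update(theta) until the parameters stop changing.

    `update` stands in for one combined E-step and M-step (hypothetical
    callable). Convergence is declared when the largest absolute change
    in any parameter falls below `tol` (an assumed criterion).
    Returns the final parameters and the number of iterations used.
    """
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, max_iter + 1):
        new_theta = np.asarray(update(theta), dtype=float)
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta, n
        theta = new_theta
    # Iteration cap reached without meeting the tolerance.
    return theta, max_iter
```

For example, `run_until_converged(lambda t: 0.5 * t + 1.0, [0.0])` iterates a simple contraction toward its fixed point at 2 and stops well before the iteration cap.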
As stated in our application model, sources never report a variable to be false (e.g., cars never reported the absence of traffic lights).
In principle, a source has no incentive to lie more than 50% of the time, since negating its statements would then yield a more accurate estimate of the truth.
1.1 Derivation of the E-step and M-step of OtO EM
Having formulated the new likelihood function to account for the source constraints in the previous subsection, we can now plug it into the Q function defined in Eq. (7) of Expectation Maximization. The E-step can be derived as follows:
where \(p(z_j=1|X_j,\theta ^{(n)})\) represents the conditional probability that the variable \(C_j\) is true given the observation matrix related to the jth observed variable and the current estimate of \(\theta \). We denote \(p(z_j=1|X_j,\theta ^{(n)})\) by Z(n, j) since it is a function only of n and j. Z(n, j) can be further computed as:
Note that, in the E-step, we continue to only consider sources who observe a given variable while computing the likelihood of reports regarding that variable.
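To illustrate the restriction to observing sources, the sketch below computes a posterior for each variable by Bayes' rule, multiplying report likelihoods only over sources that had the opportunity to observe it. It rests on assumptions not stated in this excerpt: that \(a_i\) and \(b_i\) are the probabilities that \(S_i\) reports a variable it observes when that variable is true and false respectively, that \(d_j\) is the prior probability that \(C_j\) is true, and that all parameters lie strictly between 0 and 1. The function name and argument layout are hypothetical.

```python
import numpy as np

def e_step_posterior(reports, observes, a, b, d):
    """Posterior that each variable is true, using observing sources only.

    reports:  (M, N) 0/1 matrix, 1 if source i reported variable j (SC).
    observes: (M, N) 0/1 matrix, 1 if source i observed variable j (SK).
    a, b:     length-M arrays (assumed meaning: report rates given the
              variable is true / false); d: length-N prior (assumed).
    """
    reports = np.asarray(reports, dtype=float)
    observes = np.asarray(observes, dtype=bool)
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[:, None]
    d = np.asarray(d, dtype=float)

    # Log-likelihood of each source's report (or silence) under each hypothesis.
    log_true = reports * np.log(a) + (1 - reports) * np.log(1 - a)
    log_false = reports * np.log(b) + (1 - reports) * np.log(1 - b)

    # Sources that did not observe variable j contribute nothing for j.
    log_true = np.where(observes, log_true, 0.0).sum(axis=0) + np.log(d)
    log_false = np.where(observes, log_false, 0.0).sum(axis=0) + np.log1p(-d)

    # Normalize the two hypotheses in log space for numerical stability.
    return 1.0 / (1.0 + np.exp(log_false - log_true))
```

Working in log space avoids underflow when many sources observe the same variable; the silence of a non-observing source leaves the posterior unchanged, whereas the silence of an observing source lowers it.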
In the M-step, we set the derivatives \(\frac{\partial Q}{\partial a_i}=0\), \(\frac{\partial Q}{\partial b_i}=0\), \(\frac{\partial Q}{\partial d_j}=0\). This gives us the \(\theta ^*\) (i.e., \(a_1^*,a_2^*,\ldots ,a_M^*\);\(b_1^*, b_2^*,\ldots ,b_M^*\);\(d_1^*,d_2^*,\ldots ,d_N^*\)) that maximizes the \(Q\left( \theta |\theta ^{(n)}\right) \) function in each iteration and is used as the \(\theta ^{(n+1)}\) of the next iteration.
where \(\mathcal {O}_i\) is the set of variables that source \(S_i\) observes according to the knowledge matrix SK, and Z(n, j) is defined in Eq. (23). \(SJ_i\) is the set of variables that source \(S_i\) actually reports in the observation matrix SC. We note that, in the computation of \(a_i\) and \(b_i\), the silence of source \(S_i\) regarding some variable \(C_j\) is interpreted differently depending on whether \(S_i\) observed it or not. This reflects that the opportunity to observe has been incorporated into the M-step when the estimation parameters of sources are computed. The resulting OtO EM algorithm is summarized in the subsection below.
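The role of \(\mathcal {O}_i\) and \(SJ_i\) in the parameter updates can be sketched in code. The update equations themselves are not reproduced in this excerpt, so the form used here is an assumption: \(a_i\) is taken as the Z-weighted fraction of observed variables that \(S_i\) reported, \(b_i\) as the corresponding (1 − Z)-weighted fraction, and \(d_j\) as Z(n, j). The sketch also assumes that every source reports only variables it observed and observes at least one variable; the function name is hypothetical.

```python
import numpy as np

def m_step_oto(reports, observes, Z):
    """Assumed-form parameter update with opportunity to observe.

    reports:  (M, N) 0/1 matrix SC; row i marks the set SJ_i.
    observes: (M, N) 0/1 matrix SK; row i marks the set O_i.
    Z:        length-N array of posteriors Z(n, j) from the E-step.
    """
    reports = np.asarray(reports, dtype=float)
    observes = np.asarray(observes, dtype=float)
    Z = np.asarray(Z, dtype=float)

    # Sum Z over reported variables, normalized by Z over observed ones.
    a = (reports @ Z) / (observes @ Z)
    # Same ratio, weighted by the posterior that each variable is false.
    b = (reports @ (1.0 - Z)) / (observes @ (1.0 - Z))
    # Assumed prior update: the current posterior of each variable.
    d = Z.copy()
    return a, b, d
```

With one source that observes the first two of three variables and reports only the first, and Z = (0.9, 0.5, 0.1), this gives a = 0.9/1.4 rather than the 0.9/1.5 obtained if silence about the unobserved third variable were also counted against the source.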
1.2 Derivation of the E-step and M-step of DV and OtO+DV EM
Given the new likelihood function of the DV EM scheme defined in Eq. (11), the E-step becomes:
where \(p(z_{g_1},\ldots ,z_{g_k}|X_g,\theta ^{(n)})\) represents the conditional joint probability of all variables in independent group g (i.e., \(g_1,\ldots ,g_k\)) given the observed data regarding these variables and the current estimation of the parameters. \(p(z_{g_1},\ldots ,z_{g_k}|X_g,\theta ^{(n)})\) can be further computed as follows:
We note that \(p(z_j=1|X_j,\theta ^{(n)})\) (i.e., Z(n, j)), defined as the probability that \(C_j\) is true given the observed data and the current estimate of the parameters, can be computed as the marginal distribution of the joint probability of all variables in the independent variable group g that variable \(C_j\) belongs to (i.e., \(p(z_{g_1},\ldots ,z_{g_k}|X_g,\theta ^{(n)})\quad j\in c_g\)). We also note that, in the worst case, where all N variables fall into one independent group, the cost of computing this marginal grows exponentially in N. However, as long as the constraints on observed variables are localized, so that each group stays small, our approach remains scalable regardless of the total number of estimated variables.
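The marginalization and its cost can be made concrete with a brute-force sketch. It enumerates every joint assignment of the k binary variables in one group, so it visits 2^k assignments, which is the exponential growth noted above. The function name and the callable `joint_weight`, which stands in for the unnormalized joint probability of one assignment, are hypothetical; the paper's actual joint distribution is not reproduced here.

```python
from itertools import product

def group_marginals(joint_weight, k):
    """Marginal probability that each of k grouped variables is true.

    joint_weight: callable taking a tuple of k values in {0, 1} and
                  returning a non-negative, unnormalized joint weight.
    Cost: 2**k evaluations of joint_weight.
    """
    total = 0.0
    marginals = [0.0] * k
    for z in product((0, 1), repeat=k):
        w = joint_weight(z)
        total += w
        # Accumulate the weight for every variable set to 1 in this assignment.
        for idx, bit in enumerate(z):
            if bit:
                marginals[idx] += w
    return [m / total for m in marginals]
```

Because the loop runs per group, the total work is the sum of 2^k over groups rather than 2^N, which is why small, localized groups keep the computation tractable.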
In the M-step, as before, we choose \(\theta ^*\) that maximizes the \(Q\left( \theta |\theta ^{(n)}\right) \) function in each iteration to be the \(\theta ^{(n+1)}\) of the next iteration. Hence:
where \(Z(n,j)=p(z_j=1|X_j,\theta ^{(n)})\). We note that for the estimation parameters, \(a_i\) and \(b_i\), we obtain the same expression as for the case of independent variables. The reason is that sources report variables independently of the form of constraints between these variables.
Next, we combine the two EM extensions (i.e., OtO EM and DV EM) derived so far to obtain a comprehensive EM scheme (OtO+DV EM) that considers constraints on both sources and observed variables. The corresponding E-Step and M-Step are shown below:
