Knowledge-Based Systems

Volume 233, 5 December 2021, 107482

Probabilistic model for truth discovery with mean and median check framework

https://doi.org/10.1016/j.knosys.2021.107482

Highlights

  • Addressing the truth discovery problem without following the generally accepted principle.

  • Theoretically verifying the absolute distance between the mean and median value.

  • Proposing the framework for truth detection, error claim removal, and iteration-stopping criteria.

  • Conducting experiments on three datasets to evaluate the performance.

Abstract

In the era of big data, information can be collected from various sources. Unfortunately, information provided by multiple sources on the same entity is inevitably conflicting. Due to the ubiquitous existence of data conflicts, truth discovery has recently attracted considerable attention. Several truth discovery methods focus on providing a point estimate for the truth of each entity and exhibit completely different performances on the same input dataset. Therefore, an appropriate truth discovery method should be adopted to fit the unknown source reliability distributions. To address this, we approach truth discovery from another perspective. We theoretically verify that if the absolute distance between the mean and median value is large, then there must be incorrect claims with large errors in the input dataset. Accordingly, we propose a mean and median check (MMC) framework for truth detection, error claim removal, and iteration-stopping criteria. The experiments demonstrate that MMC can effectively remove incorrect claims provided by unreliable sources. Furthermore, the performance of state-of-the-art truth discovery methods can be significantly improved if MMC is used for input data preprocessing.

Introduction

In the era of big data, information on the same object can be obtained from various data sources. For example, the price of a commodity may be listed differently on different websites, and the temperature of a local area may be reported differently by different sensors. However, there are errors, conflicts, and outdated data across different sources. To tackle this problem, truth discovery, which integrates multisource noisy information by estimating the reliability of each source, has attracted considerable attention [1].

A generally accepted principle for truth discovery is that more reliable sources are assumed to provide claims closer to the truth. Accordingly, most studies on truth discovery are concerned with determining the optimal reliability degree values of sources and the estimated truths [2], [3], [4], [5], [6], [7], [8], [9]. However, as an increasing number of truth discovery methods are being developed, the following issues related to this principle are becoming impediments to practical applications.

Biased truth estimator: The generally accepted principle implies that the truth can be estimated as a weighted combination of the claims from different sources. Without any prior knowledge, however, truth discovery methods must begin with uniform source weights. Unfortunately, some incorrect claims may deviate significantly from the truth, either intentionally or due to noise, yet their sources initially receive the same weights as reliable ones. Therefore, unless the weights of unreliable sources are initialized to sufficiently small values, a biased truth estimator is obtained, as the sketch below illustrates.
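
As a toy illustration of this issue (hypothetical values, not taken from the paper's datasets), a single grossly incorrect claim with a uniform initial weight pulls the weighted estimate far from the truth:

    # Toy illustration (hypothetical values): with uniform initial weights,
    # one grossly incorrect claim biases the weighted truth estimate.
    claims = [10.1, 9.9, 10.0, 10.2, 35.0]        # last source is unreliable
    weights = [1.0 / len(claims)] * len(claims)   # uniform initialization

    estimate = sum(w * c for w, c in zip(weights, claims))
    print(estimate)   # ~15.04, far from the underlying value of about 10.0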

Inconsistent performance: Truth discovery aims to identify the most trustworthy information without any knowledge of the source reliability distribution. However, because different methods make different assumptions about this distribution, they may perform inconsistently on the same dataset, and selecting a method that fits an unknown source reliability distribution is difficult in practice.

Low efficiency: Because truth estimation and the computation of source weights are tightly coupled, coordinate descent is commonly adopted, in which one set of variables is fixed while solving for the other [10]. Most existing methods are therefore iterative, and all claims, including incorrect ones, are involved in every iteration; the computational complexity of coordinate descent grows with the size of the input claim set [3], [11]. A minimal sketch of this pattern is given below.
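
The following is a minimal sketch of the coordinate-descent pattern described above, in the spirit of CRH-style methods rather than the formulation of any particular paper; the inverse-error weight update is an assumed form chosen only for illustration:

    import math

    # Sketch of iterative truth discovery by coordinate descent (illustrative only;
    # not the authors' formulation). claims[i] is the value source i reports for a
    # single entity; every claim participates in every iteration, so the cost grows
    # with the size of the input claim set.
    def coordinate_descent(claims, iterations=10):
        weights = [1.0] * len(claims)                     # uniform start
        for _ in range(iterations):
            # Fix the weights, update the estimated truth (weighted mean).
            truth = sum(w * c for w, c in zip(weights, claims)) / sum(weights)
            # Fix the truth, update each weight from the source's error
            # (a simple inverse-error rule assumed here for illustration).
            errors = [abs(c - truth) + 1e-9 for c in claims]
            weights = [math.log(sum(errors) / e) for e in errors]
        return truth

    print(coordinate_descent([10.1, 9.9, 10.0, 10.2, 35.0]))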

Accordingly, we naturally raise the question of whether the truth discovery problem can be addressed without following this generally accepted principle.

In this paper, we propose performing truth discovery within a mean and median check (MMC) framework. We design a novel framework that addresses the aforementioned issues without following the generally accepted principle. Regarding the biased truth estimator, the proposed framework eliminates incorrect claims with significant deviations, so the mean or median of the remaining claims can serve as the truth estimator. Regarding inconsistent performance, the proposed framework enables various existing truth discovery methods to exhibit similar performance on the same input dataset. Regarding low efficiency, incorrect claims are removed iteratively, so the computational complexity of truth discovery is reduced. The main contributions are summarized as follows:

  • We address the truth discovery problem without following the generally accepted principle that more reliable sources provide claim values closer to the truth.

  • We theoretically verify that if the absolute distance between the mean and the median is large, then there must be incorrect claims with large errors in the input dataset (a brief numerical illustration follows this list).

  • We propose the MMC framework for truth detection, error claim removal, and iteration-stopping criteria.

  • We conduct experiments on three datasets to evaluate the performance of the proposed framework. The results demonstrate that if the input datasets are preprocessed by the proposed framework, various truth discovery methods exhibit similar performance on the same dataset; even the baseline mean method yields highly satisfactory results. Furthermore, since incorrect claims are removed from the input dataset, the running time of existing methods decreases.
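
As a brief numerical illustration of the mean-median signal (hypothetical values chosen for exposition), consider four claims about one entity:

    # Hypothetical claims about one entity (values chosen for exposition).
    claims = sorted([9.9, 10.0, 10.1, 30.0])
    mean = sum(claims) / len(claims)              # 15.0
    median = (claims[1] + claims[2]) / 2          # 10.05 (even number of claims)
    print(abs(mean - median))                     # 4.95: a large gap signals
                                                  # at least one large-error claim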

The remainder of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces and formally defines the problem of truth discovery with MMC. Section 4 describes the methodology, i.e., the MMC framework for truth detection, error claim removal, and iteration-stopping criteria. Section 5 presents the experiments. Finally, Section 6 concludes the paper.

Section snippets

Related work

Truth discovery is important in identifying trustworthy information. Several real-world applications that rely heavily on reliable information for decision-making can benefit from truth discovery, such as knowledge graph construction [12], [13], crowdsourcing [14], [15], and crowd sensing [16], [17]. Truth discovery is an advanced data-fusion technique for resolving conflicts among multisource data [1], [18]. The problem of truth discovery was formally introduced in [19]. Various scenarios have

Problem setting

We first introduce the terminology and notations used in this paper through examples. Subsequently, we investigate the limits of truth discovery. Finally, we formally define the problem, and we propose a technical challenge.

Probabilistic truth discovery with a mean and median check framework

Herein, we describe the proposed framework, whereby incorrect input claims provided by unreliable sources can be removed. Based on the remaining claims, we can take the median or mean value as the truth. Moreover, different existing truth discovery methods can be applied to this remaining claim set to obtain more trustworthy information, as shown in Fig. 2. It should be noted that different truth discovery methods can output identical estimated truths under the proposed framework.
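
The following sketch illustrates the kind of mean and median check loop described above; the detection test, removal rule, and stopping tolerance shown here are assumptions made for illustration and do not reproduce the paper's exact criteria:

    import statistics

    # Assumed sketch of a mean and median check loop: while the mean and median
    # disagree by more than a tolerance, drop the claim farthest from the median;
    # the surviving claims can then be averaged or fed to any existing method.
    def mmc_filter(claims, tol=0.5):
        claims = list(claims)
        while len(claims) > 2:
            if abs(statistics.mean(claims) - statistics.median(claims)) <= tol:
                break                              # stopping criterion (assumed form)
            med = statistics.median(claims)
            claims.remove(max(claims, key=lambda c: abs(c - med)))
        return claims

    print(mmc_filter([9.9, 10.0, 10.1, 30.0]))     # -> [9.9, 10.0, 10.1]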

Experiments

Herein, the proposed framework is evaluated on three real-world applications. The results demonstrate that MMC can effectively remove the incorrect claims provided by unreliable sources, and state-of-the-art truth discovery methods can achieve significant performance improvements if MMC is used for input-data preprocessing. Additionally, different truth discovery methods obtain similar results on the same input dataset. Finally, the running time of state-of-the-art truth discovery methods

Conclusions

Most existing truth discovery methods are derived from the generally accepted principle that more reliable sources provide claims closer to the truth. To address the issues resulting from this principle, we approached the truth discovery problem from another perspective, i.e., without adopting this principle. We theoretically verified that if the absolute distance between the mean and median value is large, there must be incorrect claims with large errors in the input dataset. Regarding the

CRediT authorship contribution statement

Songtao Ye: Conceptualization, Methodology, Writing – original draft. Junjie Wang: Software. Hongjie Fan: Data curation, Writing – original draft. Zhiqiang Zhang: Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work is supported by the National Natural Science Foundation of China (No. 61802327), the Natural Science Foundation of Hunan Province (No. 2018JJ3511), the China University of Political Science and Law Research Innovation Project (Grant No. 21FQ41001), and the Fundamental Research Funds for the Central Universities.

References (33)

  • Li Y. et al., A survey on truth discovery, SIGKDD Explor. Newsl. (2016)
  • Lyu S. et al., Truth discovery by claim and source embedding, IEEE Trans. Knowl. Data Eng. (2019)
  • Fávero L.P. et al.
  • Zhao B. et al., A probabilistic model for estimating real-valued truth from conflicting sources
  • Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, J. Han, Resolving conflicts in heterogeneous data by truth discovery and source...
  • Li Q. et al., A confidence-aware approach for truth discovery on long-tail data
  • Wan M. et al., From truth discovery to trustworthy opinion discovery: An uncertainty-aware quantitative modeling approach
  • Li Y. et al., On the discovery of evolving truth
  • Ye C. et al., Constrained truth discovery, IEEE Trans. Knowl. Data Eng. (2020)
  • S. Zhi, F. Yang, Z. Zhu, Q. Li, Z. Wang, J. Han, Dynamic truth discovery on numerical data, in: 2018 IEEE International...
  • Ouyang R.W. et al., Truth discovery in crowdsourced detection of spatial events
  • Bertsekas D.P., Non-Linear Programming (1999)
  • Wang X. et al., Approximate truth discovery via problem scale reduction
  • Dong X. et al., Knowledge vault: A web-scale approach to probabilistic knowledge fusion
  • Dong X.L. et al., From data fusion to knowledge fusion, Proc. VLDB Endow. (2014)
  • L. Jiang, X. Niu, J. Xu, D. Yang, L. Xu, Incentivizing the workers for truth discovery in crowdsourcing with copiers,...