Knowledge-Based Systems

Volume 233, 5 December 2021, 107482

Probabilistic model for truth discovery with mean and median check framework

https://doi.org/10.1016/j.knosys.2021.107482

Highlights

  • Addressing the truth discovery problem without following the generally accepted principle.

  • Theoretically verifying the absolute distance between the mean and median value.

  • Proposing the framework for truth detection, error claim removal, and iteration-stopping criteria.

  • Conducting experiments on three datasets to evaluate the performance.

Abstract

In the era of big data, information can be collected from various sources. Unfortunately, information provided by multiple sources on the same entity is inevitably conflicting. Due to the ubiquitous existence of data conflicts, truth discovery has recently attracted considerable attention. Several truth discovery methods focus on providing a point estimate for the truth of each entity and exhibit completely different performances on the same input dataset. Therefore, an appropriate truth discovery method should be adopted to fit the unknown source reliability distributions. To address this, we approach truth discovery from another perspective. We theoretically verify that if the absolute distance between the mean and median value is large, then there must be incorrect claims with large errors in the input dataset. Accordingly, we propose a mean and median check (MMC) framework for truth detection, error claim removal, and iteration-stopping criteria. The experiments demonstrate that MMC can effectively remove incorrect claims provided by unreliable sources. Furthermore, the performance of state-of-the-art truth discovery methods can be significantly improved if MMC is used for input data preprocessing.

Introduction

In the era of big data, information on the same object can be obtained from various data sources. For example, the price of a commodity may be listed differently on different websites, and the temperature of a local area may be reported differently by different sensors. However, there are errors, conflicts, and outdated data across different sources. To tackle this problem, truth discovery, which integrates multisource noisy information by estimating the reliability of each source, has attracted considerable attention [1].

A generally accepted principle for truth discovery is that more reliable sources are assumed to provide claims closer to the truth. Accordingly, most studies on truth discovery are concerned with determining the optimal reliability degree values of sources and the estimated truths [2], [3], [4], [5], [6], [7], [8], [9]. However, as an increasing number of truth discovery methods are being developed, the following issues related to this principle are becoming impediments to practical applications.

Biased truth estimator: The generally accepted principle implies that the truth can be estimated as a weighted combination of the claims from different sources. Without any prior knowledge, however, truth discovery methods must begin with uniform source weights. Unfortunately, some incorrect claims may deviate significantly from the truth, either intentionally or due to noise, yet their sources initially receive the same weights as reliable ones. Therefore, unless the weights of unreliable sources are initialized to sufficiently small values, a biased truth estimator is obtained, as the sketch below illustrates.
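
As a toy illustration of this issue (hypothetical values, not taken from the paper's datasets), a single grossly incorrect claim with a uniform initial weight pulls the weighted estimate far from the truth:

    # Toy illustration (hypothetical values): with uniform initial weights,
    # one grossly incorrect claim biases the weighted truth estimate.
    claims = [10.1, 9.9, 10.0, 10.2, 35.0]        # last source is unreliable
    weights = [1.0 / len(claims)] * len(claims)   # uniform initialization

    estimate = sum(w * c for w, c in zip(weights, claims))
    print(estimate)   # ~15.04, far from the underlying value of about 10.0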

Inconsistent performance: Truth discovery aims to identify the most trustworthy information without any knowledge of the source reliability distribution. However, because different methods make different assumptions about this distribution, they may perform inconsistently on the same dataset, and selecting a method that fits an unknown source reliability distribution is difficult in practice.

Low efficiency: Because truth estimation and the computation of source weights are tightly coupled, coordinate descent is commonly adopted, in which one set of variables is fixed while solving for the other [10]. Most existing methods are therefore iterative, and all claims, including incorrect ones, are involved in every iteration; the computational complexity of coordinate descent grows with the size of the input claim set [3], [11]. A minimal sketch of this pattern is given below.
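
The following is a minimal sketch of the coordinate-descent pattern described above, in the spirit of CRH-style methods rather than the formulation of any particular paper; the inverse-error weight update is an assumed form chosen only for illustration:

    import math

    # Sketch of iterative truth discovery by coordinate descent (illustrative only;
    # not the authors' formulation). claims[i] is the value source i reports for a
    # single entity; every claim participates in every iteration, so the cost grows
    # with the size of the input claim set.
    def coordinate_descent(claims, iterations=10):
        weights = [1.0] * len(claims)                     # uniform start
        for _ in range(iterations):
            # Fix the weights, update the estimated truth (weighted mean).
            truth = sum(w * c for w, c in zip(weights, claims)) / sum(weights)
            # Fix the truth, update each weight from the source's error
            # (a simple inverse-error rule assumed here for illustration).
            errors = [abs(c - truth) + 1e-9 for c in claims]
            weights = [math.log(sum(errors) / e) for e in errors]
        return truth

    print(coordinate_descent([10.1, 9.9, 10.0, 10.2, 35.0]))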

Accordingly, we naturally raise the question of whether the truth discovery problem can be addressed without following this generally accepted principle.

In this paper, we propose performing truth discovery within a mean and median check (MMC) framework. We design a novel framework that addresses the aforementioned issues without following the generally accepted principle. Regarding the biased truth estimator, the proposed framework eliminates incorrect claims with significant deviations, so the mean or median of the remaining claims can serve as the truth estimator. Regarding inconsistent performance, the proposed framework enables various existing truth discovery methods to exhibit similar performance on the same input dataset. Regarding low efficiency, incorrect claims are removed iteratively, so the computational complexity of truth discovery is reduced. The main contributions are summarized as follows:

  • We address the truth discovery problem without following the generally accepted principle that more reliable sources provide claim values closer to the truth.

  • We theoretically verify that if the absolute distance between the mean and the median is large, then there must be incorrect claims with large errors in the input dataset (a brief numerical illustration follows this list).

  • We propose the MMC framework for truth detection, error claim removal, and iteration-stopping criteria.

  • We conduct experiments on three datasets to evaluate the performance of the proposed framework. The results demonstrate that if the input datasets are preprocessed by the proposed framework, various truth discovery methods exhibit similar performance on the same dataset; even the baseline mean method yields highly satisfactory results. Furthermore, since incorrect claims are removed from the input dataset, the running time of existing methods decreases.
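
As a brief numerical illustration of the mean-median signal (hypothetical values chosen for exposition), consider four claims about one entity:

    # Hypothetical claims about one entity (values chosen for exposition).
    claims = sorted([9.9, 10.0, 10.1, 30.0])
    mean = sum(claims) / len(claims)              # 15.0
    median = (claims[1] + claims[2]) / 2          # 10.05 (even number of claims)
    print(abs(mean - median))                     # 4.95: a large gap signals
                                                  # at least one large-error claim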

The remainder of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces and formally defines the problem of truth discovery with MMC. Section 4 describes the methodology, i.e., the MMC framework for truth detection, error claim removal, and iteration-stopping criteria. Section 5 presents the experiments. Finally, Section 6 concludes the paper.

Section snippets

Related work

Truth discovery is important in identifying trustworthy information. Several real-world applications that rely heavily on reliable information for decision-making can benefit from truth discovery, such as knowledge graph construction [12], [13], crowdsourcing [14], [15], and crowd sensing [16], [17]. Truth discovery is an advanced data-fusion technique for resolving conflicts among multisource data [1], [18]. The problem of truth discovery was formally introduced in [19]. Various scenarios have

Problem setting

We first introduce the terminology and notations used in this paper through examples. Subsequently, we investigate the limits of truth discovery. Finally, we formally define the problem, and we propose a technical challenge.

Probabilistic truth discovery with a mean and median check framework

Herein, we describe the proposed framework, whereby incorrect input claims provided by unreliable sources can be removed. Based on the remaining claims, we can take the median or mean value as the truth. Moreover, different existing truth discovery methods can be applied to this remaining claim set to obtain more trustworthy information, as shown in Fig. 2. It should be noted that different truth discovery methods can output identical estimated truths under the proposed framework.
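
The following sketch illustrates the kind of mean and median check loop described above; the detection test, removal rule, and stopping tolerance shown here are assumptions made for illustration and do not reproduce the paper's exact criteria:

    import statistics

    # Assumed sketch of a mean and median check loop: while the mean and median
    # disagree by more than a tolerance, drop the claim farthest from the median;
    # the surviving claims can then be averaged or fed to any existing method.
    def mmc_filter(claims, tol=0.5):
        claims = list(claims)
        while len(claims) > 2:
            if abs(statistics.mean(claims) - statistics.median(claims)) <= tol:
                break                              # stopping criterion (assumed form)
            med = statistics.median(claims)
            claims.remove(max(claims, key=lambda c: abs(c - med)))
        return claims

    print(mmc_filter([9.9, 10.0, 10.1, 30.0]))     # -> [9.9, 10.0, 10.1]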

Experiments

Herein, the proposed framework is evaluated on three real-world applications. The results demonstrate that MMC can effectively remove the incorrect claims provided by unreliable sources, and state-of-the-art truth discovery methods can achieve significant performance improvements if MMC is used for input-data preprocessing. Additionally, different truth discovery methods obtain similar results on the same input dataset. Finally, the running time of state-of-the-art truth discovery methods

Conclusions

Most existing truth discovery methods are derived from the generally accepted principle that more reliable sources provide claims closer to the truth. To address the issues resulting from this principle, we approached the truth discovery problem from another perspective, i.e., without adopting this principle. We theoretically verified that if the absolute distance between the mean and median value is large, there must be incorrect claims with large errors in the input dataset. Regarding the

CRediT authorship contribution statement

Songtao Ye: Conceptualization, Methodology, Writing – original draft. Junjie Wang: Software. Hongjie Fan: Data curation, Writing – original draft. Zhiqiang Zhang: Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work is supported by the National Natural Science Foundation of China (No. 61802327), the Natural Science Foundation of Hunan Province (No. 2018JJ3511), the China University of Political Science and Law Research Innovation Project (Grant No. 21FQ41001), and the Fundamental Research Funds for the Central Universities.

References (33)

  • Li Y. et al., A survey on truth discovery, SIGKDD Explor. Newsl. (2016)
  • Lyu S. et al., Truth discovery by claim and source embedding, IEEE Trans. Knowl. Data Eng. (2019)
  • Fávero L.P. et al.
  • Zhao B. et al., A probabilistic model for estimating real-valued truth from conflicting sources
  • Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, J. Han, Resolving conflicts in heterogeneous data by truth discovery and source...
  • Li Q. et al., A confidence-aware approach for truth discovery on long-tail data
  • Wan M. et al., From truth discovery to trustworthy opinion discovery: An uncertainty-aware quantitative modeling approach
  • Li Y. et al., On the discovery of evolving truth
  • Ye C. et al., Constrained truth discovery, IEEE Trans. Knowl. Data Eng. (2020)
  • S. Zhi, F. Yang, Z. Zhu, Q. Li, Z. Wang, J. Han, Dynamic truth discovery on numerical data, in: 2018 IEEE International...
  • Ouyang R.W. et al., Truth discovery in crowdsourced detection of spatial events
  • Bertsekas D.P., Non-Linear Programming (1999)
  • Wang X. et al., Approximate truth discovery via problem scale reduction
  • Dong X. et al., Knowledge vault: A web-scale approach to probabilistic knowledge fusion
  • Dong X.L. et al., From data fusion to knowledge fusion, Proc. VLDB Endow. (2014)
  • L. Jiang, X. Niu, J. Xu, D. Yang, L. Xu, Incentivizing the workers for truth discovery in crowdsourcing with copiers,...