
Approximating Global Optimum for Probabilistic Truth Discovery


Abstract

The problem of truth discovery arises in many areas, such as databases, data mining, crowdsourcing, and machine learning. It seeks trustworthy information from possibly conflicting data provided by multiple sources. Due to its practical importance, the problem has been studied extensively in recent years. Two competing models have been proposed for truth discovery: the weight-based model and the probabilistic model. While \((1+\epsilon )\)-approximations have already been obtained for the weight-based model, no solution with a quality guarantee was previously known for the probabilistic model. In this paper, we focus on the probabilistic model and formulate it as a geometric optimization problem. Based on a sampling technique and a few other ideas, we achieve the first \((1 + \epsilon )\)-approximation solution. Our techniques can also be used to solve the more general multi-truth discovery problem. We validate our method through experiments on both synthetic and real-world datasets (including teaching evaluation data) and compare its performance to several existing approaches. The experimental results show that our solutions are closer to both the ground truth and the global optimum. The general technique we develop also has the potential to be applied to other geometric optimization problems.


Notes

  1. For categorical data, the Gaussian distribution may produce fractional answers, which can be viewed as a probability distribution over the possible truths. In practice, the variances of the different coordinates of the truth vector may differ, and there might be non-zero covariance between coordinates; however, up to a linear transformation, we may assume the covariance matrix is \(\sigma _i^2 I_d\).

  2. Also referred to as a polynomially growing function or a Log–Log Lipschitz function in the literature.

References

  1. Bādoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pp. 250–257. ACM (2002)

  2. Ding, H., Gao, J., Xu, J.: Finding global optimum for truth discovery: entropy based geometric variance. In: LIPIcs–Leibniz International Proceedings in Informatics, vol. 51. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2016)

  3. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–610. ACM (2014)

  4. Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. Proc. VLDB Endow. 2(1), 550–561 (2009)


  5. Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. Proc. VLDB Endow. 2(1), 562–573 (2009)


  6. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569–578. ACM (2011)

  7. Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 131–140. ACM (2010)

  8. Huang, Z., Ding, H., Xu, J.: Faster algorithm for truth discovery via range cover. In: Proceedings of Algorithms and Data Structures Symposium (WADS 2017), pp. 461–472 (2017)

  9. Kumar, A., Sabharwal, Y., Sen, S.: Linear time algorithms for clustering problems in any dimensions. In: International Colloquium on Automata, Languages, and Programming, pp. 1374–1385. Springer (2005)

  10. Li, F., Lee, M.L., Hsu, W.: Entity profiling with varying source reliabilities. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1146–1155. ACM (2014)

  11. Li, Q., Li, Y., Gao, J., Zhao, B., Fan, W., Han, J.: Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1187–1198. ACM (2014)

  12. Pasternack, J., Roth, D.: Knowing what to believe (when you already know something). In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 877–885. Association for Computational Linguistics (2010)

  13. Report on teaching evaluation. http://theconversation.com/students-dont-know-whats-best-for-their-own-learning-33835

  14. Wang, X., Sheng, Q.Z., Fang, X.S., Yao, L., Xu, X., Li, X.: An integrated Bayesian approach for effective multi-truth discovery. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 493–502. ACM (2015)

  15. Welinder, P., Branson, S., Belongie, S.J., Perona, P.: The multidimensional wisdom of crowds. In: Advances in Neural Information Processing Systems, vol. 23, pp. 2424–2432 (2010)

  16. Whitehill, J., Wu, T.-F., Bergsma, J., Movellan, J.R., Ruvolo, P.L.: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: Advances in Neural Information Processing Systems, pp. 2035–2043 (2009)

  17. Xiao, H., Gao, J., Li, Q., Ma, F., Su, L., Feng, Y., Zhang, A.: Towards confidence in the truth: a bootstrapping based truth discovery approach. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1935–1944. ACM (2016)

  18. Xiao, H., Gao, J., Wang, Z., Wang, S., Su, L., Liu, H.: A truth discovery approach with theoretical guarantee. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1925–1934. ACM (2016)

  19. Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)


  20. Zhao, B., Han, J.: A probabilistic model for estimating real-valued truth from conflicting sources. In: Proceedings of QDB (2012)

  21. Zhao, B., Rubinstein, B.I.P., Gemmell, J., Han, J.: A Bayesian approach to discovering truth from conflicting sources for data integration. Proc. VLDB Endow. 5(6), 550–561 (2012)



Author information


Correspondence to Jinhui Xu.

Additional information


This research was supported in part by NSF through Grants CCF-1422324, IIS-1422591, CCF-1716400, and IIS-1910492. A preliminary version of this work appeared in COCOON'18.

A Experiments


We validate our method on both synthetic and real-world datasets. We use the objective function value as the main criterion for performance evaluation, and compare the results of our sampling method to those of several other truth discovery models. In particular, we compare our Probabilistic Truth Discovery model (PTD) with the Randomized Gaussian Mixture Model (RGMM) of [18], which was the best known solution to the problem prior to our result. We also compare with the traditional "mean value" approach, which simply takes the average of all the points in P. For the "baseline", we try every point in P as the candidate truth and take the one with the minimum objective value. Note that by Lemma 4, the baseline is a 4-approximation. Throughout the experiments, we set the number of iterations in our sampling algorithm to \(N=3^{\dim (X)}\).
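As a concrete reference, the sketch below shows how the "baseline" and "mean value" competitors can be computed for a point set P. The objective function used here is an assumption made for illustration only: a Gaussian negative log-likelihood with a per-source standard deviation that is optimized in closed form and floored at the hyperparameter \(\sigma _0\). It is not taken verbatim from the paper, though it exhibits the behavior described below (the mean is near-optimal for large \(\sigma _0\), and the optimum approaches an input point as \(\sigma _0\) tends to 0).

```python
# Minimal sketch; the objective below is an illustrative assumption,
# not the paper's exact formulation.
import numpy as np

def ptd_objective(x, P, sigma0):
    """Cost of candidate truth x for an (n x d) point set P."""
    d = P.shape[1]
    r = np.linalg.norm(P - x, axis=1)        # distance of each source's claim to x
    # For fixed x, the sigma minimizing d*ln(sigma) + r^2/(2*sigma^2)
    # is r/sqrt(d); we clip it from below by the floor sigma0.
    sigma = np.maximum(sigma0, r / np.sqrt(d))
    return float(np.sum(d * np.log(sigma) + r**2 / (2 * sigma**2)))

def baseline(P, sigma0):
    """Try every input point as the truth and keep the cheapest one."""
    costs = [ptd_objective(p, P, sigma0) for p in P]
    i = int(np.argmin(costs))
    return P[i], costs[i]

def mean_value(P, sigma0):
    """The traditional averaging approach."""
    m = P.mean(axis=0)
    return m, ptd_objective(m, P, sigma0)
```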

A.1 Description of Data

As discussed in Sect. 2, when the hyperparameter \(\sigma _0\) is large compared to the relative distances between points, the mean of all the points is a fairly good approximation of the optimum. It could be that the truth vector is indeed the average of all the input points; this is a trivial case whose solution is easily obtained. The more interesting case is when the mean point has a high cost. In such a scenario, one must rely on, for example, our sampling algorithm to approximate the unknown point that actually optimizes the objective function. The optimal point in real-world data may reflect the underlying facts, as will be shown in Sect. A.3.

We consider three different settings for synthetic data. In each setting, the mean of the points is not the optimal point when \(\sigma _0\) is chosen appropriately.

We evaluate our method on three sets of real-world data. The first two were used in [18]. One dataset has 44 sources, with each observation being 7-dimensional; the observations are estimates of the lengths of indoor hallways given by random online users. The other dataset consists of weather forecasts from 10 sources over 88 cities.

The third real-world dataset comes from an interesting application of our truth discovery method to teaching evaluation. Recent studies on teaching evaluations suggest that students evaluate their teachers more positively when they learn less [13]. A dilemma often faced by teachers is that the better job they do, the lower the evaluation scores they receive. This is because teaching evaluation is often contaminated by many irrelevant factors, such as students' personal preferences towards the teacher or the course subject. Due to the anonymous nature of the teaching evaluation process, such issues are difficult, if not impossible, to eliminate from the evaluation. To mitigate this problem, we acquired raw teaching evaluation data from the last six semesters for an instructor in our department whose performance is generally very good. However, in one semester (Spring 2016), students gave conflicting reviews (partially due to a sudden change of curriculum in that semester), resulting in a poor overall evaluation (based on the mean evaluation scores). Using our truth discovery model, we show that it is possible to obtain a more accurate teaching evaluation by automatically accounting for outliers. We believe that our method has the potential to be used as a fairer evaluation tool for teachers.

A.2 Results on Synthetic Data


Setting 1: One cluster with outliers

Fig. 1 Left: Comparison of objective function values as \(\sigma _0\) varies. Right: Comparison of objective function values as the number of sources varies. (Setting 1)

This dataset consists of 150 points in 7 dimensions. They are first drawn from a multivariate Gaussian distribution with covariance matrix 0.1I. Then, the 50th and 100th data items are replaced with "outliers" generated by a Gaussian with covariance matrix 40000I.
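For reproducibility, a minimal sketch of how such a dataset can be generated (the 0-based indices and the fixed seed are our assumptions; the text only states that the 50th and 100th items are replaced):

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed; an assumption
n, d = 150, 7
P = rng.multivariate_normal(np.zeros(d), 0.1 * np.eye(d), size=n)
# Replace the 50th and 100th data items with far-away "outliers".
for i in (49, 99):
    P[i] = rng.multivariate_normal(np.zeros(d), 40000 * np.eye(d))
```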

In the legends of all the figures, PTD stands for Probabilistic Truth Discovery, and the corresponding data is from the algorithm given in this paper. RGMM stands for the Randomized Gaussian Mixture Model of [18], and the corresponding data is from an algorithm implementing that model. The data labeled "baseline" is generated by picking the smallest objective value obtained when trying every point in P as the candidate truth. Note that by Lemma 4, the baseline value is always between \(\texttt {AVG}\) and \(4\texttt {AVG}\). Finally, the data labeled "mean value" is generated by simply taking the average of all the points in P.

Figure 1 Left shows how the performance of the four methods changes as \(\sigma _0\) varies from \(10^{-3}\) to \(10^3\). In this experiment we focus on the relative differences between the methods, since the absolute objective value does not carry much meaning; for a better comparison, we scale everything to the baseline. Figure 1 Left suggests that PTD always performs better than the averaging method and RGMM, since outliers can affect the averaging method significantly.
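This sweep and normalization can be sketched as follows, reusing the hypothetical ptd_objective, baseline, and mean_value helpers from above (the PTD and RGMM solvers themselves are not shown; only the mean-value curve is illustrated):

```python
import numpy as np

# Sweep sigma_0 over [1e-3, 1e3] and scale each method's cost to the baseline.
for s0 in np.logspace(-3, 3, num=25):
    _, base_cost = baseline(P, s0)
    _, avg_cost = mean_value(P, s0)
    print(f"sigma0={s0:.3g}  mean/baseline={avg_cost / base_cost:.3f}")
```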


Figure 1 Right depicts how the performance of each method changes as the number of sources increases from 1 to 150. In this experiment, \(\sigma _0\) is set to 1, and the Y axis shows the averaged objective function value. One interesting observation is that there are two pulses around the x-coordinates 50 and 100, which is exactly where the outliers lie. This suggests that the averaging method is far more sensitive to outliers than PTD. Another interesting observation is that, in the scaled subplot, RGMM does outperform the averaging method in terms of objective function value.


Setting 2: Two clusters with the same size and variance

Fig. 2 Left: \(\sigma _0\) varies. Right: number of sources varies. (Setting 2)

This dataset consists of \(n=100\) points in \(d=7\) dimensions. Both the first and second halves of the data are generated by a Gaussian with covariance matrix 0.1I. The first half is centered at the origin, and the second half at a random location \(3\sqrt{0.1\cdot d}\) away from the origin. (A generation sketch covering both this setting and Setting 3 appears after Setting 3's description below.)

Figure 2 Left shows the performance of the four methods as \(\sigma _0\) ranges from \(4^{-8}\) to \(4^{-1}\). It is worth pointing out that when \(\sigma _0\) is small, the result of PTD is almost identical to the baseline. The underlying reason is that as \(\sigma _0\) tends to 0, the optimal solution tends to one of the points in P, as mentioned in Sect. 2.


Figure 2 Right depicts how the performance of the compared algorithms changes as the number of sources increases. In this experiment, \(\sigma _0\) is set to 0.25. It is interesting to see that the relative performance of PTD and the averaging method reverses once the number of sources exceeds 50 (where the second cluster comes into play).


Setting 3: Two clusters with the same size but different variances

Fig. 3 Left: \(\sigma _0\) varies. Right: number of sources varies. (Setting 3)

Again, the dataset consists of \(n=100\) points in \(d=7\) dimensions. This time the first half of the data is generated by a Gaussian with covariance 0.1I centered at the origin, and the second half by a Gaussian with covariance \(10^{-5}I\) centered at a random point \(\sqrt{0.1\cdot d}\) away from the origin.
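Since Settings 2 and 3 differ only in the second cluster's covariance and its distance from the origin, both datasets can be generated by one parameterized sketch (the random direction of the second center and the seed are our assumptions):

```python
import numpy as np

def two_clusters(cov2, dist, n=100, d=7, seed=0):
    """Setting 2: cov2=0.1, dist=3*sqrt(0.1*d). Setting 3: cov2=1e-5, dist=sqrt(0.1*d)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    center2 = dist * u / np.linalg.norm(u)   # a random point at distance `dist`
    first = rng.multivariate_normal(np.zeros(d), 0.1 * np.eye(d), size=n // 2)
    second = rng.multivariate_normal(center2, cov2 * np.eye(d), size=n // 2)
    return np.vstack([first, second])
```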

Figure 3 Left shows the performance of the four methods as \(\sigma _0\) ranges from \(4^{-8}\) to \(4^{-1}\). Compared to Fig. 2 Left, the gap between the baseline and the averaging method is larger in this experiment. This is mainly because the second half of the points is tightly clustered around a single point: our model naturally favors that point, while the averaging method fails to capture this information, and so does RGMM.

Figure 3 Right again shows how the performance changes as the number of sources varies. In this experiment, \(\sigma _0\) is set to 0.25. A pattern similar to that of Fig. 2 Right can be observed. Also, the baseline always performs worse than the averaging method, while our sampling algorithm discovers points with lower cost.

A.3 Results on Real-World Data

Fig. 4 Real dataset: indoor floor plan dataset

Fig. 5 Real dataset: weather forecast dataset

The performance comparison on the first two real-world datasets shows patterns similar to those on the synthetic datasets (see Figs. 4 and 5). Since RGMM does not converge on the weather forecast dataset, we exclude it from those plots. The indoor floor plan dataset is plotted with \(\sigma _0=1/32\), and the weather forecast dataset with \(\sigma _0=1\).

The third real-world dataset consists of raw teaching evaluation data from six semesters for an instructor (whose teaching performance is in general very good) in the Department of Computer Science and Engineering at the State University of New York at Buffalo. Each semester, the questionnaire contains four questions directly related to the instructor's performance, each answered with a discrete value from 1 to 5; thus the data for each semester is 4-dimensional. The numbers of sources (i.e., students) are 198, 158, 40, 182, 57, and 21 for Fall 2014, Spring 2015, Fall 2015, Spring 2016, Fall 2016, and Spring 2017, respectively. We run each algorithm on this dataset for each semester with \(\sigma _0=0.2\). Since 4-dimensional results are hard to visualize, we use the sum of all coordinates as the final evaluation and compare the results; the maximum possible evaluation value is therefore 20.
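Since the final per-semester score is simply the coordinate sum of the recovered 4-dimensional truth vector, the aggregation step can be sketched as follows (the solve callback is a placeholder for whichever method is being compared; it is not the paper's API):

```python
import numpy as np

def semester_score(P, solve, sigma0=0.2):
    """P: (students x 4) answers in {1,...,5}; returns a score in [4, 20]."""
    truth = solve(P, sigma0)                 # placeholder: PTD, RGMM, or the mean
    return float(np.sum(truth))              # sum of the 4 coordinates; max is 20

# Example with the averaging method:
# score = semester_score(P, lambda P, s0: P.mean(axis=0))
```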

Fig. 6 Real-world dataset: teaching evaluations

As mentioned earlier, an external factor (e.g., the sudden change of curriculum) caused outliers to appear in the Spring 2016 data for this instructor. As the results show, the truth discovery model can mitigate the negative influence of these outliers. Thus, our truth discovery model reveals a more consistent teaching performance across the whole six semesters, where high evaluations correspond to lower costs in the objective function.

The cost function for Fall 2016 has an exceptionally low value for the solutions of all three models because the sources unanimously agreed on a high evaluation. This is also reflected by the evaluation values on the left of Fig. 6.

From all the experiments in this section, we can see that our sampling-based technique has much better overall performance than the compared methods. By setting the hyperparameter \(\sigma _0\) appropriately, our technique can discover potential patterns hidden in the dataset, and it can be used in practice.


Cite this article

Li, S., Xu, J. & Ye, M. Approximating Global Optimum for Probabilistic Truth Discovery. Algorithmica 82, 3091–3116 (2020). https://doi.org/10.1007/s00453-020-00715-5

