Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity

  • Regular Paper
The VLDB Journal

Abstract

Data fusion, a step of the data integration pipeline, addresses the problem of discovering the true values of a data item when multiple sources provide different values for it. Assessing the quality of the sources involved, and relying more on the values that come from trusted ones, contributes substantially to solving this problem. State-of-the-art data fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources; it has also recently been recognized that the trustworthiness of the same source may vary with the domain of interest. In this paper we propose STORM, a novel domain-aware algorithm for data fusion designed for the multi-truth case, that is, the case in which a data item can have multiple true values. Like many other data fusion techniques, STORM relies on Bayesian inference. Unlike other Bayesian approaches to the problem, however, it determines the trustworthiness of sources by taking into account their authority: we define authoritative sources as those that have been copied by many others, on the assumption that source administrators who decide to copy data choose the sources they perceive as the most reliable. STORM also provides a value-reconciliation step that groups together the values recognized as variants of the same real-world entity, thus reducing the chance of mistakes in the rest of the algorithm. Experimental results on multi-truth synthetic and real-world datasets show that STORM represents a solid step forward in data fusion research.
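
To make the three ideas in the abstract concrete (value reconciliation, copy-based authority, and multi-truth selection), the following Python toy may help. It is only an illustrative sketch under loud assumptions and is not STORM: the abstract gives no formulas, so the similarity measure (difflib's Ratcliff/Obershelp-style ratio), the 0.8 and 0.5 thresholds, the "1 + authority" weighting, and the plain weighted vote used in place of Bayesian inference are all hypothetical stand-ins. The copy relationships are assumed to be known in advance, and domain awareness is not modeled.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def reconcile(values, threshold=0.8):
    """Map each string variant to a representative (greedy, order-dependent).

    Assumption: two values are variants when their difflib similarity
    ratio reaches `threshold`; the threshold is an arbitrary choice.
    """
    canon, reps = {}, []
    for v in values:
        if v in canon:
            continue
        for rep in reps:
            if SequenceMatcher(None, v.lower(), rep.lower()).ratio() >= threshold:
                canon[v] = rep
                break
        else:  # no existing representative is similar enough
            reps.append(v)
            canon[v] = v
    return canon


def authority(copy_edges, sources):
    """Score a source by the fraction of the other sources that copy from it."""
    copiers = defaultdict(set)
    for copier, original in copy_edges:
        copiers[original].add(copier)
    others = max(len(sources) - 1, 1)
    return {s: len(copiers[s]) / others for s in sources}


def fuse(claims, weights, accept=0.5):
    """Keep every value whose share of the total source weight reaches `accept`.

    Several values can pass at once, which is what makes this multi-truth.
    A weighted vote is a stand-in here, not Bayesian inference.
    """
    support = defaultdict(float)
    total = sum(weights[s] for s in claims)
    for s, vals in claims.items():
        for v in vals:
            support[v] += weights[s]
    return {v for v, w in support.items() if total > 0 and w / total >= accept}


# Hypothetical data: four sources listing the authors of one book.
raw_claims = {
    "A": ["J. Smith", "M. Rossi"],
    "B": ["J. Smith"],
    "C": ["J Smith", "M. Rossi"],
    "D": ["K. Lee"],
}
copy_edges = [("B", "A"), ("C", "A"), ("D", "B")]  # (copier, copied source)

canon = reconcile([v for vals in raw_claims.values() for v in vals])
claims = {s: {canon[v] for v in vals} for s, vals in raw_claims.items()}
auth = authority(copy_edges, list(raw_claims))
# "1 + authority" is an arbitrary smoothing so uncopied sources still vote.
weights = {s: 1.0 + a for s, a in auth.items()}
truths = fuse(claims, weights)
print(sorted(truths))
```

In this toy data, "J Smith" is merged into "J. Smith" before voting, so the variant does not split the support for the same author. A fuller treatment would also discount the votes of copiers, since the abstract notes that dependence between sources affects trustworthiness; that is left out here for brevity.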

Notes

  1. The value of these attributes usually contains a special character (e.g., “;”, “&”) used as separator.

  2. The Conclusion and Future Work section contains some comments on the problem of dealing with more than one attribute.

  3. http://lunadong.com/fusionDataSets.htm.

  4. https://www.wikidata.org/wiki/Q892.

  5. Available at http://lunadong.com/fusionDataSets.htm.

  6. https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.

  7. https://github.com/daqcri/DAFNA-EA.

Author information

Corresponding author

Correspondence to Emanuele Rabosio.

Ethics declarations

Competing Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Cite this article

Azzalini, F., Piantella, D., Rabosio, E. et al. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal 32, 475–500 (2023). https://doi.org/10.1007/s00778-022-00757-x

  • DOI: https://doi.org/10.1007/s00778-022-00757-x
