Abstract
Data fusion, a step of the data integration pipeline, addresses the problem of discovering the true values of a data item when multiple sources provide different values for it. Assessing the quality of the sources involved, and relying more on values that come from trusted sources, can contribute substantially to solving this problem. State-of-the-art data fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources; more recently, it has also been recognized that the trustworthiness of the same source may vary with the domain of interest. In this paper we propose STORM, a novel domain-aware algorithm for data fusion designed for the multi-truth case, that is, the case in which a data item can have multiple true values. Like many other data-fusion techniques, STORM relies on Bayesian inference. However, unlike other Bayesian approaches to the problem, it determines the trustworthiness of sources by taking into account their authority: we define authoritative sources as those that have been copied by many other sources, on the assumption that, when source administrators decide to copy data from other sources, they choose the ones they perceive as the most reliable. STORM also provides a value-reconciliation step that groups together the values recognized as variants representing the same real-world entity, thus reducing the possibility of making mistakes in the remaining part of the algorithm. Experimental results on multi-truth synthetic and real-world datasets show that STORM represents a solid step forward in data-fusion research.
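To make the notion of copy-based authority concrete, the following minimal sketch scores each source by the fraction of other sources that copy from it. It is an illustration only, not STORM's actual formulation: the abstract states only that authoritative sources are those copied by many others, so the function name `copy_based_authority`, the edge-list input, and the normalization by the number of other sources are all assumptions made for this example. The copy relationships are also assumed to be already known, whereas in practice they would have to be detected.

```python
from collections import defaultdict


def copy_based_authority(copy_edges, sources):
    """Score each source by the share of the other sources that copy from it.

    copy_edges: iterable of (copier, original) pairs (hypothetical input format).
    sources: list of all source identifiers.
    Returns a dict mapping each source to a score in [0, 1].
    """
    copiers = defaultdict(set)
    for copier, original in copy_edges:
        if copier != original:  # ignore self-loops
            copiers[original].add(copier)
    # Normalize by the number of *other* sources; guard against a single source.
    others = max(len(sources) - 1, 1)
    return {s: len(copiers[s]) / others for s in sources}


# Toy example: B and C copy from A, D copies from B.
sources = ["A", "B", "C", "D"]
copy_edges = [("B", "A"), ("C", "A"), ("D", "B")]
authority = copy_based_authority(copy_edges, sources)
```

In this toy setting, A is copied by two of the three other sources and therefore receives the highest score, B is copied by one, and C and D by none. A score of this kind could then serve as one input to a Bayesian trustworthiness estimate.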
Notes
The value of these attributes usually contains a special character (e.g., “;”, “ &”) used as separator.
The Conclusion and Future Work section contains some comments on the problem of dealing with more than one attribute.
Available at http://lunadong.com/fusionDataSets.htm.
Ethics declarations
Competing Interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Azzalini, F., Piantella, D., Rabosio, E. et al. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal 32, 475–500 (2023). https://doi.org/10.1007/s00778-022-00757-x
DOI: https://doi.org/10.1007/s00778-022-00757-x