Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity

Azzalini, Fabio; Piantella, Davide; Rabosio, Emanuele; Tanca, Letizia

doi:10.1007/s00778-022-00757-x

Regular Paper
Published: 19 July 2022

Volume 32, pages 475–500, (2023)
The VLDB Journal

Fabio Azzalini¹,²,
Davide Piantella¹,
Emanuele Rabosio² (ORCID: orcid.org/0000-0003-3722-7789) &
Letizia Tanca¹

Abstract

Data fusion, a step of the data integration pipeline, addresses the problem of discovering the true values of a data item when multiple sources provide different values for it. Assessing the quality of the sources involved, and relying more on the values coming from trusted ones, contributes significantly to solving this problem. State-of-the-art data fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources, and it has recently also been recognized that the trustworthiness of the same source may vary with the domain of interest. In this paper we propose STORM, a novel domain-aware algorithm for data fusion designed for the multi-truth case, that is, the case in which a data item can have multiple true values. Like many other data fusion techniques, STORM relies on Bayesian inference. However, unlike other Bayesian approaches to the problem, it determines the trustworthiness of sources by taking into account their authority: we define authoritative sources as those that have been copied by many other sources, assuming that, when source administrators decide to copy data from other sources, they choose the ones they perceive as the most reliable. STORM also provides a value-reconciliation step that groups together the values recognized as variants representing the same real-world entity, thus reducing the possibility of mistakes in the remaining part of the algorithm. Experimental results on multi-truth synthetic and real-world datasets show that STORM represents a solid step forward in data fusion research.
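To make the two ideas in the abstract more concrete, the following minimal Python sketch illustrates copy-based authority and value reconciliation. It is not STORM's actual formulation, which the abstract does not spell out. The function names `copy_authority` and `reconcile_values`, the normalization by the number of other sources, the greedy grouping strategy, the use of `difflib.SequenceMatcher` as similarity measure, and the `0.85` threshold are all assumptions made for illustration. The sketch also assumes that the copy relationships are already known, whereas in practice they would have to be detected.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def copy_authority(copy_edges, sources):
    """Score each source by the fraction of the other sources that copy from it.

    copy_edges is an iterable of (copier, original) pairs (assumed to be given).
    """
    copiers = defaultdict(set)
    for copier, original in copy_edges:
        if copier != original:  # ignore self-loops
            copiers[original].add(copier)
    others = max(len(sources) - 1, 1)  # avoid division by zero for one source
    return {s: len(copiers[s]) / others for s in sources}


def reconcile_values(values, threshold=0.85):
    """Greedily group strings that are similar to a group's first member.

    Two strings are treated as variants of the same entity when their
    case-insensitive SequenceMatcher ratio reaches the (assumed) threshold.
    """
    groups = []
    for value in values:
        for group in groups:
            ratio = SequenceMatcher(None, value.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(value)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([value])
    return groups


# Hypothetical example: B and C copy from A, and D copies from C.
sources = ["A", "B", "C", "D"]
copy_edges = [("B", "A"), ("C", "A"), ("D", "C")]
authority = copy_authority(copy_edges, sources)

# Hypothetical author names provided by different sources for one book.
groups = reconcile_values(["J. K. Rowling", "J.K. Rowling", "Stephen King"])
```

In this toy setting, source A is copied by two of the three other sources and therefore receives the highest authority score, while the two spellings of the same author name end up in one group. How such scores would enter the Bayesian inference is beyond what the abstract describes and is deliberately left out of the sketch.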

Notes

The value of these attributes usually contains a special character (e.g., “;”, “ &”) used as separator.
The Conclusion and Future Work section contains some comments on the problem of dealing with more than one attribute.
http://lunadong.com/fusionDataSets.htm.
https://www.wikidata.org/wiki/Q892.
Available at http://lunadong.com/fusionDataSets.htm.
https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.
https://github.com/daqcri/DAFNA-EA.

Author information

Authors and Affiliations

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via G. Ponzio 34/5, I-20133, Milano, Italy
Fabio Azzalini, Davide Piantella & Letizia Tanca
Center for Health Data Science, Human Technopole, Viale R. Levi-Montalcini 1, I-20157, Milano, Italy
Fabio Azzalini & Emanuele Rabosio

Corresponding author

Correspondence to Emanuele Rabosio.

Ethics declarations

Competing Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Azzalini, F., Piantella, D., Rabosio, E. et al. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal 32, 475–500 (2023). https://doi.org/10.1007/s00778-022-00757-x

Received: 29 November 2021
Revised: 18 June 2022
Accepted: 21 June 2022
Published: 19 July 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s00778-022-00757-x
