Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods

Primpeli, Anna; Bizer, Christian

doi:10.1007/978-3-031-06981-9_7

Conference paper
First Online: 31 May 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13261)

Abstract

Entity matching aims at identifying records in different data sources that describe the same real-world entity. It is the foundational technique for setting RDF links in the context of the Web of Data. Applying active learning methods to train entity matchers reduces the human labeling effort by selecting informative record pairs for labeling. Although active learning has been extensively studied for the two-source matching case, it has only recently been applied to matching records in multi-source settings, such as the Web of Data. A multi-source matching task has certain inherent characteristics which do not apply to two-source matching tasks and which can be exploited by the active learning query strategy to further reduce the labeling effort. In this paper, we propose a set of profiling dimensions which capture these inherent characteristics of multi-source matching tasks and study their impact on the performance of different active learning methods for training entity matchers. To enable our analysis, we develop ALMSERgen, a multi-source matching task generator, and curate a continuum of 252 matching tasks along the suggested profiling dimensions. We use the generated tasks as well as five benchmark tasks to compare the performance of three query strategies: a committee-based strategy, a graph-based strategy, and a strategy that exploits grouping signals. Our results show that graph signals are relevant for multi-source matching tasks involving a large number of records describing the same real-world entities with heterogeneous attribute values, while using grouping signals is beneficial if there exists a small number of groups of matching tasks sharing the same underlying patterns.
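
To make the idea of a committee-based query strategy concrete, the following is a minimal sketch of generic query-by-committee selection with vote entropy. It is not the implementation evaluated in the paper: the names `vote_entropy` and `select_query`, the threshold-rule committee, and the two-feature similarity vectors are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence

# A record pair is represented by a vector of similarity scores (assumption).
Pair = Sequence[float]
# A committee member maps a record pair to a match / non-match vote (assumption).
Matcher = Callable[[Pair], bool]


def vote_entropy(votes: List[bool]) -> float:
    """Binary entropy of the committee votes; 0.0 means full agreement."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_query(pool: List[Pair], committee: List[Matcher]) -> int:
    """Return the index of the unlabeled pair the committee disagrees on most."""
    scores = [vote_entropy([member(pair) for member in committee]) for pair in pool]
    return max(range(len(pool)), key=scores.__getitem__)


# Hypothetical committee of simple threshold rules over two similarity features.
committee: List[Matcher] = [
    lambda p: p[0] >= 0.5,
    lambda p: p[1] >= 0.5,
    lambda p: (p[0] + p[1]) / 2 >= 0.7,
]

# Hypothetical pool of unlabeled record pairs.
pool: List[Pair] = [(0.9, 0.95), (0.1, 0.2), (0.8, 0.3)]

# The third pair splits the committee, so it is selected for labeling.
query_index = select_query(pool, committee)
```

In an active learning loop, the selected pair would be shown to a human annotator, its label added to the training set, and the committee retrained before the next selection. Graph-based and grouping-based strategies would replace or complement the disagreement score with signals that only exist in the multi-source setting.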

Author information

Authors and Affiliations

Data and Web Science Group, University of Mannheim, Mannheim, Germany
Anna Primpeli & Christian Bizer

Authors

Anna Primpeli
Christian Bizer

Corresponding author

Correspondence to Anna Primpeli.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, Noord-Holland, The Netherlands
Paul Groth
Universidad Simón Bolívar, Leibniz Information Centre for Science and Technology, Hannover, Niedersachsen, Germany
Maria-Esther Vidal
Institut Polytechnique de Paris "DIG", Télécom ParisTech, Palaiseau, France
Fabian Suchanek
University of Southern California, Marina del Rey, CA, USA
Pedro Szekely
IBM Research - Thomas J. Watson Research, Yorktown Heights, NY, USA
Pavan Kapanipathi
LaSIGE, Fac de Ciencias, Edif C6, Piso 3, Universidade de Lisboa, Lisbon, Portugal
Catia Pesquita
University of Nantes, Nantes, France
Hala Skaf-Molli
Aalto University, Espoo, Finland
Minna Tamper

About this paper

Cite this paper

Primpeli, A., Bizer, C. (2022). Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. In: Groth, P., et al. The Semantic Web. ESWC 2022. Lecture Notes in Computer Science, vol 13261. Springer, Cham. https://doi.org/10.1007/978-3-031-06981-9_7

DOI: https://doi.org/10.1007/978-3-031-06981-9_7
Published: 31 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06980-2
Online ISBN: 978-3-031-06981-9
eBook Packages: Computer Science, Computer Science (R0)