Abstract
Entity matching aims at identifying records in different data sources that describe the same real-world entity. Entity matching is the foundational technique for setting RDF links in the context of the Web of Data. By applying active learning methods for training entity matchers, it is possible to reduce the human labeling effort by selecting informative record pairs for labeling. Although active learning has been extensively studied for the two-source matching case, it has only recently been applied to the task of matching records in multi-source settings, such as the Web of Data. A multi-source matching task has certain inherent characteristics which do not apply to two-source matching tasks and which can be exploited by the active learning query strategy to further reduce the labeling effort. In this paper, we propose a set of profiling dimensions which capture these inherent characteristics of multi-source matching tasks and study their impact on the performance of different active learning methods for training entity matchers. To enable our analysis, we develop ALMSERgen, a multi-source matching task generator, and curate a continuum of 252 matching tasks along the suggested profiling dimensions. We use the generated tasks as well as five benchmark tasks to compare the performance of three query strategies: a committee-based strategy, a graph-based strategy, and a strategy that exploits grouping signals. Our results show that graph signals are relevant for multi-source matching tasks involving a large number of records describing the same real-world entities with heterogeneous attribute values, while using grouping signals is beneficial if there is a small number of groups of matching tasks sharing the same underlying patterns.
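To make the notion of a committee-based query strategy concrete, the following is a minimal sketch of generic query-by-committee selection with vote entropy. It is an illustration under stated assumptions, not the strategy evaluated in the paper: the names `vote_entropy` and `select_query`, the two similarity features, and the threshold-rule committee are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Pair = Tuple[float, ...]          # similarity features of one candidate record pair
Matcher = Callable[[Pair], bool]  # a committee member: True means "match"

def vote_entropy(votes: Sequence[bool]) -> float:
    """Binary entropy of the committee votes; 0 means full agreement."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_query(pool: Sequence[Pair], committee: Sequence[Matcher]) -> int:
    """Return the index of the unlabeled pair the committee disagrees on most."""
    scores = [vote_entropy([member(pair) for member in committee]) for pair in pool]
    return max(range(len(pool)), key=scores.__getitem__)

# Hypothetical committee: three threshold rules over (title_sim, price_sim).
committee = [
    lambda p: p[0] > 0.5,
    lambda p: p[1] > 0.5,
    lambda p: (p[0] + p[1]) / 2 > 0.7,
]

# Hypothetical unlabeled pool of candidate record pairs.
pool = [(0.9, 0.9), (0.1, 0.2), (0.8, 0.3)]
chosen = select_query(pool, committee)  # index of the pair to send to the annotator
```

In this toy pool the committee agrees on the first two pairs and splits on the third, so the third pair is the one a human would be asked to label. Graph or grouping signals, as studied in the paper, would enter as additional terms in the score; they are not modeled here.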
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Primpeli, A., Bizer, C. (2022). Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. In: Groth, P., et al. The Semantic Web. ESWC 2022. Lecture Notes in Computer Science, vol 13261. Springer, Cham. https://doi.org/10.1007/978-3-031-06981-9_7
DOI: https://doi.org/10.1007/978-3-031-06981-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06980-2
Online ISBN: 978-3-031-06981-9
eBook Packages: Computer Science, Computer Science (R0)