Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13261)

Abstract

Entity matching aims at identifying records in different data sources that describe the same real-world entity. Entity matching is the foundational technique for setting RDF links in the context of the Web of Data. By applying active learning methods for training entity matchers, it is possible to reduce the human labeling effort by selecting informative record pairs for labeling. Although active learning has been extensively studied for the two-source matching case, it was only recently applied to the task of matching records in multi-source settings, such as the Web of Data. A multi-source matching task has certain inherent characteristics which do not apply to two-source matching tasks and which can be exploited by the active learning query strategy to further reduce the labeling effort. In this paper, we propose a set of profiling dimensions which capture these inherent characteristics of multi-source matching tasks and study their impact on the performance of different active learning methods for training entity matchers. To enable our analysis, we develop ALMSERgen, a multi-source matching task generator, and curate a continuum of 252 matching tasks along the suggested profiling dimensions. We use the generated tasks as well as five benchmark tasks to compare the performance of three query strategies: a committee-based strategy, a graph-based strategy, and a strategy that exploits grouping signals. Our results show that graph signals are relevant for multi-source matching tasks involving a large number of records describing the same real-world entities with heterogeneous attribute values, while using grouping signals is beneficial if there exists a small number of groups of matching tasks sharing the same underlying patterns.
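
To make the idea of "selecting informative record pairs" concrete, the following is a minimal sketch of a generic committee-based query strategy (query-by-committee with vote entropy). It is not the method evaluated in the paper: the feature representation (a tuple of attribute similarities), the threshold-rule committee, and all function names are assumptions introduced purely for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumption: a record pair is represented by a vector of attribute similarities.
Pair = Sequence[float]
# Assumption: a committee member maps a pair to a vote (True = match).
Member = Callable[[Pair], bool]


def vote_entropy(votes: Sequence[bool]) -> float:
    """Binary entropy (in bits) of the committee's match/non-match votes."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0  # unanimous committee: no disagreement
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_query(pool: List[Pair], committee: List[Member]) -> Tuple[int, float]:
    """Return the index and disagreement score of the most contested pair."""
    scores = [vote_entropy([member(pair) for member in committee]) for pair in pool]
    best = max(range(len(pool)), key=scores.__getitem__)
    return best, scores[best]


def threshold_member(t: float) -> Member:
    """Hypothetical committee member: match if the mean similarity reaches t."""
    return lambda pair: sum(pair) / len(pair) >= t


# Toy pool of unlabeled record pairs and a four-member committee.
committee = [threshold_member(t) for t in (0.4, 0.5, 0.6, 0.7)]
pool = [(0.9, 0.95), (0.1, 0.2), (0.5, 0.6)]
idx, score = select_query(pool, committee)
print(f"query pair {idx} with vote entropy {score:.2f}")
```

In this toy pool the committee is unanimous on the first two pairs and evenly split on the third, so the third pair is the one sent to the human annotator. Graph-based or grouping-based strategies would replace or augment the disagreement score with signals that only exist in the multi-source setting.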

Notes

  1. https://github.com/wbsg-uni-mannheim/ALMSER-GEN.

  2. http://millionsongdataset.com/lastfm/.

Author information

Corresponding author

Correspondence to Anna Primpeli.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Primpeli, A., Bizer, C. (2022). Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods. In: Groth, P., et al. The Semantic Web. ESWC 2022. Lecture Notes in Computer Science, vol 13261. Springer, Cham. https://doi.org/10.1007/978-3-031-06981-9_7

  • DOI: https://doi.org/10.1007/978-3-031-06981-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06980-2

  • Online ISBN: 978-3-031-06981-9

  • eBook Packages: Computer Science (R0)
