
Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER)

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11695)


Abstract

Entity resolution identifies records that refer to the same real-world entity. Its classification step can use supervised learning, but labeled training data is often scarce. Active learning addresses this by selecting the most informative data as candidates for labeling, which gathers labels while reducing the human labeling effort. Committee-based active learning is one of the most commonly used approaches: it selects the data on which the committee's votes disagree most and treats these as the most informative. However, current state-of-the-art committee-based active learning approaches for entity resolution have two main drawbacks. First, the selected initial training data is usually not balanced and informative enough. Second, the committee is formed from homogeneous classifiers whose accuracy is compromised to achieve diversity, i.e., the classifiers are not trained with all available training data or with the best parameter setting. In this paper, we propose the committee-based active learning approach HeALER, which overcomes both drawbacks by using more effective initial training data selection approaches and a more effective heterogeneous committee. We implemented HeALER and compared it with passive learning and other state-of-the-art approaches. The experimental results show that our approach outperforms other state-of-the-art committee-based active learning approaches.
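The disagreement-based selection described above can be illustrated with a minimal sketch. It is not the HeALER implementation: it assumes vote entropy as the disagreement measure, and the names `vote_entropy`, `select_for_labeling`, the three threshold rules, and the example similarity vectors are all hypothetical stand-ins for trained classifiers of different types and real candidate pairs.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A candidate record pair, represented by a vector of similarity scores in [0, 1].
Pair = Tuple[float, ...]
# A committee member maps a pair to a label: 1 = match, 0 = non-match.
Classifier = Callable[[Pair], int]


def vote_entropy(votes: Sequence[int]) -> float:
    """Entropy (in bits) of the vote distribution; 0.0 means full agreement."""
    total = len(votes)
    counts = Counter(votes)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def select_for_labeling(pool: List[Pair], committee: List[Classifier], k: int) -> List[Pair]:
    """Return the k unlabeled pairs on which the committee disagrees most."""
    scored = [(vote_entropy([clf(p) for clf in committee]), p) for p in pool]
    scored.sort(key=lambda item: item[0], reverse=True)  # stable: ties keep pool order
    return [p for _, p in scored[:k]]


# Hypothetical decision rules standing in for trained, heterogeneous classifiers.
committee: List[Classifier] = [
    lambda p: int(sum(p) / len(p) >= 0.5),  # mean-similarity threshold
    lambda p: int(min(p) >= 0.4),           # strict: every attribute must be similar
    lambda p: int(max(p) >= 0.8),           # lenient: one strong attribute suffices
]

# Made-up similarity vectors for three candidate pairs.
pool: List[Pair] = [
    (0.95, 0.90),  # all members vote "match"
    (0.10, 0.05),  # all members vote "non-match"
    (0.85, 0.20),  # members split 2 to 1
]

queries = select_for_labeling(pool, committee, k=1)
print(queries)
```

Here the only pair that splits the committee is the third one, so it is the one sent to the human labeler; in an actual active learning loop, the newly labeled pair would be added to the training data and the committee retrained before the next round.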

This work was partially funded by the DFG [grant no.: SA 465/50-1], China Scholarship Council [No. 201408080093] and Graduiertenförderung des Landes Sachsen-Anhalt.


Notes

  1. Implemented by the Debatty library (version 1.1.0).


Author information

Corresponding author

Correspondence to Xiao Chen.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, X., Xu, Y., Broneske, D., Durand, G.C., Zoun, R., Saake, G. (2019). Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER). In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_5


  • DOI: https://doi.org/10.1007/978-3-030-28730-6_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28729-0

  • Online ISBN: 978-3-030-28730-6

  • eBook Packages: Computer Science, Computer Science (R0)
