Unsupervised Blocking Key Selection for Real-Time Entity Resolution

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Abstract

Real-time entity resolution (ER) is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing is a major step in the ER process, aimed at reducing the search space by bringing similar records closer to each other using a blocking key criterion. Selecting these keys is crucial for the effectiveness and efficiency of the real-time ER process. Traditional indexing techniques require domain knowledge for optimal key selection. However, to make the ER process less dependent on human domain knowledge, automatic selection of optimal blocking keys is required. In this paper we propose an unsupervised learning technique that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. We specifically learn multiple keys to be used with multi-pass sorted neighbourhood, one of the most efficient and widely used indexing techniques for ER. We evaluate the proposed approach using three real-world data sets, and compare it with an existing automatic blocking key selection technique. The results show that our approach learns optimal blocking/sorting keys that are suitable for real-time ER. The learnt keys significantly increase the efficiency of query matching while maintaining the quality of matching results.
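
To make the indexing step concrete, the following is a minimal sketch of multi-pass sorted neighbourhood candidate generation, the indexing technique named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the toy records, the two sorting keys, the window size, and the function names sorted_neighbourhood_pairs and multi_pass_pairs are all hypothetical, and the sketch does not show how the keys are learnt.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(records, key_func, window=3):
    """Return candidate record-id pairs from one sorted neighbourhood pass."""
    # Sort record ids by the value of the sorting key for each record.
    ordered = sorted(records, key=lambda rid: key_func(records[rid]))
    pairs = set()
    # Slide a fixed-size window over the sorted ids; only records that
    # fall inside the same window are compared later on.
    for start in range(max(1, len(ordered) - window + 1)):
        for a, b in combinations(ordered[start:start + window], 2):
            pairs.add((min(a, b), max(a, b)))
    return pairs


def multi_pass_pairs(records, key_funcs, window=3):
    """Union the candidate pairs of one pass per sorting key."""
    pairs = set()
    for key_func in key_funcs:
        pairs |= sorted_neighbourhood_pairs(records, key_func, window)
    return pairs


# Hypothetical toy data; records 1, 2 and 4 are meant to be the same person.
records = {
    1: {"first": "peter", "last": "smith", "city": "sydney"},
    2: {"first": "pete", "last": "smith", "city": "sydney"},
    3: {"first": "anna", "last": "jones", "city": "perth"},
    4: {"first": "peter", "last": "smyth", "city": "sydney"},
    5: {"first": "zoe", "last": "adams", "city": "darwin"},
}

# Two hand-picked sorting keys (assumptions for illustration only).
keys = [
    lambda r: r["last"] + r["first"],
    lambda r: r["city"] + r["first"][:2],
]

candidates = multi_pass_pairs(records, keys, window=2)
print(sorted(candidates))
```

With a window of 2, the first key alone never places records 2 and 4 next to each other; the second pass adds that pair. This is the motivation for using several keys, and the total number of candidate pairs still stays below the 10 pairs a full comparison of five records would need.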

This research was funded by the Australian Research Council (ARC), Veda, and Funnelback Pty. Ltd., under Linkage Project LP100200079.

Author information

Corresponding author

Correspondence to Banda Ramadan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ramadan, B., Christen, P. (2015). Unsupervised Blocking Key Selection for Real-Time Entity Resolution. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_45

  • DOI: https://doi.org/10.1007/978-3-319-18032-8_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18031-1

  • Online ISBN: 978-3-319-18032-8

  • eBook Packages: Computer Science, Computer Science (R0)
