Unsupervised Blocking of Imbalanced Datasets for Record Matching

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10042)

Abstract

Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. Solutions for record matching usually employ learning algorithms to train a classifier that labels record pairs as either matches or non-matches. In practice, the number of non-matches typically far exceeds the number of matches. This is the so-called imbalance problem, which notoriously increases the difficulty of acquiring a representative dataset for classifier training. Various blocking techniques have been proposed to alleviate this problem, but most of them rely heavily on the effort of human experts. In this paper, we propose an unsupervised blocking method that aims to automate blocking. To demonstrate its effectiveness, we evaluated our method on real-world datasets. The results show that our method significantly outperforms its competitors.
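
To make the imbalance problem concrete, the sketch below counts candidate record pairs with and without a very simple key-based blocking step. It is only an illustration under stated assumptions: the toy records, the `first_letter` blocking key, and the helper names are hypothetical, and this is a generic hand-picked blocking key, not the unsupervised method proposed in the paper.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records (record id, name); not data from the paper.
source_a = [(1, "acme corp"), (2, "beta llc"), (3, "gamma inc"), (4, "delta co")]
source_b = [(10, "acme corporation"), (11, "beta l.l.c."),
            (12, "omega ltd"), (13, "gamma incorporated")]


def all_pairs(a, b):
    """Every cross-source pair: |a| * |b| candidates, mostly non-matches."""
    return list(product(a, b))


def first_letter(record):
    """Assumed blocking key: the first character of the name field."""
    return record[1][0]


def block_by_key(a, b, key):
    """Keep only pairs whose records share the same blocking key."""
    buckets = defaultdict(lambda: ([], []))
    for record in a:
        buckets[key(record)][0].append(record)
    for record in b:
        buckets[key(record)][1].append(record)
    pairs = []
    for left, right in buckets.values():
        pairs.extend(product(left, right))
    return pairs


naive = all_pairs(source_a, source_b)
blocked = block_by_key(source_a, source_b, first_letter)
print(len(naive), len(blocked))
```

With these toy inputs, only three of the sixteen cross-source pairs plausibly refer to the same entity, so the unblocked pair set is dominated by non-matches. A hand-chosen key like this one is exactly the kind of expert effort the abstract says most blocking techniques depend on.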

Notes

  1. An R-tree is a tree data structure for indexing multi-dimensional information [10].

References

  1. http://archive.ics.uci.edu/ml/

  2. http://dbs.uni-leipzig.de/en/research/projects/object_matching

  3. Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: Proceedings of ACM SIGMOD International Conference on Management of data, pp. 783–794. ACM (2010)

  4. Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: 6th International Conference on Data Mining, ICDM 2006, pp. 87–96. IEEE (2006)

  5. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: KDD, pp. 39–48 (2003)

  6. Chaudhuri, S., Chen, B.-C., Ganti, V., Kaushik, R.: Example-driven design of efficient record matching queries. In: Proceedings of 33rd International Conference on Very Large Data Bases, pp. 327–338. VLDB Endowment (2007)

  7. Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. (TOIS) 18(3), 288–321 (2000)

  8. Dalvi, N.N., Rastogi, V., Dasgupta, A., Sarma, A.D., Sarlós, T.: Optimal hashing schemes for entity matching. In: WWW, pp. 295–306 (2013)

  9. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  10. Guttman, A.: R-trees: a dynamic index structure for spatial searching. ACM SIGMOD Rec. 14, 47–57 (1984). ACM

  11. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

  12. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. ACM (2000)

  13. Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: Proceedings of National Conference on Artificial Intelligence, vol. 21, p. 440. AAAI Press, MIT Press, Menlo Park, London (2006)

  14. Newcombe, H.B.: Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford University Press Inc., Oxford (1988)

  15. Shu, L., Chen, A., Xiong, M., Meng, W.: Efficient spectral neighborhood blocking for entity resolution. In: IEEE 27th International Conference on Data Engineering (ICDE), pp. 1067–1078. IEEE (2011)

  16. Tejada, S., Knoblock, C.A., Minton, S.: Learning domain-independent string transformation weights for high accuracy object identification. In: Proceedings of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 350–359. ACM (2002)

  17. Whang, S.E., Garcia-Molina, H.: Incremental entity resolution on rules and data. VLDB J. 23(1), 77–102 (2014)

  18. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: SIGMOD Conference, pp. 219–232 (2009)

Corresponding author

Correspondence to Chenxiao Dou.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Dou, C., Sun, D., Wong, R.K. (2016). Unsupervised Blocking of Imbalanced Datasets for Record Matching. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science, vol 10042. Springer, Cham. https://doi.org/10.1007/978-3-319-48743-4_14

  • DOI: https://doi.org/10.1007/978-3-319-48743-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48742-7

  • Online ISBN: 978-3-319-48743-4

  • eBook Packages: Computer Science, Computer Science (R0)
