DOI: 10.1145/2396761.2398719
poster

Fast and accurate incremental entity resolution relative to an entity knowledge base

Published: 29 October 2012

Abstract

User-facing topical web applications such as events or shopping sites rely on large collections of data records about real-world entities that are updated at varying latencies ranging from days to seconds. For example, event venue details change relatively infrequently, whereas ticket pricing and availability for an event are often updated in near real time. Users regard these sites as high quality if they seldom show duplicates, their URLs are stable, and their content is fresh, so it is important to resolve duplicate entity records with high quality and low latency. High-quality entity resolution typically evaluates the entire record corpus for similar record clusters at the cost of latency, while low-latency resolution examines the fewest possible entities to keep time to a minimum, even at the cost of quality. In this paper we show how to keep latency low while achieving high quality, combining the best of both approaches: given an entity to be resolved, our incremental Fastpath system makes, in a matter of milliseconds, approximately the same decisions that the underlying batch system would have made. Our experiments show that the Fastpath system makes matching decisions for previously unseen entities with 90% precision and 98% recall relative to batch decisions, with latencies under 20ms on commodity hardware.
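To make "precision and recall relative to batch decisions" concrete, the following is a minimal sketch of how such agreement could be measured. It is not the paper's evaluation code: the representation of a decision as a (record id, knowledge-base entity id) pair, the function name `precision_recall`, and the sample ids are all assumptions introduced for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: one match decision pairs an incoming
# record id with the knowledge-base entity id it was resolved to.
Match = Tuple[str, str]


def precision_recall(incremental: Set[Match], batch: Set[Match]) -> Tuple[float, float]:
    """Score incremental decisions, treating batch decisions as the reference."""
    agreed = incremental & batch  # decisions both systems made identically
    # Precision: share of incremental decisions that the batch system also made.
    precision = len(agreed) / len(incremental) if incremental else 1.0
    # Recall: share of batch decisions that the incremental system reproduced.
    recall = len(agreed) / len(batch) if batch else 1.0
    return precision, recall


# Made-up example data, not results from the paper.
batch_matches = {("r1", "e10"), ("r2", "e11"), ("r3", "e12"), ("r4", "e13")}
incremental_matches = {("r1", "e10"), ("r2", "e11"), ("r3", "e12"), ("r5", "e99")}

p, r = precision_recall(incremental_matches, batch_matches)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this reading, the batch output is the reference rather than a hand-labeled ground truth, so the scores describe how closely the incremental path tracks the batch system, not absolute correctness.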



    Published In

    CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
    October 2012
    2840 pages
    ISBN:9781450311564
    DOI:10.1145/2396761
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deduplication
    2. entity resolution
    3. knowledge base

    Qualifiers

    • Poster

    Conference

    CIKM'12

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2021) Parallel discrepancy detection and incremental detection. Proceedings of the VLDB Endowment 14:8 (1351-1364). DOI: 10.14778/3457390.3457400. Online publication date: 21-Oct-2021
    • (2021) Deep Entity Matching. Journal of Data and Information Quality 13:1 (1-17). DOI: 10.1145/3431816. Online publication date: 6-Jan-2021
    • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6 (1-42). DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
    • (2020) Incremental Multi-source Entity Resolution for Knowledge Graph Completion. The Semantic Web (393-408). DOI: 10.1007/978-3-030-49461-2_23. Online publication date: 27-May-2020
    • (2018) Incremental Clustering on Linked Data. 2018 IEEE International Conference on Data Mining Workshops (ICDMW) (531-538). DOI: 10.1109/ICDMW.2018.00084. Online publication date: Nov-2018
    • (2015) Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology 5:3 (1-122). DOI: 10.2200/S00655ED1V01Y201507WBE013. Online publication date: 7-Aug-2015
    • (2015) On-the-fly entity resolution from distributed social media sources for mobile search and exploration. Proceedings of the 14th International Conference on Mobile and Ubiquitous Multimedia (14-24). DOI: 10.1145/2836041.2836043. Online publication date: 30-Nov-2015
    • (2015) Query-time record linkage and fusion over Web databases. 2015 IEEE 31st International Conference on Data Engineering (42-53). DOI: 10.1109/ICDE.2015.7113271. Online publication date: Apr-2015
    • (2013) WOO. Proceedings of the VLDB Endowment 6:11 (1114-1125). DOI: 10.14778/2536222.2536236. Online publication date: 1-Aug-2013
