research-article

Attribute-based Crowd Entity Resolution

Authors:
Asif R. Khan

Stanford University, Stanford, CA, USA

Stanford University, Stanford, CA, USA
View Profile

,
Hector Garcia-Molina

Stanford University, Stanford, CA, USA

Stanford University, Stanford, CA, USA
View Profile

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementOctober 2016Pages 549–558https://doi.org/10.1145/2983323.2983831

Published:24 October 2016Publication History

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Pages 549–558

ABSTRACT

We study the problem of using the crowd to perform entity resolution (ER) on a set of records. For many types of records, especially those involving images, such a task can be difficult for machines, but relatively easy for humans. Typical crowd-based ER approaches ask workers for pairwise judgments between records, which quickly becomes prohibitively expensive even for moderate numbers of records. In this paper, we reduce the cost of pairwise crowd ER approaches by soliciting the crowd for attribute labels on records, and then asking for pairwise judgments only between records with similar sets of attribute labels. However, due to errors induced by crowd-based attribute labeling, a naive attribute-based approach becomes extremely inaccurate even with few attributes. To combat these errors, we use error mitigation strategies which allow us to control the accuracy of our results while maintaining significant cost reductions. We develop a probabilistic model which allows us to determine the optimal, lowest-cost combination of error mitigation strategies needed to achieve a minimum desired accuracy. We test our approach with actual crowdworkers on a dataset of celebrity images, and find that our results yield crowd ER strategies which achieve high accuracy yet are significantly lower cost than pairwise-only approaches.

References

N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008. Google ScholarDigital Library
I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000. Google ScholarDigital Library
G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pages 469--478. ACM, 2012. Google ScholarDigital Library
A. Feng, M. Franklin, D. Kossmann, T. Kraska, S. R. Madden, S. Ramesh, A. Wang, and R. Xin. Crowddb: Query processing with the vldb crowd. 2011.Google ScholarDigital Library
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 61--72. ACM, 2011. Google ScholarDigital Library
L. Getoor and A. Machanavajjhala. Entity resolution: tutorial. VLDB, 2012.Google Scholar
A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518--529, 1999. Google ScholarDigital Library
A. Gruenheid, D. Kossmann, S. Ramesh, and F. Widmer. Technical report nr. 785. 2012.Google Scholar
N. Q. V. Hung, N. T. Tam, L. N. Tran, and K. Aberer. An evaluation of aggregation techniques in crowdsourcing. In Web Information Systems Engineering--WISE 2013, pages 1--15. Springer, 2013.Google ScholarCross Ref
A. R. Khan. Celebrity Image Dataset for Crowd ER. http://www.stanford.edu/ asifk/attributes/datasets.html, 2016.Google Scholar
A. R. Khan and H. Garcia-Molina. Attribute-based crowd entity resolution: Technical report. http://ilpubs.stanford.edu:8090/1136/, March 2016.Google Scholar
X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. Cdas: a crowdsourcing data analytics system. Proceedings of the VLDB Endowment, 5(10):1040--1051, 2012. Google ScholarDigital Library
A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1):13--24, 2011. Google ScholarDigital Library
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169--178. ACM, 2000. Google ScholarDigital Library
H. B. Newcombe. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press, Inc., 1988. Google ScholarDigital Library
H. B. Newcombe and J. M. Kennedy. Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM, 5(11):563--566, 1962. Google ScholarDigital Library
A. Parameswaran and N. Polyzotis. Answering queries using humans, algorithms and databases. 2011.Google Scholar
H. Park, H. Garcia-Molina, R. Pang, N. Polyzotis, A. Parameswaran, and J. Widom. Deco: A system for declarative crowdsourcing. Proceedings of the VLDB Endowment, 5(12):1990--1993, 2012. Google ScholarDigital Library
A. Rajaraman, J. D. Ullman, J. D. Ullman, and J. D. Ullman. Mining of massive datasets, volume 77. Cambridge University Press Cambridge, 2012. Google ScholarDigital Library
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211--252, 2015. Google ScholarDigital Library
C. Sarasua, E. Simperl, and N. F. Noy. Crowdmap: Crowdsourcing ontology alignment with microtasks. In The Semantic Web--ISWC 2012, pages 525--541. Springer, 2012. Google ScholarDigital Library
V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614--622. ACM, 2008. Google ScholarDigital Library
V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. 2015.Google ScholarCross Ref
N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment, 7(12):1071--1082, 2014. Google ScholarDigital Library
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483--1494, 2012. Google ScholarDigital Library
J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 229--240. ACM, 2013. Google ScholarDigital Library
S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. Proceedings of the VLDB Endowment, 6(6):349--360, 2013. Google ScholarDigital Library
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 219--232. ACM, 2009. Google ScholarDigital Library

Index Terms

Attribute-based Crowd Entity Resolution
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution
      2. Extraction, transformation and loading
  2. World Wide Web
    1. Web applications
      1. Crowdsourcing
        Answer ranking

Recommendations

Robust Entity Resolution using Random Graphs
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched and distinguished, ER is known to be a challenging task for automated ...
Read More
Collective entity resolution in relational data

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data ...
Read More
A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education
Special Issue on Entity Resolution

This article discusses the topics, approaches, and lessons learned in teaching a graduate-level course covering entity resolution (ER) and its relationship to information quality (IQ). The course surveys a broad spectrum of ER topics and activities ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
General Chairs:
Snehasis Mukhopadhyay
Indiana University Purdue University Indianapolis, USA
,
ChengXiang Zhai
University of Illinois at Urbana-Champaign, USA
,
Program Chairs:
Elisa Bertino
Purdue University
,
Fabio Crestani
University of Lugano
,
Javed Mostafa
University of North Carolina
,
Jie Tang
Tsinghua University
,
Luo Si
Alibaba Group Inc & Purdue University
,
Xiaofang Zhou
University of Queensland
,
Yi Chang
Yahoo Research
,
Yunyao Li
IBM Research - Almaden
,
Parikshit Sondhi
WalmartLabs
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 October 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
crowd
crowd computation
crowdsourcing
deduplication
entity resolution
record linkage
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '16 Paper Acceptance Rate160of701submissions,23%Overall Acceptance Rate1,861of8,427submissions,22%
More
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 256
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Attribute-based Crowd Entity Resolution

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Robust Entity Resolution using Random Graphs

Collective entity resolution in relational data

A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education