Research article
DOI: 10.1145/2588555.2588576

Corleone: hands-off crowdsourcing for entity matching

Published: 18 June 2014

ABSTRACT

Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing need for EM at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.
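To make the hands-off idea concrete, the following is a minimal illustrative sketch of one step the abstract describes: training a matcher with the crowd as the only labeler. It assumes a random-forest matcher (scikit-learn) refined by uncertainty-based active learning over candidate pairs; featurize (similarity features for a record pair) and ask_crowd (a stand-in for posting a labeling task to a platform such as Amazon Mechanical Turk) are hypothetical helpers, not Corleone's actual API, and the paper's real algorithms differ in the details.

    # Illustrative sketch only; not the paper's actual algorithm.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def crowd_train_matcher(pairs, featurize, ask_crowd, seed_labels,
                            rounds=10, batch=20):
        # pairs: list of (record_a, record_b) candidate pairs (post-blocking).
        # seed_labels: {pair_index: 0/1} initial crowd labels; must contain
        # at least one match and one non-match so the forest sees two classes.
        X = np.array([featurize(a, b) for a, b in pairs])
        labels = dict(seed_labels)
        forest = None
        for _ in range(rounds):
            idx = sorted(labels)
            forest = RandomForestClassifier(n_estimators=50, random_state=0)
            forest.fit(X[idx], [labels[i] for i in idx])
            # Pick the pairs the forest is least sure about (P(match) near 0.5)
            # and send only those to the crowd for labeling.
            p_match = forest.predict_proba(X)[:, 1]
            unsure = np.argsort(np.abs(p_match - 0.5))
            to_ask = [i for i in unsure if i not in labels][:batch]
            if not to_ask:
                break
            for i in to_ask:
                labels[i] = ask_crowd(pairs[i])  # e.g., majority vote of workers
        return forest  # apply forest.predict(X) to label all candidate pairs

The full system also uses the crowd for the other major EM steps, such as learning blocking rules and estimating the trained matcher's accuracy, which this sketch omits.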


Published in

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014, 1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555
Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '14 paper acceptance rate: 107 of 421 submissions, 25%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
