Research article
DOI: 10.1145/2588555.2588576

Corleone: hands-off crowdsourcing for entity matching

Published: 18 June 2014

ABSTRACT

Recent approaches to crowdsourcing entity matching (EM) are limited in that they crowdsource only parts of the EM workflow, requiring a developer to execute the remaining parts. Consequently, these approaches do not scale to the growing need for EM at enterprises and crowdsourcing startups, and cannot handle scenarios where ordinary users (i.e., the masses) want to leverage crowdsourcing to match entities. In response, we propose the notion of hands-off crowdsourcing (HOC), which crowdsources the entire workflow of a task, thus requiring no developers. We show how HOC can represent a next logical direction for crowdsourcing research, scale up EM at enterprises and crowdsourcing startups, and open up crowdsourcing for the masses. We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. Finally, we discuss the implications of our work for executing crowdsourced RDBMS joins, cleaning learning models, and soliciting complex information types from crowd workers.
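To make the hands-off idea concrete, the following is a minimal illustrative sketch of one step the abstract describes: training a matcher with the crowd as the only labeler. It assumes a random-forest matcher (scikit-learn) refined by uncertainty-based active learning over candidate pairs; featurize (similarity features for a record pair) and ask_crowd (a stand-in for posting a labeling task to a platform such as Amazon Mechanical Turk) are hypothetical helpers, not Corleone's actual API, and the paper's real algorithms differ in the details.

    # Illustrative sketch only; not the paper's actual algorithm.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def crowd_train_matcher(pairs, featurize, ask_crowd, seed_labels,
                            rounds=10, batch=20):
        # pairs: list of (record_a, record_b) candidate pairs (post-blocking).
        # seed_labels: {pair_index: 0/1} initial crowd labels; must contain
        # at least one match and one non-match so the forest sees two classes.
        X = np.array([featurize(a, b) for a, b in pairs])
        labels = dict(seed_labels)
        forest = None
        for _ in range(rounds):
            idx = sorted(labels)
            forest = RandomForestClassifier(n_estimators=50, random_state=0)
            forest.fit(X[idx], [labels[i] for i in idx])
            # Pick the pairs the forest is least sure about (P(match) near 0.5)
            # and send only those to the crowd for labeling.
            p_match = forest.predict_proba(X)[:, 1]
            unsure = np.argsort(np.abs(p_match - 0.5))
            to_ask = [i for i in unsure if i not in labels][:batch]
            if not to_ask:
                break
            for i in to_ask:
                labels[i] = ask_crowd(pairs[i])  # e.g., majority vote of workers
        return forest  # apply forest.predict(X) to label all candidate pairs

The full system also uses the crowd for the other major EM steps, such as learning blocking rules and estimating the trained matcher's accuracy, which this sketch omits.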


Published in

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014, 1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555
Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '14 paper acceptance rate: 107 of 421 submissions, 25%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
