Research article
DOI: 10.1145/2983323.2983831

Attribute-based Crowd Entity Resolution

Published: 24 October 2016

ABSTRACT

We study the problem of using the crowd to perform entity resolution (ER) on a set of records. For many types of records, especially those involving images, such a task can be difficult for machines, but relatively easy for humans. Typical crowd-based ER approaches ask workers for pairwise judgments between records, which quickly becomes prohibitively expensive even for moderate numbers of records. In this paper, we reduce the cost of pairwise crowd ER approaches by soliciting the crowd for attribute labels on records, and then asking for pairwise judgments only between records with similar sets of attribute labels. However, due to errors induced by crowd-based attribute labeling, a naive attribute-based approach becomes extremely inaccurate even with few attributes. To combat these errors, we use error mitigation strategies that allow us to control the accuracy of our results while maintaining significant cost reductions. We develop a probabilistic model that lets us determine the optimal, lowest-cost combination of error mitigation strategies needed to achieve a minimum desired accuracy. We test our approach with actual crowd workers on a dataset of celebrity images, and find that it yields crowd ER strategies that achieve high accuracy at significantly lower cost than pairwise-only approaches.
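To make the pairing step concrete, below is a minimal Python sketch of attribute-based pair pruning: records that share too few crowd-supplied attribute labels are never sent to the crowd for a pairwise judgment. The function name candidate_pairs, the min_overlap threshold, and the toy celebrity-image labels are illustrative assumptions, not the paper's actual model; the full approach additionally accounts for labeling errors and selects error mitigation strategies, which this sketch omits.

from itertools import combinations

def candidate_pairs(records, min_overlap=2):
    """Return record-id pairs whose attribute-label sets share at least
    `min_overlap` labels; only these pairs would be sent to the crowd
    for a pairwise same-entity judgment."""
    pairs = []
    ids = list(records)
    for a, b in combinations(ids, 2):
        if len(records[a] & records[b]) >= min_overlap:
            pairs.append((a, b))
    return pairs

# Hypothetical image records with crowd-assigned attribute labels.
records = {
    "img1": {"male", "beard", "dark_hair"},
    "img2": {"male", "beard", "dark_hair"},
    "img3": {"female", "blonde_hair"},
    "img4": {"female", "blonde_hair", "glasses"},
}

all_pairs = len(records) * (len(records) - 1) // 2   # 6 pairwise tasks
pruned = candidate_pairs(records)                    # 2 pairwise tasks here
print(f"pairwise tasks: {all_pairs} -> {len(pruned)}")

On this toy data the number of crowd pairwise tasks drops from 6 (all pairs) to 2; that reduction, traded off against the cost and errors of the extra attribute-labeling questions, is the saving the attribute-labeling stage is meant to buy.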

Published in

CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN: 978-1-4503-4073-1
DOI: 10.1145/2983323

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

          Acceptance Rates

CIKM '16 paper acceptance rate: 160 of 701 submissions (23%). Overall acceptance rate: 1,861 of 8,427 submissions (22%).
