
When in Doubt Ask the Crowd: Employing Crowdsourcing for Active Learning

Published: 02 June 2014

Abstract

Crowdsourcing has become ubiquitous in machine learning as a cost-effective method for gathering training labels. In this paper we examine the challenges that arise when employing crowdsourcing for active learning, in an integrated environment where an automatic method and human labelers work together to improve their performance at a given task. By applying active learning techniques to crowd-labeled data, we optimize the performance of the automatic method towards better accuracy, while keeping costs low by gathering data on demand. To verify our proposed methods, we apply them to the task of deduplicating publications in a digital library by examining their metadata. We investigate the problems created by the noisy labels produced by the crowd and explore methods for aggregating them. We analyze how different automatic methods are affected by the quantity and quality of the allocated resources, as well as by the instance selection strategies used in each active learning round, aiming to strike a balance between cost and performance.
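
The loop below is a minimal sketch of the general pattern the abstract describes: redundant crowd labels are aggregated into a single training label, a classifier is retrained, and the candidate pairs it is least certain about are sent back to the crowd in the next round. It is not the paper's implementation; the helper get_crowd_labels, the feature representation of publication pairs, and the choice of logistic regression with margin-based uncertainty sampling are illustrative assumptions.

# Minimal sketch of a crowd-backed active-learning loop; NOT the paper's
# implementation. get_crowd_labels and the feature layout are hypothetical.
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

def majority_vote(crowd_labels):
    """Aggregate redundant, possibly noisy worker votes into one label per item."""
    return [Counter(votes).most_common(1)[0][0] for votes in crowd_labels]

def uncertainty_sampling(model, X_pool, batch_size):
    """Select the pool items the current model is least certain about."""
    proba = model.predict_proba(X_pool)
    margin = np.abs(proba[:, 1] - 0.5)      # distance from the decision boundary
    return np.argsort(margin)[:batch_size]  # smallest margin = most uncertain

def active_learning_round(model, X_labeled, y_labeled, X_pool, get_crowd_labels,
                          batch_size=50):
    """One round: train, pick uncertain pairs, query the crowd, aggregate labels."""
    model.fit(X_labeled, y_labeled)
    query_idx = uncertainty_sampling(model, X_pool, batch_size)
    # get_crowd_labels would post the selected items to a crowdsourcing platform
    # and return a list of redundant worker votes per item.
    new_labels = majority_vote(get_crowd_labels(query_idx))
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled, X_pool

# Usage: X_pool would hold similarity features for candidate publication pairs
# (e.g. title/author/venue similarities); labels are 1 for duplicate, 0 otherwise.
# model = LogisticRegression()
# for _ in range(num_rounds):
#     model, X_l, y_l, X_pool = active_learning_round(model, X_l, y_l, X_pool,
#                                                     get_crowd_labels)

Majority voting is only the simplest stand-in for the label-aggregation step; weighted schemes that estimate per-worker reliability (for example Dawid-Skene-style EM) are common alternatives for the noisy-label problem the abstract raises.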


Cited By

  • (2019) "What You Sow, So Shall You Reap! Toward Preselection Mechanisms for Macrotask Crowdsourcing." In Macrotask Crowdsourcing, pp. 163-188. DOI: 10.1007/978-3-030-12334-5_6. Online publication date: 7-Aug-2019.
  • (2016) "Crowdlearning: A framework for collaborative and personalized learning." 2016 IEEE Frontiers in Education Conference (FIE), pp. 1-9. DOI: 10.1109/FIE.2016.7757355. Online publication date: Oct-2016.
  • (2014) "Profiling Flood Risk through Crowdsourced Flood Level Reports." 2014 International Conference on IT Convergence and Security (ICITCS), pp. 1-4. DOI: 10.1109/ICITCS.2014.7021800. Online publication date: Oct-2014.

      Published In

      WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)
      June 2014
      506 pages
      ISBN:9781450325387
      DOI:10.1145/2611040

      In-Cooperation

      • Aristotle University of Thessaloniki

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Active Learning
      2. Crowdsourcing
      3. Human Computation
      4. Machine Learning

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WIMS '14

      Acceptance Rates

      WIMS '14 Paper Acceptance Rate 41 of 90 submissions, 46%;
      Overall Acceptance Rate 140 of 278 submissions, 50%
