
When in Doubt Ask the Crowd: Employing Crowdsourcing for Active Learning

Published: 02 June 2014

Abstract

Crowdsourcing has become ubiquitous in machine learning as a cost-effective method for gathering training labels. In this paper we examine the challenges that arise when employing crowdsourcing for active learning, in an integrated environment where an automatic method and human labelers work together to improve their performance at a given task. By applying active learning techniques to crowd-labeled data, we optimize the performance of the automatic method towards better accuracy, while keeping costs low by gathering data on demand. To verify our proposed methods, we apply them to the task of deduplicating publications in a digital library by examining their metadata. We investigate the problems created by the noisy labels produced by the crowd and explore methods for aggregating them. We analyze how different automatic methods are affected by the quantity and quality of the allocated resources, as well as by the instance selection strategies used in each active learning round, aiming to strike a balance between cost and performance.
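
The loop below is a minimal sketch of the general pattern the abstract describes: redundant crowd labels are aggregated into a single training label, a classifier is retrained, and the candidate pairs it is least certain about are sent back to the crowd in the next round. It is not the paper's implementation; the helper get_crowd_labels, the feature representation of publication pairs, and the choice of logistic regression with margin-based uncertainty sampling are illustrative assumptions.

# Minimal sketch of a crowd-backed active-learning loop; NOT the paper's
# implementation. get_crowd_labels and the feature layout are hypothetical.
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

def majority_vote(crowd_labels):
    """Aggregate redundant, possibly noisy worker votes into one label per item."""
    return [Counter(votes).most_common(1)[0][0] for votes in crowd_labels]

def uncertainty_sampling(model, X_pool, batch_size):
    """Select the pool items the current model is least certain about."""
    proba = model.predict_proba(X_pool)
    margin = np.abs(proba[:, 1] - 0.5)      # distance from the decision boundary
    return np.argsort(margin)[:batch_size]  # smallest margin = most uncertain

def active_learning_round(model, X_labeled, y_labeled, X_pool, get_crowd_labels,
                          batch_size=50):
    """One round: train, pick uncertain pairs, query the crowd, aggregate labels."""
    model.fit(X_labeled, y_labeled)
    query_idx = uncertainty_sampling(model, X_pool, batch_size)
    # get_crowd_labels would post the selected items to a crowdsourcing platform
    # and return a list of redundant worker votes per item.
    new_labels = majority_vote(get_crowd_labels(query_idx))
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled, X_pool

# Usage: X_pool would hold similarity features for candidate publication pairs
# (e.g. title/author/venue similarities); labels are 1 for duplicate, 0 otherwise.
# model = LogisticRegression()
# for _ in range(num_rounds):
#     model, X_l, y_l, X_pool = active_learning_round(model, X_l, y_l, X_pool,
#                                                     get_crowd_labels)

Majority voting is only the simplest stand-in for the label-aggregation step; weighted schemes that estimate per-worker reliability (for example Dawid-Skene-style EM) are common alternatives for the noisy-label problem the abstract raises.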


Cited By

  • (2019) "What You Sow, So Shall You Reap! Toward Preselection Mechanisms for Macrotask Crowdsourcing." In Macrotask Crowdsourcing, pp. 163-188. DOI: 10.1007/978-3-030-12334-5_6. Online publication date: 7-Aug-2019.
  • (2016) "Crowdlearning: A framework for collaborative and personalized learning." 2016 IEEE Frontiers in Education Conference (FIE), pp. 1-9. DOI: 10.1109/FIE.2016.7757355. Online publication date: Oct-2016.
  • (2014) "Profiling Flood Risk through Crowdsourced Flood Level Reports." 2014 International Conference on IT Convergence and Security (ICITCS), pp. 1-4. DOI: 10.1109/ICITCS.2014.7021800. Online publication date: Oct-2014.

      Published In

      WIMS '14: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14)
      June 2014
      506 pages
      ISBN:9781450325387
      DOI:10.1145/2611040

      In-Cooperation

      • Aristotle University of Thessaloniki

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Active Learning
      2. Crowdsourcing
      3. Human Computation
      4. Machine Learning

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      WIMS '14

      Acceptance Rates

      WIMS '14 Paper Acceptance Rate 41 of 90 submissions, 46%;
      Overall Acceptance Rate 140 of 278 submissions, 50%
