research-article

Crowdsourcing Truth Inference Based on Label Confidence Clustering

Authors:

Xindong Wu

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 4

Article No.: 46, Pages 1 - 20

https://doi.org/10.1145/3556545

Published: 24 February 2023

Abstract

Truth inference can help solve some difficult problems of data integration in crowdsourcing. Crowdsourced workers are not experts and their labeling ability varies greatly; therefore, in practical applications, it is difficult to determine whether the labels collected from a crowdsourcing platform are correct. This article proposes a novel algorithm called truth inference based on label confidence clustering (TILCC) to improve the quality of integrated labels for the single-choice classification problem in crowdsourcing labeling tasks. We obtain the label confidence via worker reliability, which is calculated from multiple noisy labels using a truth discovery method. We then generate the clustering features and use the K-means algorithm to group all the tasks into K clusters. Each cluster corresponds to a specific class, and the tasks in a cluster are assigned that class as their label. Compared with six state-of-the-art methods (MV, ZenCrowd, PM, CATD, GLAD, and GTIC) on 12 randomly selected real-world datasets, our algorithm shows several advantages: it needs no complex parameter settings, runs faster, and achieves significantly higher accuracy.
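
The pipeline outlined in the abstract (worker reliability, then label confidence, then K-means over the tasks) can be illustrated with a small sketch. This is not the authors' TILCC implementation: the abstract does not specify the truth discovery method, the clustering features, or how clusters are matched to classes. The sketch assumes a simple iterative weighted vote for reliability, a per-class share of reliability as the feature vector, one-hot initial centroids, and a largest-centroid-component rule for naming clusters. All function names are hypothetical.

```python
from collections import defaultdict


def weighted_confidence(votes, reliability, num_classes):
    """Share of worker reliability behind each class for one task (assumed feature)."""
    mass = [0.0] * num_classes
    for worker, label in votes.items():
        mass[label] += reliability[worker]
    total = sum(mass)
    if total == 0:
        return [1.0 / num_classes] * num_classes
    return [m / total for m in mass]


def estimate_reliability(labels, num_classes, rounds=10):
    """Toy stand-in for truth discovery: alternate weighted voting and agreement rates."""
    workers = {w for votes in labels.values() for w in votes}
    reliability = {w: 1.0 for w in workers}  # first round equals majority voting
    for _ in range(rounds):
        truth = {}
        for task, votes in labels.items():
            conf = weighted_confidence(votes, reliability, num_classes)
            truth[task] = conf.index(max(conf))
        hits, seen = defaultdict(int), defaultdict(int)
        for task, votes in labels.items():
            for worker, label in votes.items():
                seen[worker] += 1
                hits[worker] += int(label == truth[task])
        reliability = {w: hits[w] / seen[w] for w in workers}
    return reliability


def kmeans(points, num_clusters, rounds=20):
    """Plain Lloyd's K-means seeded with one-hot centroids (an assumption)."""
    dim = len(points[0])
    centroids = [[1.0 if d == k else 0.0 for d in range(dim)]
                 for k in range(num_clusters)]
    assignment = [0] * len(points)
    for _ in range(rounds):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignment[i] = dists.index(min(dists))
        for k in range(num_clusters):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def infer_labels(labels, num_classes):
    """labels maps task -> {worker: class index}; returns task -> inferred class."""
    reliability = estimate_reliability(labels, num_classes)
    tasks = sorted(labels)
    features = [weighted_confidence(labels[t], reliability, num_classes) for t in tasks]
    assignment, centroids = kmeans(features, num_classes)
    # Assumed rule: a cluster takes the class with the largest centroid component.
    cluster_class = [c.index(max(c)) for c in centroids]
    return {t: cluster_class[a] for t, a in zip(tasks, assignment)}


toy_labels = {
    "t1": {"a": 0, "b": 0, "c": 1},
    "t2": {"a": 1, "b": 1, "c": 0},
    "t3": {"a": 0, "b": 0, "c": 0},
    "t4": {"a": 1, "b": 1, "c": 1},
}
toy_result = infer_labels(toy_labels, num_classes=2)
```

In this toy input, worker "c" disagrees with the other two workers on half of the tasks, so its votes carry less weight in the confidence vectors that K-means clusters. A production version would likely use a vectorized K-means from a numerical library and a more careful reliability model.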

Index Terms

Crowdsourcing Truth Inference Based on Label Confidence Clustering
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering
  2. World Wide Web
    1. Web applications
      1. Crowdsourcing
    2. Web searching and information discovery
      1. Social tagging

Published In

ACM Transactions on Knowledge Discovery from Data Volume 17, Issue 4

May 2023

364 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3583065

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2023

Online AM: 17 August 2022

Accepted: 01 July 2022

Revised: 01 March 2022

Received: 01 January 2021

Published in TKDD Volume 17, Issue 4

Qualifiers

Research-article

Funding Sources

National Key Research and Development Program of China
Program for Innovative Research Team in University of the Ministry of Education
National Natural Science Foundation of China
