Challenges in Data Production for AI with Human-in-the-Loop

Author:

Dmitry Ustalov

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

Pages 1651 - 1652

https://doi.org/10.1145/3488560.3510011

Published: 15 February 2022

Abstract

Today, successful Artificial Intelligence applications rely on three pillars: machine learning algorithms, hardware for running them, and data for training and evaluating models. Although algorithms and hardware have already become commodities, obtaining up-to-date, high-quality data at scale is still challenging. It is possible, however, by building hybrid human-computer pipelines, an approach called human-in-the-loop. This talk will show how to make a significant business impact using human-in-the-loop pipelines that combine machine learning with crowdsourcing. We will share the experience of Yandex, one of the world's largest search engines.

After a brief introduction to human-in-the-loop, we will describe two case studies with a significant business impact at Yandex. First, we will show how to use human-in-the-loop with subjective human opinions to gather training data for learning-to-rank models in the online setting, which is crucial for recommendation, e-commerce, and search applications. Second, we will show how human-in-the-loop combined with spatial crowdsourcing keeps information on brick-and-mortar businesses up to date and transforms it into structured data, which is essential for socially impactful applications like online maps and directories.

Then, we will present the practical challenges of deploying human-in-the-loop pipelines, focusing on common issues with task design and quality control. We will demonstrate end-to-end task design techniques that fit open-ended and subjective questions better than the widely used classification tasks. We will present our recent advances in this field, including the use of large-scale language models (like BART and T5) for sequence aggregation. We will also show new evaluation datasets for textual and subjective annotation, which are publicly available at https://toloka.ai/datasets. We will discuss the problem of reliable quality control in crowdsourcing by describing the relevant computational methods for aggregation, quality estimation, and model selection. Finally, we will demonstrate Crowd-Kit, an open-source library that offers battle-tested and platform-agnostic Python implementations of all the methods described above: https://github.com/Toloka/crowd-kit.
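
To make the aggregation and quality estimation steps mentioned above more concrete, here is a minimal sketch in plain Python. It is not Crowd-Kit's API and not necessarily the method used in the talk: it assumes annotations arrive as (task, worker, label) triples, aggregates them by simple majority vote, and then estimates each worker's quality as their agreement rate with the aggregated labels. The function names, the tie-breaking rule, and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Aggregate (task, worker, label) triples into one label per task."""
    votes = defaultdict(Counter)
    for task, _worker, label in annotations:
        votes[task][label] += 1
    # Assumption: ties are broken deterministically by the smallest label.
    return {
        task: min(counts, key=lambda lab: (-counts[lab], lab))
        for task, counts in votes.items()
    }


def worker_agreement(annotations, aggregated):
    """Estimate worker quality as the share of answers matching the aggregate."""
    hits = Counter()
    total = Counter()
    for task, worker, label in annotations:
        total[worker] += 1
        hits[worker] += int(label == aggregated[task])
    return {worker: hits[worker] / total[worker] for worker in total}


# Hypothetical toy data: three workers label three tasks.
annotations = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
    ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ("t3", "w1", "cat"), ("t3", "w2", "dog"), ("t3", "w3", "dog"),
]

labels = majority_vote(annotations)
skills = worker_agreement(annotations, labels)
print(labels)
print(skills)
```

Majority vote treats every worker as equally reliable. Methods such as Dawid-Skene refine this by weighting each answer with an estimated per-worker error model, which is where the quality estimates feed back into aggregation.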

Overall, we will share our experience of running impactful human-in-the-loop pipelines in production and of overcoming common practical challenges with available, reliable open-source technologies, datasets, and tools.

References

[1] Florian Daniel, Pavel Kucherbaev, Cinzia Cappiello, Boualem Benatallah, and Mohammad Allahbakhsh. 2018. Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. Comput. Surveys, Vol. 51, 1 (2018), 7:1--7:40. https://doi.org/10.1145/3148148

[2] Nikita Pavlichenko, Ivan Stelmakh, and Dmitry Ustalov. 2021. CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. 14 pages. arXiv:2107.01091 [cs.SD]. https://openreview.net/forum?id=3_hgF1NAXU7

[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1--67. https://jmlr.org/papers/v21/20-074.html

[4] Dmitry Ustalov, Nikita Pavlichenko, Vladimir Losev, Iulian Giliazev, and Evgeny Tulin. 2021. A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python. In The Ninth AAAI Conference on Human Computation and Crowdsourcing: Works-in-Progress and Demonstration Track (HCOMP 2021). 4 pages. arXiv:2109.08584 [cs.HC]. https://www.humancomputation.com/assets/wips_demos/HCOMP_2021_paper_85.pdf

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

February 2022

1690 pages

ISBN: 9781450391320

DOI: 10.1145/3488560

General Chairs:
K. Selcuk Candan, Arizona State University, USA
Huan Liu, Arizona State University, USA

Program Chairs:
Leman Akoglu, Carnegie Mellon University, USA
Xin Luna Dong, Meta Platforms, Inc. (former Facebook), USA
Jiliang Tang, Michigan State University, USA

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM '22

WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining

February 21 - 25, 2022

AZ, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
