DOI: 10.1145/3488560.3510011

Challenges in Data Production for AI with Human-in-the-Loop

Published: 15 February 2022

Abstract

Today, successful Artificial Intelligence applications rely on three pillars: machine learning algorithms, hardware for running them, and data for training and evaluating models. Although algorithms and hardware have already become commodities, obtaining up-to-date, high-quality data at scale is still challenging, but it is possible by building hybrid human-computer pipelines, an approach called human-in-the-loop. This talk will show how to make a significant business impact using human-in-the-loop pipelines that combine machine learning with crowdsourcing. We will share the experience of one of the world's largest search engines, Yandex.
After a brief introduction to human-in-the-loop, we will describe two insightful case studies with a significant business impact at Yandex. First, we will show how to use human-in-the-loop with subjective human opinions to gather training data for learning-to-rank models in the online setting, which is crucial for recommendation, e-commerce, and search applications. Second, we will show how human-in-the-loop combined with spatial crowdsourcing keeps information on brick-and-mortar businesses up to date and transforms it into structured data, which is essential for socially impactful applications like online maps and directories.
Then, we will present the practical challenges of deploying human-in-the-loop pipelines, focusing on common issues with task design and quality control. We will demonstrate end-to-end task design techniques that fit open-ended and subjective questions better than the widely used classification tasks. We will present our recent advances in this field, including the use of large-scale language models (like BART and T5) for sequence aggregation. We will also present new evaluation datasets for textual and subjective annotation, which are publicly available at https://toloka.ai/datasets. We will discuss the problem of reliable quality control in crowdsourcing by describing the relevant computational methods for aggregation, quality estimation, and model selection. Finally, we will demonstrate Crowd-Kit, an open-source library that offers battle-tested and platform-agnostic implementations of all the above-described methods in Python: https://github.com/Toloka/crowd-kit.
Overall, we will share our experience in running impactful human-in-the-loop pipelines in production and in overcoming the common practical challenges with available, reliable open-source technologies, datasets, and tools.
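As a rough illustration of what aggregation and quality estimation can mean in the simplest case, the sketch below applies majority voting to crowd labels and then scores each worker by agreement with the aggregated answer. It is a minimal sketch in plain Python, not Crowd-Kit's API and not the method used in production at Yandex; the function names `majority_vote` and `worker_agreement` and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict

# Toy data invented for illustration: (task, worker, label) triples.
ANNOTATIONS = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
    ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ("t3", "w1", "cat"), ("t3", "w2", "dog"), ("t3", "w3", "dog"),
]


def majority_vote(annotations):
    """Return {task: label}, picking the most frequent label per task.

    Ties are broken by the alphabetical order of the label's string form,
    so the result is deterministic.
    """
    votes = defaultdict(Counter)
    for task, _worker, label in annotations:
        votes[task][label] += 1
    return {
        task: min(counts, key=lambda lab: (-counts[lab], str(lab)))
        for task, counts in votes.items()
    }


def worker_agreement(annotations, aggregated):
    """Return {worker: share of the worker's labels matching the aggregate}."""
    hits = Counter()
    totals = Counter()
    for task, worker, label in annotations:
        totals[worker] += 1
        hits[worker] += int(aggregated[task] == label)
    return {worker: hits[worker] / totals[worker] for worker in totals}


if __name__ == "__main__":
    aggregated = majority_vote(ANNOTATIONS)
    print(aggregated)
    print(worker_agreement(ANNOTATIONS, aggregated))
```

Agreement with the majority is only a crude proxy for worker quality, because it rewards workers who follow the crowd even when the crowd is wrong. Probabilistic aggregation models estimate labels and worker reliability jointly, which is closer to the computational methods the talk refers to.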

References

[1] Florian Daniel, Pavel Kucherbaev, Cinzia Cappiello, Boualem Benatallah, and Mohammad Allahbakhsh. 2018. Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions. Comput. Surveys, Vol. 51, 1 (2018), 7:1-7:40. https://doi.org/10.1145/3148148
[2] Nikita Pavlichenko, Ivan Stelmakh, and Dmitry Ustalov. 2021. CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. 14 pages. arXiv: 2107.01091 [cs.SD] https://openreview.net/forum?id=3_hgF1NAXU7
[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1-67. https://jmlr.org/papers/v21/20-074.html
[4] Dmitry Ustalov, Nikita Pavlichenko, Vladimir Losev, Iulian Giliazev, and Evgeny Tulin. 2021. A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python. In The Ninth AAAI Conference on Human Computation and Crowdsourcing: Works-in-Progress and Demonstration Track (HCOMP 2021). 4 pages. arXiv: 2109.08584 [cs.HC] https://www.humancomputation.com/assets/wips_demos/HCOMP_2021_paper_85.pdf

Cited By

  • (2024) Applications, Challenges, and Future Directions of Human-in-the-Loop Learning. IEEE Access 12 (75735-75760). DOI: 10.1109/ACCESS.2024.3401547. Online publication date: 2024
  • (2024) Cognitive Programming Assistant. Advances in Information and Communication (1-11). DOI: 10.1007/978-3-031-54053-0_1. Online publication date: 17-Mar-2024

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022
1690 pages
ISBN:9781450391320
DOI:10.1145/3488560
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. aggregation
  2. crowdsourcing
  3. data production
  4. evaluation
  5. human-in-the-loop
  6. information retrieval
  7. learning-to-rank
  8. quality control
  9. spatial crowdsourcing

Qualifiers

  • Abstract

Conference

WSDM '22

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
