Work in Progress

Unlocking the Tacit Knowledge of Data Work in Machine Learning

Published: 19 April 2023

Abstract

Creating datasets for ML is an inherently human endeavor, as the data’s heterogeneity mandates human intervention. However, most data workflows are one-off and hardly transferable, which leads to a lack of standardization and reusability. There has been a push to impose more structure on the data work process, but little is known about the implicit or "tacit" knowledge of data workers, i.e., know-how that is difficult to transfer to others. Identifying and formalizing this knowledge can help data work improve, moving it from its current "exploration" toward more systematic "engineering." In this study, we interviewed 19 ML practitioners to find out what tacit knowledge they use and why they use it. We identified the following themes: 1) data is context- and situation-dependent, 2) human workers are inseparable from data, and 3) models must be understood to build data. Finally, we discuss future systematic support and research for converting implicit knowledge into explicit knowledge.

Supplementary Material

MP4 File (3544549.3585616-talk-video.mp4)
Pre-recorded Video Presentation
MP4 File (3544549.3585616-video-preview.mp4)
Video Preview

Cited By

  • (2024) Do good: Strategies for leading an inclusive data science or statistics consulting team. Stat 13:2. DOI: 10.1002/sta4.687. Online publication date: 12-May-2024

    Published In

    CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems
    April 2023
    3914 pages
    ISBN:9781450394222
    DOI:10.1145/3544549
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Data Construction
    2. Machine Learning
    3. Practitioners
    4. Semi-structured In-depth Interviews

    Qualifiers

    • Work in progress
    • Research
    • Refereed limited

    Conference

    CHI '23

    Acceptance Rates

    Overall Acceptance Rate 6,164 of 23,696 submissions, 26%
