DOI: 10.1145/3242587.3242596

The Exploratory Labeling Assistant: Mixed-Initiative Label Curation with Large Document Collections

Published: 11 October 2018

Abstract

In this paper, we define the concept of exploratory labeling: the use of computational and interactive methods to help analysts categorize groups of documents into a set of unknown and evolving labels. While many computational methods exist to analyze data and build models once the data is organized around a set of predefined categories or labels, few methods address the problem of reliably discovering and curating such labels in the first place. As a first step toward bridging this gap, we propose an interactive visual data analysis method that integrates human-driven label ideation, specification, and refinement with machine-driven recommendations. The proposed method enables the user to progressively discover and ideate labels in an exploratory fashion and to specify rules that automatically match sets of documents to labels. To support the ideation, specification, and evaluation of labels, we use unsupervised machine learning methods that provide suggestions and data summaries. We evaluate our method by applying it to a real-world labeling problem and through controlled user studies, which let us identify and reflect on the patterns of interaction that emerge from exploratory labeling activities.
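The idea of rules that automatically match sets of documents to labels can be illustrated with a minimal sketch. The abstract does not describe the rule language the system actually uses, so everything below is an assumption made for illustration: the `LabelRule` class, its `any_of` and `none_of` keyword sets, and the `apply_rules` helper are hypothetical names, and plain keyword matching stands in for whatever matching the paper implements.

```python
from dataclasses import dataclass, field


@dataclass
class LabelRule:
    """Hypothetical keyword rule: a document matches when it contains at
    least one term from `any_of` and no term from `none_of`."""
    label: str
    any_of: set[str] = field(default_factory=set)
    none_of: set[str] = field(default_factory=set)

    def matches(self, text: str) -> bool:
        # Naive tokenization: lowercase and split on whitespace.
        # Punctuation is not stripped in this sketch.
        tokens = set(text.lower().split())
        return bool(tokens & self.any_of) and not (tokens & self.none_of)


def apply_rules(documents, rules):
    """Return (assignments, unlabeled).

    `assignments` maps each label to the indices of matching documents;
    a document may receive several labels. `unlabeled` lists the indices
    that no rule matched, i.e. candidates for further label ideation.
    """
    assignments = {rule.label: [] for rule in rules}
    unlabeled = []
    for idx, text in enumerate(documents):
        matched = False
        for rule in rules:
            if rule.matches(text):
                assignments[rule.label].append(idx)
                matched = True
        if not matched:
            unlabeled.append(idx)
    return assignments, unlabeled


if __name__ == "__main__":
    docs = [
        "great pizza and pasta",
        "slow service tonight",
        "pizza was cold and service slow",
        "parking lot",
    ]
    rules = [
        LabelRule("food", any_of={"pizza", "pasta"}),
        LabelRule("service", any_of={"service"}),
    ]
    print(apply_rules(docs, rules))
```

Returning the unmatched documents separately reflects the exploratory setting described above: the label set is not fixed in advance, so documents that no rule covers are the natural place to look for new labels.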

Supplementary Material

  • ufp1056.mp4: Supplemental video
  • ufp1056p.mp4: Supplemental video
  • p153-felix.mp4: MP4 file

Published In

UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology
October 2018
1016 pages
ISBN:9781450359481
DOI:10.1145/3242587
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document labeling
  2. exploratory labeling
  3. text analysis
  4. visualization

Qualifiers

  • Research-article

Funding Sources

  • CAPES Foundation, Ministry of Education of Brazil

Conference

UIST '18

Acceptance Rates

UIST '18 Paper Acceptance Rate 80 of 375 submissions, 21%;
Overall Acceptance Rate 561 of 2,567 submissions, 22%

Cited By

  • (2024) Man and the Machine: Effects of AI-assisted Human Labeling on Interactive Annotation of Real-time Video Streams. ACM Transactions on Interactive Intelligent Systems 14:2 (1-22). DOI: 10.1145/3649457. Online publication date: 29-Feb-2024
  • (2024) SenseMate: An Accessible and Beginner-Friendly Human-AI Platform for Qualitative Data Analysis. Proceedings of the 29th International Conference on Intelligent User Interfaces (922-939). DOI: 10.1145/3640543.3645194. Online publication date: 18-Mar-2024
  • (2024) Understanding Novice's Annotation Process For 3D Semantic Segmentation Task With Human-In-The-Loop. Proceedings of the 29th International Conference on Intelligent User Interfaces (444-454). DOI: 10.1145/3640543.3645150. Online publication date: 18-Mar-2024
  • (2024) Marco: Supporting Business Document Workflows via Collection-Centric Information Foraging with Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-20). DOI: 10.1145/3613904.3641969. Online publication date: 11-May-2024
  • (2024) DaedalusData: Exploration, Knowledge Externalization and Labeling of Particles in Medical Manufacturing: A Design Study. IEEE Transactions on Visualization and Computer Graphics 31:1 (54-64). DOI: 10.1109/TVCG.2024.3456329. Online publication date: 23-Sep-2024
  • (2023) Polyphony: an Interactive Transfer Learning Framework for Single-Cell Data Analysis. IEEE Transactions on Visualization and Computer Graphics 29:1 (591-601). DOI: 10.1109/TVCG.2022.3209408. Online publication date: Jan-2023
  • (2022) Active Pattern Classification for Automatic Visual Exploration of Multi-Dimensional Data. Applied Sciences 12:22 (11386). DOI: 10.3390/app122211386. Online publication date: 10-Nov-2022
  • (2022) OneLabeler: A Flexible System for Building Data Labeling Tools. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (1-22). DOI: 10.1145/3491102.3517612. Online publication date: 29-Apr-2022
  • (2022) Towards Visual Explainable Active Learning for Zero-Shot Classification. IEEE Transactions on Visualization and Computer Graphics 28:1 (791-801). DOI: 10.1109/TVCG.2021.3114793. Online publication date: 1-Jan-2022
  • (2021) LabelUX! Guidelines to support software engineers to design data labeling systems. Proceedings of the XX Brazilian Symposium on Software Quality (1-10). DOI: 10.1145/3493244.3493252. Online publication date: 8-Nov-2021
