research-article

The Exploratory Labeling Assistant: Mixed-Initiative Label Curation with Large Document Collections

Authors:

Cristian Felix,

Aritra Dasgupta,

Enrico Bertini

UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology

Pages 153 - 164

https://doi.org/10.1145/3242587.3242596

Published: 11 October 2018

Abstract

In this paper, we define the concept of exploratory labeling: the use of computational and interactive methods to help analysts categorize groups of documents into a set of unknown and evolving labels. While many computational methods exist to analyze data and build models once the data is organized around a set of predefined categories or labels, few methods address the problem of reliably discovering and curating such labels in the first place. As a first step toward bridging this gap, we propose an interactive visual data analysis method that integrates human-driven label ideation, specification, and refinement with machine-driven recommendations. The proposed method enables the user to progressively discover and ideate labels in an exploratory fashion and to specify rules that can be used to automatically match sets of documents to labels. To support this process of ideation, specification, and evaluation of the labels, we use unsupervised machine learning methods that provide suggestions and data summaries. We evaluate our method by applying it to a real-world labeling problem and through controlled user studies, which we use to identify and reflect on patterns of interaction emerging from exploratory labeling activities.
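
The abstract does not say what form the label rules or the machine-driven suggestions take, so the following is only a minimal Python sketch under stated assumptions. It assumes a rule is a set of keywords per label, and it substitutes a plain term-frequency count over still-unmatched documents for the unsupervised methods the paper mentions. The names `match_labels` and `suggest_terms`, the sample documents, and the stopword list are all hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def match_labels(documents, rules):
    """Return {label: [document indices]} for documents containing any rule keyword.

    Assumption: a rule is just a set of keywords; the paper's rule format may differ.
    """
    matches = {label: [] for label in rules}
    for i, doc in enumerate(documents):
        tokens = set(tokenize(doc))
        for label, keywords in rules.items():
            if tokens & keywords:
                matches[label].append(i)
    return matches


def suggest_terms(documents, matches, stopwords, k=3):
    """Suggest the k most frequent non-stopword terms in documents no rule matched.

    Assumption: raw term frequency stands in for an unsupervised recommender.
    """
    covered = {i for indices in matches.values() for i in indices}
    counts = Counter()
    for i, doc in enumerate(documents):
        if i not in covered:
            counts.update(t for t in tokenize(doc) if t not in stopwords)
    return [term for term, _ in counts.most_common(k)]


# Hypothetical review-like documents and analyst-written rules.
documents = [
    "The pasta was delicious and the pizza was great",
    "Slow service and a rude waiter",
    "Parking was impossible and parking fees were high",
]
rules = {"food": {"pasta", "pizza"}, "service": {"service", "waiter"}}
stopwords = {"the", "was", "and", "a", "were"}

matches = match_labels(documents, rules)
suggestions = suggest_terms(documents, matches, stopwords, k=1)
print(matches)
print(suggestions)
```

In this toy loop the analyst would read the suggested term, decide whether it deserves a new label, add a rule for it, and rerun the matching, which mirrors the progressive ideate-specify-evaluate cycle described in the abstract.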

Supplementary Material

suppl.mov (ufp1056.mp4): Supplemental video

suppl.mov (ufp1056p.mp4): Supplemental video

MP4 File (p153-felix.mp4)



Index Terms

Human-centered computing


Published In

UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology

October 2018

1016 pages

ISBN: 9781450359481

DOI: 10.1145/3242587

General Chairs: Patrick Baudisch (Hasso-Plattner Institute, Germany), Albrecht Schmidt (LMU, Germany)

Program Chair: Andy Wilson (Microsoft Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

CAPES Foundation, Ministry of Education of Brazil

Conference

UIST '18

UIST '18: The 31st Annual ACM Symposium on User Interface Software and Technology

October 14, 2018

Berlin, Germany

Acceptance Rates

UIST '18 Paper Acceptance Rate 80 of 375 submissions, 21%

Overall Acceptance Rate 561 of 2,567 submissions, 22%


