DOI: 10.1145/3430036.3430060
research-article

Cluster-clean-label: an interactive machine learning approach for labeling high-dimensional data

Published: 08 December 2020

Abstract

One of the major problems of applying supervised machine learning methods to real-world problems is the absence of labeled data. Labeling huge amounts of data is time-consuming and cost-intensive. Moreover, in many cases, labels can only be assigned by domain experts like medical doctors or engineers, who have little time and do not necessarily have profound machine learning knowledge. In this paper, we propose an efficient interactive cluster-clean-label approach. First, to visualize the potentially huge amount of data, principal component analysis followed by t-SNE projection is applied. On the 2-dimensional representation of the data, HDBSCAN clustering is utilized to identify groups of potentially similar class membership. Subsequently, anomaly detection in the form of an autoencoder is applied to each cluster, and instances that are likely to belong to different classes are suggested to the user. The user decides which of these suggested instances to include and restarts the anomaly detection process with the remaining subset of instances. This iterative process is repeated until the user is satisfied with the clusters' purity. Eventually, labels are assigned to the clusters. The approach is evaluated by a user study with 25 participants using the initially unlabeled MNIST data set, where on average users were able to label 91.59% of the data set, with an accuracy of 98.99%. A video showing the approach is available: https://youtu.be/RsLI0dg90qE.
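The iterative "clean" step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract uses an autoencoder for anomaly detection, whereas this sketch substitutes a much simpler stand-in score (Euclidean distance to the cluster centroid) so that it runs with the Python standard library alone. The names `anomaly_scores`, `clean_cluster`, `decide`, `n_suggest`, and `max_rounds` are hypothetical, and the user's decision is simulated by a callback.

```python
from math import dist
from statistics import fmean
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def anomaly_scores(points: Sequence[Point]) -> List[float]:
    """Stand-in anomaly score: distance to the cluster centroid.

    Assumption: the paper uses an autoencoder here; this simple
    score only illustrates the surrounding loop.
    """
    centroid = tuple(fmean(axis) for axis in zip(*points))
    return [dist(tuple(p), centroid) for p in points]


def clean_cluster(
    points: Sequence[Point],
    decide: Callable[[int], bool],
    n_suggest: int = 3,
    max_rounds: int = 10,
) -> Tuple[List[int], List[int]]:
    """Iteratively suggest the most anomalous instances of one cluster.

    `decide(i)` stands for the user: it returns True if instance i
    should stay in the cluster. Scoring restarts on the remaining
    subset until no suggested instance is rejected.
    """
    kept = list(range(len(points)))
    removed: List[int] = []
    for _ in range(max_rounds):
        if len(kept) <= 1:
            break
        scores = anomaly_scores([points[i] for i in kept])
        ranked = sorted(zip(scores, kept), reverse=True)
        suggested = [i for _, i in ranked[:n_suggest]]
        rejected = [i for i in suggested if not decide(i)]
        if not rejected:
            break  # the user accepts the cluster's purity
        removed.extend(rejected)
        kept = [i for i in kept if i not in rejected]
    return kept, removed


# Toy cluster: six points of class 0 and two stray points of class 1.
points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
          (-0.2, -0.2), (0.0, -0.1), (10.0, 10.0), (9.0, 11.0)]
true_class = [0, 0, 0, 0, 0, 0, 1, 1]

# Simulated user: keeps an instance only if it belongs to class 0.
kept, removed = clean_cluster(points, decide=lambda i: true_class[i] == 0)
print("kept:", kept)
print("removed:", sorted(removed))
```

In this toy run the two stray points are suggested in the first round and rejected, and the second round ends the loop because every remaining suggestion is accepted. A single label could then be assigned to all kept instances of the cluster.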

    Published In

    VINCI '20: Proceedings of the 13th International Symposium on Visual Information Communication and Interaction
    December 2020
    205 pages
    ISBN:9781450387507
    DOI:10.1145/3430036

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Conference

    VINCI 2020

    Acceptance Rates

    Overall Acceptance Rate 71 of 193 submissions, 37%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 59
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 15 Feb 2025

    Citations

    Cited By

    • (2024) Detection of Zero-Day Attacks in a Software-Defined LEO Constellation Network Using Enhanced Network Metric Predictions. IEEE Open Journal of the Communications Society, 5:6611-6634. DOI: 10.1109/OJCOMS.2024.3481965. Online publication date: 2024
    • (2024) Integrating categorical and continuous data in a cluster-then-classify methodology for predicting undergraduate student success. 2024 IEEE International Conference on Big Data (BigData), pages 8118-8126. DOI: 10.1109/BigData62323.2024.10825671. Online publication date: 15-Dec-2024
    • (2022) Active Pattern Classification for Automatic Visual Exploration of Multi-Dimensional Data. Applied Sciences, 12(22):11386. DOI: 10.3390/app122211386. Online publication date: 10-Nov-2022
    • (2022) TRAFFICVIS: Visualizing Organized Activity and Spatio-Temporal Patterns for Detecting and Labeling Human Trafficking. IEEE Transactions on Visualization and Computer Graphics, pages 1-10. DOI: 10.1109/TVCG.2022.3209403. Online publication date: 2022
    • (2022) ConfusionVis. Knowledge-Based Systems, 247(C). DOI: 10.1016/j.knosys.2022.108651. Online publication date: 8-Jul-2022
    • (2022) VisGIL: machine learning-based visual guidance for interactive labeling. The Visual Computer, 39(10):5097-5119. DOI: 10.1007/s00371-022-02648-2. Online publication date: 25-Sep-2022
    • (2020) The machine learning model as a guide. Proceedings of the 13th International Symposium on Visual Information Communication and Interaction, pages 1-8. DOI: 10.1145/3430036.3430058. Online publication date: 8-Dec-2020
