ABSTRACT
Semi-supervised text classification (STC) has been extensively studied because it reduces the cost of human annotation. However, existing research assumes that unlabeled data contain only in-distribution texts, which is unrealistic. This paper extends STC to a more practical Open-set Semi-supervised Text Classification (OSTC) setting, in which the unlabeled data may contain out-of-distribution (OOD) texts. The main challenge in OSTC is the false-positive inference problem caused by inadvertently including OOD texts during training. To address this problem, we first develop baseline models that use outlier detectors for hard OOD-data filtering in a pipeline procedure. We then propose a Latent Outlier Softening (LOS) framework that integrates semi-supervised training and outlier detection within probabilistic latent-variable modeling. LOS softens the impact of OOD texts through the Expectation-Maximization (EM) algorithm and weighted entropy maximization. Experiments on three newly created datasets show that LOS significantly outperforms the baselines.
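The abstract contrasts two ways of handling OOD texts in the unlabeled pool: hard filtering (the pipeline baselines) versus soft down-weighting (the idea behind LOS). The following minimal NumPy sketch illustrates that distinction only; the function names, the maximum-softmax-probability detector, and the entropy-based weighting heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hard_filter(probs, threshold=0.7):
    # Pipeline baseline: keep only unlabeled texts whose maximum
    # softmax probability exceeds a threshold (an MSP-style outlier
    # detector); everything below it is treated as OOD and discarded
    # before semi-supervised training.
    return probs.max(axis=1) >= threshold

def soft_weights(probs):
    # Soft alternative: instead of a hard keep/drop decision,
    # down-weight likely-OOD texts. Here the weight is
    # 1 - normalized predictive entropy, so confident (likely
    # in-distribution) texts contribute more to the training loss.
    k = probs.shape[1]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return 1.0 - entropy / np.log(k)
```

A confident prediction such as (0.9, 0.05, 0.05) passes the hard filter and receives a weight near 1, while a near-uniform prediction is either discarded outright or assigned a weight near 0, depending on which scheme is used.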