ABSTRACT
Semi-supervised text classification (STC) has been extensively studied because it reduces the cost of human annotation. However, existing research assumes that unlabeled data contain only in-distribution texts, which is unrealistic. This paper extends STC to a more practical Open-set Semi-supervised Text Classification (OSTC) setting, in which the unlabeled data may contain out-of-distribution (OOD) texts. The main challenge in OSTC is the false-positive inference problem caused by inadvertently including OOD texts during training. To address this problem, we first develop baseline models that use outlier detectors for hard OOD-data filtering in a pipeline procedure. We then propose a Latent Outlier Softening (LOS) framework that integrates semi-supervised training and outlier detection within probabilistic latent-variable modeling. LOS softens the impact of OOD texts through the Expectation-Maximization (EM) algorithm and weighted entropy maximization. Experiments on three newly created datasets show that LOS significantly outperforms the baselines.
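The abstract contrasts two ways of handling OOD texts in the unlabeled pool: hard filtering (the pipeline baselines) versus soft down-weighting (the idea behind LOS). The following minimal NumPy sketch illustrates that distinction only; the function names, the maximum-softmax-probability detector, and the entropy-based weighting heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hard_filter(probs, threshold=0.7):
    # Pipeline baseline: keep only unlabeled texts whose maximum
    # softmax probability exceeds a threshold (an MSP-style outlier
    # detector); everything below it is treated as OOD and discarded
    # before semi-supervised training.
    return probs.max(axis=1) >= threshold

def soft_weights(probs):
    # Soft alternative: instead of a hard keep/drop decision,
    # down-weight likely-OOD texts. Here the weight is
    # 1 - normalized predictive entropy, so confident (likely
    # in-distribution) texts contribute more to the training loss.
    k = probs.shape[1]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return 1.0 - entropy / np.log(k)
```

A confident prediction such as (0.9, 0.05, 0.05) passes the hard filter and receives a weight near 1, while a near-uniform prediction is either discarded outright or assigned a weight near 0, depending on which scheme is used.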