Abstract
Embedding representations of case texts encode documents as vectors that preserve much of the information in the original text. Existing text embedding methods usually rely on either statistical features or content features alone. However, case texts are characterized by similar structure, repeated words, and varying lengths, so neither statistical features nor content features alone can represent them effectively. In this paper, we propose a joint variational autoencoder (VAE) for case text embedding representation. We consider the statistical features and content features of case texts together and use a VAE to align the two kinds of features in the same latent space. We compare our representations with existing methods in terms of quality, relationship, and efficiency. The experimental results show that our method achieves good performance and outperforms models that use a single feature.
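To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a joint VAE that encodes two views of a case text, a statistical feature vector (e.g. TF-IDF) and a content feature vector (e.g. averaged word embeddings), into one shared latent space and reconstructs both views from it. All dimensions, layer choices, and loss weights are illustrative assumptions, not values from the paper.

```python
# Minimal joint-VAE sketch in PyTorch; architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVAE(nn.Module):
    def __init__(self, stat_dim=2000, content_dim=300, hidden_dim=256, latent_dim=64):
        super().__init__()
        # One encoder per view; both views are fused into a shared latent code.
        self.enc_stat = nn.Linear(stat_dim, hidden_dim)
        self.enc_content = nn.Linear(content_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim * 2, latent_dim)
        self.logvar = nn.Linear(hidden_dim * 2, latent_dim)
        # One decoder per view, reconstructing from the shared code.
        self.dec_stat = nn.Linear(latent_dim, stat_dim)
        self.dec_content = nn.Linear(latent_dim, content_dim)

    def encode(self, x_stat, x_content):
        h = torch.cat([F.relu(self.enc_stat(x_stat)),
                       F.relu(self.enc_content(x_content))], dim=-1)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x_stat, x_content):
        mu, logvar = self.encode(x_stat, x_content)
        z = self.reparameterize(mu, logvar)
        return self.dec_stat(z), self.dec_content(z), mu, logvar


def loss_fn(x_stat, x_content, recon_stat, recon_content, mu, logvar, beta=1.0):
    # Reconstruction terms for both views plus the standard KL regularizer.
    rec = (F.mse_loss(recon_stat, x_stat, reduction="sum")
           + F.mse_loss(recon_content, x_content, reduction="sum"))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl


# Toy usage: the mean vector `mu` can serve as the case text embedding.
model = JointVAE()
x_stat = torch.rand(8, 2000)     # e.g. TF-IDF vectors for 8 case texts
x_content = torch.rand(8, 300)   # e.g. averaged word embeddings
recon_s, recon_c, mu, logvar = model(x_stat, x_content)
loss = loss_fn(x_stat, x_content, recon_s, recon_c, mu, logvar)
loss.backward()
```

In this sketch the two feature views share a single posterior, so reconstructing both views from one code is what forces the statistical and content information into the same space; the actual fusion and training objective used in the paper may differ.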









Acknowledgements
The work was supported by National Key Research and Development Plan (Grant Nos. 2018YFC0830101, 2018YFC0830105, 2018YFC0830100), National Natural Science Foundation of China (Grant Nos. 61972186, 61761026, 61732005, 61672271 and 61762056), Yunnan high-tech industry development project (Grant No. 201606), Yunnan provincial major science and technology special plan projects: digitization research and application demonstration of Yunnan characteristic industry (Grant No. 202002AD080001-5), Yunnan Basic Research Project (Grant Nos. 202001AS070014, 2018FB104), and Talent Fund for Kunming University of Science and Technology (Grant No. KKSY201703005).
Cite this article
Song, R., Gao, S., Yu, Z. et al. Case2vec: joint variational autoencoder for case text embedding representation. Int. J. Mach. Learn. & Cyber. 12, 2517–2528 (2021). https://doi.org/10.1007/s13042-021-01335-3