ABSTRACT
At present, the data in most text classification tasks are in a single language only, yet the information contained in bilingual text can be fully exploited when a Chinese-English parallel corpus is available. This paper proposes a classification model that combines text features from the pre-trained models ERNIE and BERT: ERNIE is used to encode the Chinese corpus, BERT is used to encode the English corpus, and TextCNN fuses the resulting text feature vectors, improving classification on parallel corpora. Comparative experiments were performed on the dataset, and the results show that this method achieves a better classification effect on parallel corpora.
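The pipeline described above (two monolingual encoders whose outputs are fused by a TextCNN) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for the ERNIE and BERT token embeddings, and `textcnn_pool` is a hypothetical, simplified TextCNN block (one convolution width, max-over-time pooling, no trained weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder outputs: in the paper, ERNIE encodes the
# Chinese sentence and BERT encodes the English sentence, each producing
# a sequence of token embeddings of shape (seq_len, hidden).
zh_features = rng.normal(size=(32, 768))   # placeholder for ERNIE output
en_features = rng.normal(size=(32, 768))   # placeholder for BERT output

# Fuse the two views by stacking along the sequence axis, so the
# convolution slides over both the Chinese and English token features.
fused = np.concatenate([zh_features, en_features], axis=0)  # (64, 768)

def textcnn_pool(x, kernel_size=3, n_filters=4):
    """Minimal TextCNN block: 1-D convolution over the token sequence,
    then max-over-time pooling to a fixed-size vector."""
    seq_len, hidden = x.shape
    w = rng.normal(size=(n_filters, kernel_size, hidden)) * 0.01
    conv = np.stack([
        np.array([np.sum(w[f] * x[i:i + kernel_size])
                  for i in range(seq_len - kernel_size + 1)])
        for f in range(n_filters)
    ])                          # (n_filters, seq_len - kernel_size + 1)
    return conv.max(axis=1)     # max-over-time -> (n_filters,)

pooled = textcnn_pool(fused)    # fixed-size fused representation
print(pooled.shape)             # (4,)
```

In practice the pooled vector would feed a softmax classifier, and multiple kernel widths would be used in parallel, as is standard for TextCNN.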