ABSTRACT
At present, the data in most text classification tasks are in a single language only, yet the information contained in bilingual text can be fully exploited when a Chinese-English parallel corpus is available. This paper proposes a classification model that combines text features from the pre-trained models ERNIE and BERT: ERNIE is used to encode the Chinese corpus, BERT is used to encode the English corpus, and TextCNN fuses the resulting text feature vectors, improving classification on parallel corpora. Comparative experiments were performed on the dataset, and the results show that this method achieves a better classification effect on parallel corpora.
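The pipeline described above (two monolingual encoders whose outputs are fused by a TextCNN) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for the ERNIE and BERT token embeddings, and `textcnn_pool` is a hypothetical, simplified TextCNN block (one convolution width, max-over-time pooling, no trained weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder outputs: in the paper, ERNIE encodes the
# Chinese sentence and BERT encodes the English sentence, each producing
# a sequence of token embeddings of shape (seq_len, hidden).
zh_features = rng.normal(size=(32, 768))   # placeholder for ERNIE output
en_features = rng.normal(size=(32, 768))   # placeholder for BERT output

# Fuse the two views by stacking along the sequence axis, so the
# convolution slides over both the Chinese and English token features.
fused = np.concatenate([zh_features, en_features], axis=0)  # (64, 768)

def textcnn_pool(x, kernel_size=3, n_filters=4):
    """Minimal TextCNN block: 1-D convolution over the token sequence,
    then max-over-time pooling to a fixed-size vector."""
    seq_len, hidden = x.shape
    w = rng.normal(size=(n_filters, kernel_size, hidden)) * 0.01
    conv = np.stack([
        np.array([np.sum(w[f] * x[i:i + kernel_size])
                  for i in range(seq_len - kernel_size + 1)])
        for f in range(n_filters)
    ])                          # (n_filters, seq_len - kernel_size + 1)
    return conv.max(axis=1)     # max-over-time -> (n_filters,)

pooled = textcnn_pool(fused)    # fixed-size fused representation
print(pooled.shape)             # (4,)
```

In practice the pooled vector would feed a softmax classifier, and multiple kernel widths would be used in parallel, as is standard for TextCNN.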