DOI: 10.1145/3565291.3565328
research-article

Research on parallel corpus classification based on pre-trained model

Published: 16 December 2022

Abstract

At present, most text classification tasks use data in a single language, yet in a Chinese-English parallel corpus the information carried by both languages can be fully exploited. This paper proposes a classification model that combines text features from the pre-trained models ERNIE and BERT. ERNIE processes the Chinese corpus, BERT processes the English corpus, and TextCNN fuses the resulting text feature vectors, so that classification performance on the parallel corpus can be improved. Comparative experiments were performed on the data set. The results show that this method achieves better classification performance on parallel corpora.
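The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the two feature sequences are combined, so the sketch assumes they are concatenated along the sequence axis before a TextCNN-style layer (convolution, ReLU, max-over-time pooling, linear output). Random vectors stand in for the ERNIE and BERT token features, and all names (`TinyTextCNN`, `conv_max_pool`), kernel widths, and dimensions are hypothetical.

```python
import math
import random


def conv_max_pool(seq, kernel, bias):
    """Slide one filter (k x d weights) over seq (L x d) and return the
    maximum ReLU activation over all window positions."""
    k = len(kernel)
    best = 0.0  # ReLU floor: negative activations are clipped to zero
    for start in range(len(seq) - k + 1):
        act = bias
        for offset in range(k):
            act += sum(w * x for w, x in zip(kernel[offset], seq[start + offset]))
        best = max(best, act)
    return best


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTextCNN:
    """Untrained TextCNN-style classifier over two token-feature sequences."""

    def __init__(self, dim, widths=(2, 3, 4), filters_per_width=2,
                 num_classes=3, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One (weights, bias) pair per filter, for every kernel width.
        self.filters = [(rand_matrix(k, dim), 0.0)
                        for k in widths for _ in range(filters_per_width)]
        self.out_weights = rand_matrix(num_classes, len(self.filters))
        self.out_bias = [0.0] * num_classes

    def forward(self, zh_feats, en_feats):
        # Assumption: fuse by concatenating the two sequences end to end.
        fused = zh_feats + en_feats
        pooled = [conv_max_pool(fused, kernel, bias)
                  for kernel, bias in self.filters]
        logits = [sum(w * p for w, p in zip(row, pooled)) + b
                  for row, b in zip(self.out_weights, self.out_bias)]
        return softmax(logits)


# Stand-ins for encoder outputs: 5 Chinese tokens and 7 English tokens,
# each represented by an 8-dimensional feature vector.
rng = random.Random(1)
dim = 8
zh_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
en_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(7)]

model = TinyTextCNN(dim)
probs = model.forward(zh_feats, en_feats)
print(probs)  # class probabilities from the untrained model
```

In a real pipeline the stand-in matrices would be replaced by the hidden states of the two pre-trained encoders, and the filter and output weights would be learned rather than drawn at random.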

    Published In

    ICBDT '22: Proceedings of the 5th International Conference on Big Data Technologies
    September 2022
    454 pages
    ISBN:9781450396875
    DOI:10.1145/3565291

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. BERT, ERNIE, Parallel corpus, TextCNN, Text feature fusion
    2. Short Text Classification

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICBDT 2022
