DOI: 10.1145/3565291.3565328
Research article

Research on parallel corpus classification based on pre-trained model

Published: 16 December 2022

ABSTRACT

At present, the data in most text classification tasks is in a single language, but in the scenario of a Chinese-English parallel corpus the information in both languages can be exploited. A classification model combining the text features of the pre-trained models ERNIE and BERT is proposed: ERNIE processes the Chinese corpus, BERT processes the English corpus, and TextCNN fuses the resulting text feature vectors, improving classification of the parallel corpus. Comparative experiments on the data set show that this method achieves a better classification effect on parallel corpora.
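The fusion idea described in the abstract can be sketched as follows. This is a hypothetical PyTorch illustration, not the authors' implementation: the ERNIE and BERT encoder outputs are stubbed with random tensors of BERT-base dimensionality, and the `TextCNNFusion` module, filter counts, and kernel sizes are assumptions chosen to match a standard TextCNN.

```python
# Hypothetical sketch: two pre-trained encoders (ERNIE for Chinese,
# BERT for English) each yield a sequence of token vectors; a TextCNN
# fuses the concatenated sequences and classifies the pair.
# Encoder outputs are stubbed with random tensors here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNFusion(nn.Module):
    def __init__(self, hidden=768, n_classes=10,
                 n_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # One 1-D convolution per kernel size, applied over the token axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, zh_feats, en_feats):
        # zh_feats, en_feats: (batch, seq_len, hidden) from ERNIE / BERT.
        x = torch.cat([zh_feats, en_feats], dim=1)  # join along token axis
        x = x.transpose(1, 2)                       # (batch, hidden, tokens)
        # Convolve, max-pool over time, then concatenate pooled features.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # class logits

# Stub encoder outputs: batch of 4, 32 Chinese + 40 English tokens.
zh = torch.randn(4, 32, 768)
en = torch.randn(4, 40, 768)
logits = TextCNNFusion()(zh, en)
print(logits.shape)  # torch.Size([4, 10])
```

In practice the stubbed tensors would be the `last_hidden_state` outputs of the two pre-trained encoders; concatenating along the token axis lets the convolution filters see features from both languages with a single shared filter bank.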


Published in

ICBDT '22: Proceedings of the 5th International Conference on Big Data Technologies
September 2022, 454 pages
ISBN: 9781450396875
DOI: 10.1145/3565291

      Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

