DOI: 10.1145/3582768.3582783
Research article · Open access

Random Division: An Effective Method for Chinese Text Classification

Published: 27 June 2023

Abstract

As a fundamental part of natural language processing, text classification underpins downstream tasks and applications such as machine translation. Among all languages, Chinese text classification is one of the most challenging because of the complex structures and expressions inherent to the language. Training and tuning a model generally requires a significant amount of data, a demand that in practice often cannot be met. Given these circumstances, we propose an effective data-augmentation technique that lowers the demand for data. The central principle is as follows: tokenize the text, group the resulting tokens and word vectors at a chosen density level (e.g., five words per group), randomize the groups, and use the randomized results as model input. This process generates a considerable number of data variations, easing the demand for data. We tested the method on multiple Chinese natural language processing datasets and observed improved model performance across all of them, supporting its validity.
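The grouping-and-randomization step described above can be sketched in code. The following is a minimal illustration of one plausible reading of the method, not the authors' implementation: tokens are chunked at a fixed density level and the chunk order is shuffled to produce augmented variants. The function name `random_division`, the parameter names, and the example token list are all assumptions for illustration; the paper applies the same idea to word vectors as well.

```python
import random

def random_division(tokens, density=5, n_variants=3, seed=0):
    """Group tokens into chunks of `density` words, then shuffle the
    chunk order to generate augmented variants of the input sequence.

    Each variant contains exactly the same tokens as the original;
    only the order of the word groups changes.
    """
    rng = random.Random(seed)
    # Split the token sequence into consecutive groups of `density` words.
    groups = [tokens[i:i + density] for i in range(0, len(tokens), density)]
    variants = []
    for _ in range(n_variants):
        shuffled = groups[:]
        rng.shuffle(shuffled)
        # Flatten the shuffled groups back into a single token sequence.
        variants.append([tok for grp in shuffled for tok in grp])
    return variants

# Hypothetical usage with a pre-tokenized Chinese sentence.
tokens = ["我们", "提出", "一种", "有效", "的", "数据", "增强", "方法", "用于", "文本", "分类"]
for variant in random_division(tokens, density=5):
    print("".join(variant))
```

Because each variant is a permutation of word groups rather than of individual words, local word order within a group is preserved, which is presumably why a density level (rather than full token shuffling) is used.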



Published In

NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval
December 2022, 241 pages
ISBN: 9781450397629
DOI: 10.1145/3582768

Publisher

Association for Computing Machinery, New York, NY, United States

        Author Tags

1. Chinese word segmentation
2. Data enhancement
3. Natural language processing

