ABSTRACT
Text classification is a fundamental task in natural language processing and underpins applications such as machine translation. Among the text classification tasks across languages, Chinese is one of the most challenging because of the complex structures and expressions inherent in the language. Training and tuning a model generally requires a large amount of data, yet in practice that amount is often unavailable. Given these circumstances, we propose an effective data augmentation technique that lowers the demand for data. The central principle is as follows: after tokenizing the text, divide the resulting tokens and their word vectors into groups at a fixed density level (e.g., five words per group), randomize the order of these groups, and use the randomized results as model input. This process generates a considerable number of data variations from the same text, easing the demand for data. We tested the method on multiple Chinese natural language processing datasets and observed improvements in model performance across all of them, supporting the validity of the proposed method.
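The grouping-and-shuffling step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `random_division` and parameters `group_size`, `n_variants`, and `seed` are assumptions, and the example uses character-level tokens for simplicity where the paper operates on segmented words and their vectors.

```python
import random

def random_division(tokens, group_size=5, n_variants=3, seed=0):
    """Sketch of the paper's augmentation idea: split a token sequence
    into fixed-size groups (the "density level"), then shuffle the group
    order to produce multiple input variants from one text.
    Names and parameters are illustrative, not from the paper."""
    rng = random.Random(seed)
    # Partition tokens into consecutive groups of `group_size`.
    groups = [tokens[i:i + group_size]
              for i in range(0, len(tokens), group_size)]
    variants = []
    for _ in range(n_variants):
        shuffled = groups[:]          # copy so each variant shuffles fresh
        rng.shuffle(shuffled)         # randomize group order
        # Flatten the shuffled groups back into one token sequence.
        variants.append([tok for g in shuffled for tok in g])
    return variants

# Toy usage: character-level tokens of a short Chinese sentence.
tokens = list("我喜欢自然语言处理因为它很有趣")
for v in random_division(tokens, group_size=5, n_variants=2):
    print("".join(v))
```

Because only group order changes, each variant preserves the local word context inside a group while presenting the model with a new global arrangement, which is how one text yields several training samples.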
Index Terms
- Random Division: An Effective Method for Chinese Text Classification