ABSTRACT
Text classification is a fundamental task in natural language processing and underpins applications such as machine translation. Among the text classification tasks across languages, Chinese is one of the most challenging because of the complex structures and expressions inherent in the language. Training and tuning a model generally requires a large amount of data, yet in practice that amount is often unavailable. Given these circumstances, we propose an effective data augmentation technique that lowers the demand for data. The central principle is as follows: after tokenizing the text, divide the resulting tokens and their word vectors into groups at a fixed density level (e.g., five words per group), randomize the order of these groups, and use the randomized results as model input. This process generates a considerable number of data variations from the same text, easing the demand for data. We tested the method on multiple Chinese natural language processing datasets and observed improvements in model performance across all of them, supporting the validity of the proposed method.
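The grouping-and-shuffling step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `random_division` and parameters `group_size`, `n_variants`, and `seed` are assumptions, and the example uses character-level tokens for simplicity where the paper operates on segmented words and their vectors.

```python
import random

def random_division(tokens, group_size=5, n_variants=3, seed=0):
    """Sketch of the paper's augmentation idea: split a token sequence
    into fixed-size groups (the "density level"), then shuffle the group
    order to produce multiple input variants from one text.
    Names and parameters are illustrative, not from the paper."""
    rng = random.Random(seed)
    # Partition tokens into consecutive groups of `group_size`.
    groups = [tokens[i:i + group_size]
              for i in range(0, len(tokens), group_size)]
    variants = []
    for _ in range(n_variants):
        shuffled = groups[:]          # copy so each variant shuffles fresh
        rng.shuffle(shuffled)         # randomize group order
        # Flatten the shuffled groups back into one token sequence.
        variants.append([tok for g in shuffled for tok in g])
    return variants

# Toy usage: character-level tokens of a short Chinese sentence.
tokens = list("我喜欢自然语言处理因为它很有趣")
for v in random_division(tokens, group_size=5, n_variants=2):
    print("".join(v))
```

Because only group order changes, each variant preserves the local word context inside a group while presenting the model with a new global arrangement, which is how one text yields several training samples.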
Index Terms
- Random Division: An Effective Method for Chinese Text Classification