ABSTRACT
To address segmentation ambiguity and the low success rate of new-word discovery in Chinese word segmentation, this paper proposes a segmentation method that combines a dictionary with a Hidden Markov Model (HMM). A forward maximum matching pass and a backward maximum matching pass produce the coarse segmentation results; the ambiguous fragments are then collected and passed to the HMM. The HMM re-segments these fragments through word order tagging, identifies new words, and adds them to the dictionary so that the dictionary improves over time. Experimental results show that the proposed algorithm raises the success rate of ambiguity recognition and new-word discovery, improves the accuracy, recall, and F1 value of segmentation on ordinary text, and alleviates the decline in segmentation ability that Jieba shows on professional text.
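The coarse segmentation stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy dictionary, the `max_len` window, and the rule that a fragment is "ambiguous" when the forward and backward results disagree between two shared cut points are all assumptions made for illustration, since the abstract does not specify how ambiguous fragments are detected. The HMM stage is not shown.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Scan left to right, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def cut_points(words: List[str]) -> Set[int]:
    """Character offsets at which a segmentation places a word boundary."""
    cuts = {0}
    pos = 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    return cuts


def ambiguous_fragments(text: str, fwd: List[str], bwd: List[str]) -> List[str]:
    """Assumed rule: a span between two shared cut points is ambiguous
    when the two segmentations cut it differently inside."""
    f, b = cut_points(fwd), cut_points(bwd)
    shared = sorted(f & b)
    fragments = []
    for start, end in zip(shared, shared[1:]):
        inner_f = {c for c in f if start < c < end}
        inner_b = {c for c in b if start < c < end}
        if inner_f != inner_b:
            fragments.append(text[start:end])
    return fragments


# Hypothetical toy dictionary and sentence, chosen only to trigger a disagreement.
toy_dict = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"
fwd = forward_max_match(sentence, toy_dict, max_len=3)
bwd = backward_max_match(sentence, toy_dict, max_len=3)
print(fwd, bwd, ambiguous_fragments(sentence, fwd, bwd))
```

In this toy case the forward pass greedily takes 研究生 while the backward pass takes 生命, so the span 研究生命 is the only fragment that would be handed to a second-stage model; the rest of the sentence is segmented identically by both passes.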