research-article

A Chinese word segmentation method based on dictionary and HMM

Authors:

Chunling Liu
College of Information Engineering, Dalian University, China
0000-0003-1484-3896

Qizhen Zhang
College of Information Engineering, Dalian University, China
0000-0002-3397-3955

Jinlong Feng
College of Information Engineering, Dalian University, China
0000-0002-4685-1375

Yuqi Tian
College of Information Engineering, Dalian University, China
0000-0002-0018-7392

EITCE '22: Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
October 2022
Pages 644–649
https://doi.org/10.1145/3573428.3573542

Published: 15 March 2023

ABSTRACT

To address ambiguous segmentation and the low success rate of new-word discovery in Chinese word segmentation, this paper proposes a segmentation method that combines a dictionary with a Hidden Markov Model (HMM). The forward maximum matching and backward maximum matching algorithms first produce coarse segmentation results; the ambiguous fragments are then collected and passed to the HMM. The HMM performs a second segmentation pass through word order tagging, identifies new words, and adds them to the dictionary to extend it. The experimental results show that the proposed algorithm raises the success rate of ambiguity recognition and new-word discovery, improves the accuracy, recall and F1 value of segmentation on ordinary text, and mitigates the decline in Jieba's segmentation ability on professional text.
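
The coarse segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary, the `max_len` window, the single-character fallback, and the rule that a fragment is "ambiguous" when the forward and backward results disagree between two shared word boundaries are all assumptions made for illustration. The function names are hypothetical as well.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unmatched single character is emitted as its own word.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left matching: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    cuts, pos = set(), 0
    for word in words:
        pos += len(word)
        cuts.add(pos)
    return cuts


def ambiguous_fragments(text, dictionary, max_len=4):
    """Collect the spans where the two coarse segmentations disagree (assumed criterion)."""
    fwd = boundaries(forward_max_match(text, dictionary, max_len))
    bwd = boundaries(backward_max_match(text, dictionary, max_len))
    disputed = fwd ^ bwd          # boundaries proposed by only one direction
    fragments, start = [], 0
    for end in sorted(fwd & bwd):  # boundaries both directions agree on
        if any(start < cut < end for cut in disputed):
            fragments.append(text[start:end])
        start = end
    return fragments


if __name__ == "__main__":
    toy_dict = {"研究", "研究生", "生命", "的", "起源"}  # illustrative dictionary only
    sentence = "研究生命的起源"
    print(forward_max_match(sentence, toy_dict))    # ['研究生', '命', '的', '起源']
    print(backward_max_match(sentence, toy_dict))   # ['研究', '生命', '的', '起源']
    print(ambiguous_fragments(sentence, toy_dict))  # ['研究生命']
```

In this sketch only the disputed fragment (here "研究生命") would be handed to a second-stage model such as the HMM, while the spans on which both directions agree are kept as they are.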


Published in

EITCE '22: Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
October 2022
1999 pages
ISBN:9781450397148
DOI:10.1145/3573428

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 508 of 972 submissions, 52%