DOI: 10.1145/3377170.3377205
research-article

Domain Neural Chinese Word Segmentation with Mutual Information and Entropy

Published: 20 March 2020

Abstract

Chinese word segmentation (CWS) is a fundamental task in NLP. However, a segmentation model trained on a general-domain corpus performs significantly worse when it is applied to a specific domain. To address the characteristics of domain segmentation, this paper uses a domain corpus as training data and proposes a method that combines a terminology dictionary, new word detection, and Bi-LSTM-CRF segmentation to mitigate the out-of-vocabulary (OOV) problem. Segmentation experiments were carried out on a corpus from the automotive domain. The results show that precision, recall, and F1 all reach 0.95, which the authors report as better than the state-of-the-art method. The method can also be combined with N-gram and chi-square statistics to further improve OOV recognition accuracy.
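
The abstract does not give the formulas behind the new word detection step, so the following is only a minimal sketch of one common formulation suggested by the title: pointwise mutual information (PMI) measures how strongly the characters of a candidate stick together, and left/right branching entropy measures how freely the candidate combines with its neighbors. The function names `entropy` and `score_candidates`, the `max_len` parameter, and the probability estimates are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def score_candidates(text, max_len=4):
    """Score every character n-gram of length 2..max_len found in text.

    Hypothetical helper: returns {gram: {"pmi", "left_entropy", "right_entropy"}}.
    """
    n = len(text)
    # Count all n-grams of length 1..max_len.
    counts = Counter(
        text[i:i + k] for k in range(1, max_len + 1) for i in range(n - k + 1)
    )
    # Collect the characters seen immediately left and right of each candidate.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for k in range(2, max_len + 1):
        for i in range(n - k + 1):
            gram = text[i:i + k]
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    def prob(gram):
        # Relative frequency among all n-grams of the same length.
        return counts[gram] / (n - len(gram) + 1)

    scores = {}
    for gram in counts:
        if len(gram) < 2:
            continue
        # Cohesion is judged at the weakest internal split point.
        pmi = min(
            math.log2(prob(gram) / (prob(gram[:j]) * prob(gram[j:])))
            for j in range(1, len(gram))
        )
        scores[gram] = {
            "pmi": pmi,
            "left_entropy": entropy(left[gram]),
            "right_entropy": entropy(right[gram]),
        }
    return scores
```

Under this sketch, a candidate with high PMI and high entropy on both sides would be a plausible new word to add to the domain dictionary; the thresholds for "high" are corpus-dependent and are not specified in the abstract.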



    Published In

    ICIT '19: Proceedings of the 2019 7th International Conference on Information Technology: IoT and Smart City
    December 2019
    601 pages
    ISBN:9781450376631
    DOI:10.1145/3377170

    In-Cooperation

    • Shanghai Jiao Tong University
    • The Hong Kong Polytechnic University
    • University of Malaya

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Bi-LSTM-CRF
    2. Chinese Word Segmentation
    3. Entropy
    4. Mutual Information
    5. OOV

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIT 2019
    ICIT 2019: IoT and Smart City
    December 20 - 23, 2019
    Shanghai, China
