A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth

Jia, Yalu; Liu, Lei; Chen, Hao; Sun, Yinghong

doi:10.1007/s10044-019-00833-z

Industrial and commercial application
Published: 11 July 2019

Volume 23, pages 1011–1020, (2020)
Pattern Analysis and Applications

Yalu Jia^1,2, Lei Liu^1,3 (ORCID: orcid.org/0000-0003-1544-693X), Hao Chen^1,3 & Yinghong Sun^1,3

Abstract

Unknown word recognition is important for improving the precision of text segmentation and syntactic analysis. Social networks have become an important platform for sharing, disseminating, and acquiring information, and unknown word recognition in micro-blog short text has become a research hot spot. However, micro-blog text contains a large number of nonstandard terms and network buzzwords, which makes unknown words harder to recognize. This paper proposes a Chinese unknown word recognition method for micro-blog short text based on an improved FP-growth algorithm (POS-FP). First, the POS-FP algorithm extracts frequent itemsets from micro-blog text, and an N-gram model selects candidate unknown words from the frequent itemsets. Second, improved mutual information and left–right information entropy are used to verify the internal features of the candidate unknown words. Then, context-dependent and open-source methods are used for external verification of the candidates. Compared with traditional methods, this algorithm improves the recognition rate of unknown words in micro-blog short texts.
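
The internal verification step mentioned in the abstract can be illustrated with the standard, unimproved forms of the two statistics. The sketch below is not the authors' POS-FP method or their improved mutual information: it computes plain pointwise mutual information over two-way splits of a candidate string, and the entropy of the characters that appear immediately to its left and right. The function names, the toy corpus, and the choice to normalise every n-gram count by the total character count are assumptions made only for illustration.

```python
import math
from collections import Counter


def ngram_counts(texts, max_n):
    """Count every character n-gram of length 1..max_n in the corpus."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def min_split_pmi(candidate, counts, total_chars):
    """Lowest pointwise mutual information over all two-way splits.

    Assumes len(candidate) >= 2 and that counts covers n-grams at least as
    long as the candidate. Probabilities are approximated as
    count / total_chars (an illustrative simplification).
    """
    if counts[candidate] == 0:
        return float("-inf")  # candidate never occurs in the corpus
    p_whole = counts[candidate] / total_chars
    scores = []
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def neighbor_entropy(candidate, texts, side):
    """Entropy (in bits) of the characters adjacent to the candidate.

    side is "left" or "right". Occurrences at a text boundary contribute
    no neighbour.
    """
    neighbors = Counter()
    n = len(candidate)
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            idx = start - 1 if side == "left" else start + n
            if 0 <= idx < len(text):
                neighbors[text[idx]] += 1
            start = text.find(candidate, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


# Hypothetical toy corpus; "给力" is a well-known network buzzword.
texts = ["我爱给力的人", "他很给力啊", "真给力呢", "不给力吗"]
counts = ngram_counts(texts, max_n=2)
total_chars = sum(len(t) for t in texts)

pmi = min_split_pmi("给力", counts, total_chars)
left_h = neighbor_entropy("给力", texts, "left")
right_h = neighbor_entropy("给力", texts, "right")
```

In this toy corpus the candidate occurs four times with four different characters on each side, so both neighbour entropies are 2 bits. A candidate whose parts cohere strongly (high mutual information) and whose context varies freely (high entropy on both sides) is more likely to be a standalone word. Any thresholds applied to these scores would have to be tuned on real data.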

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61105040, 61203284), the Beijing Natural Science Foundation (Grant No. 4133085), and the general program of the science and technology development project of the Beijing Municipal Education Commission (Grant No. KM201810005005).

Author information

Authors and Affiliations

College of Applied Sciences, Beijing University of Technology, Beijing, China
Yalu Jia, Lei Liu, Hao Chen & Yinghong Sun
Taiji Computer Co., Ltd, Beijing, China
Yalu Jia
Beijing Institute for Scientific and Engineering Computing, Beijing University of Technology, Beijing, China
Lei Liu, Hao Chen & Yinghong Sun

Corresponding author

Correspondence to Lei Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jia, Y., Liu, L., Chen, H. et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth. Pattern Anal Applic 23, 1011–1020 (2020). https://doi.org/10.1007/s10044-019-00833-z

Received: 17 February 2019
Accepted: 17 June 2019
Published: 11 July 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10044-019-00833-z
