New Word Detection and Tagging on Chinese Twitter Stream

Liang, Yuzhi; Yin, Pengcheng; Yiu, S. M.

doi:10.1007/978-3-319-22729-0_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9263)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery

Abstract

Twitter has become one of the critical channels for disseminating up-to-date information. Because the volume of tweets can be huge, an automatic system for analyzing them is desirable. The obstacle is that Twitter users often coin new words using non-standard rules, and these words appear in bursts within a short period of time. Existing new word detection methods cannot identify them effectively, and even when the new words are identified, their meanings are difficult to understand. In this paper, we focus on Chinese Twitter, where the problem is harder because Chinese sentences have no natural word delimiters. To solve the problem, we derive an unsupervised new word detection framework that does not rely on training data. We then apply automatic tagging to new word annotation, tagging each new word with known words according to our proposed tagging algorithm.

Y. Liang and P. Yin—These two authors contributed equally to this work.

Notes

1. Here 15 is an empirically chosen value; it could instead be estimated from statistical features such as the mean and standard deviation of the frequencies of all character sequences.
2. TF-IDF is a numerical statistic that indicates the importance of a given word in a corpus. The score is TF \(\times\) IDF, where TF is the term frequency (a normalized term count) and IDF is the inverse document frequency, which is derived from the inverse of the proportion of documents in the corpus containing \(w_i\).
3. \(Sim_{ccs} = \frac{Sim_{rawccs}-Min_{rawccs}}{Max_{rawccs}-Min_{rawccs}}\).
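
The quantities in notes 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `tf_idf` and `min_max` are hypothetical, TF is assumed to be the term count divided by document length, and IDF is assumed to use the common logarithmic form log(N / df), which the note does not specify.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc` (a list of tokens) against `corpus` (a list of docs).

    Assumption: TF = count / document length, IDF = log(N / df).
    """
    tf = Counter(doc)[term] / len(doc)        # normalized term count
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


def min_max(raw_scores):
    """Rescale raw similarity scores to [0, 1], as in the formula of note 3."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # All scores are equal; the formula is undefined, so return zeros by convention.
        return [0.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]
```

Under these assumptions, a term that occurs in every document receives a score of zero, and the smallest and largest raw similarity values map to 0 and 1 respectively.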

Author information

Authors and Affiliations

Department of Computer Science, The University of Hong Kong, Hong Kong, China
Yuzhi Liang, Pengcheng Yin & S. M. Yiu

Corresponding author

Correspondence to Yuzhi Liang.

Editor information

Editors and Affiliations

Missouri University of Science and Technology, Rolla, Missouri, USA
Sanjay Madria
Osaka University, Osaka, Japan
Takahiro Hara

About this paper

Cite this paper

Liang, Y., Yin, P., Yiu, S.M. (2015). New Word Detection and Tagging on Chinese Twitter Stream. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2015. Lecture Notes in Computer Science, vol 9263. Springer, Cham. https://doi.org/10.1007/978-3-319-22729-0_24

DOI: https://doi.org/10.1007/978-3-319-22729-0_24
Published: 05 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22728-3
Online ISBN: 978-3-319-22729-0
eBook Packages: Computer Science, Computer Science (R0)