Abstract
Real-time communication platforms such as ICQ, MSN and online chat rooms are more popular than ever on the Internet. They also carry real risks, because criminals and terrorists can use them for illegal activities. This highlights the security significance of accurately detecting chat language and translating it into its standard language counterpart. The language used on these platforms differs significantly from the standard language. This language, referred to as chat language, is comparatively informal, anomalous and dynamic. Such features render conventional language resources such as dictionaries, and processing tools such as parsers, ineffective. In this paper, we present the NIL corpus, a chat language text collection annotated to facilitate the training and testing of chat language processing algorithms. We analyse the NIL corpus to study the linguistic characteristics and contextual behaviour of a chat language. We first observe that the majority of chat terms, i.e. informal words in a chat text, are formed by phonetic mapping. We then propose the eXtended Source Channel Model (XSCM) for the normalization of chat language, that is, the process of converting messages expressed in a chat language into their standard language counterparts. Experimental results indicate that the performance of XSCM in terms of chat term recognition and normalization accuracy is superior to that of its Source Channel Model (SCM) counterparts, and is also more consistent over time.
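As a rough illustration of the source channel idea named above, the normalization task can be written in the usual noisy-channel form of Brown et al. (1990). The symbols here are assumptions introduced for this sketch and are not taken from the abstract: \(T\) is the observed chat text, \(C\) a candidate standard language text, and \(M\) a phonetic mapping between the two.

\[ \hat{C} = \arg\max_{C} p(C \mid T) = \arg\max_{C} p(T \mid C)\, p(C) \]

One plausible way to extend this with phonetic mappings, offered as an assumption rather than as the definition used in the body of the paper, is to introduce \(M\) as an additional variable:

\[ \hat{C} = \arg\max_{C,\,M} p(T \mid M, C)\, p(M \mid C)\, p(C) \]

In both forms \(p(C)\) acts as a language model over standard text, while the remaining factors describe how standard text is distorted into chat text.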






Notes
Unless stated otherwise, both "NIL corpus" and "chat language corpus" refer to NIL corpus 2.0 hereafter.
References
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.
Cheng, C. (2004). Network language: Advance or degeneration of Chinese language? http://www.tech.163.com/special/w/wlyy.html.
Gao, W., Wong, K.-F., & Lam, W. (2004). Phoneme-based transliteration of foreign names for OOV problem. In Proceedings of International Joint Conference on Natural Language Processing (IJCNLP’04), Sanya, China, 22–24 March, pp. 110–119.
Gianforte, G. (2003). From call center to contact center: How to successfully blend phone, email, web and chat to deliver great service and slash costs. RightNow Technologies.
Graf, D., Chen, K., Kong, J., & Maeda, K. (2005). Chinese gigaword (2nd ed.). LDC Catalog Number LDC2005T14.
Heard-White, M., Saunders, G., & Pincas, A. (2004). Report into the use of CHAT in education. Final report for project of Effective use of CHAT in Online Learning. Institute of Education, University of London.
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), 400–401.
Li, H., He, W., & Yuan, B. (2003). A kind of Chinese text strings’ similarity and its application in speech recognition. Journal of Chinese Information Processing, 17(1), 60–64.
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
McCullagh, D. (2004). Security officials to spy on chat rooms. News provided by CNET Networks, 24 November, 2004.
Metcalf, A. (2002). Predicting new words: The secrets of their success. Houghton Mifflin.
Xia, Y., & Wong, K.-F. (2006). Anomaly detecting within dynamic Chinese chat text. In Proceedings of NEW TEXT Workshop at the 11th Conference for European Chapter of the Association for Computational Linguistics (EACL’06), Trento, Italy, 3–7 April, pp. 48–55.
Xia, Y., Wong, K.-F., & Gao, W. (2005). NIL is not nothing: Recognition of Chinese network informal language expressions. In Proceedings of 4th SIGHAN Workshop at International Joint Conference on Natural Language Processing (IJCNLP’05), Jeju Island, Republic of Korea, 11–13 October, pp. 95–102.
Xia, Y., Wong, K.-F., & Li, W. (2006a). Constructing a Chinese chat text corpus with a two-stage incremental annotation approach. In Proceedings of The 5th International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 24–26 May.
Xia, Y., Wong, K.-F., & Li, W. (2006b). A phonetic based approach to Chinese chat term normalization. In Proceedings of COLING/ACL Joint Conference, Sydney, Australia, 17–21 July, Vol. 2, pp. 993–1000.
Zhang, Z., Yu, H., Xiong, D., & Liu, Q. (2003). HMM-based Chinese lexical analyzer ICTCLAS. In The 2nd SIGHAN Workshop Affiliated with ACL’2003, Sapporo, Japan, 11–12 July, pp. 184–187.
Acknowledgement
The research described in this paper is partially supported by The Chinese University of Hong Kong under the Direct Grant Scheme projects (Nos. 2050330 and 2050417), the Strategic Grant Scheme project (No. 4410001) and NSFC (No. 60703051). We would also like to thank the reviewers for their valuable advice on this paper.
Additional information
This is an extension of the paper presented at COLING/ACL 2006 (Xia et al. 2006b).
Appendix 1: Some categorized examples of phonetic mappings
1. Chinese to Chinese phonetic mappings
(1) 偶 \( \mathop{\longrightarrow}\limits^{{(wo,ou,0.685)}} \) 我: 偶 (even; ou3) replaces 我 (me; wo3) with p = 0.685.
(2) 介 \( \mathop{\longrightarrow}\limits^{{(zhe,jie,0.560)}} \) 这: 介 (interrupt; jie4) replaces 这 (this; zhe4) with p = 0.560.
(3) 素 \( \mathop{\longrightarrow}\limits^{{(shi,su,0.491)}} \) 是: 素 (white; su4) replaces 是 (is; shi4) with p = 0.491.
(4) 银 \( \mathop{\longrightarrow}\limits^{{(ren,yin,0.457)}} \) 人: 银 (silver; yin2) replaces 人 (human; ren2) with p = 0.457.
(5) 米 \( \mathop{\longrightarrow}\limits^{{(mei,mi,0.452)}} \) 没: 米 (rice; mi3) replaces 没 (have not; mei2) with p = 0.452.
2. Letter to Chinese phonetic mappings
(6) J \( \mathop{\longrightarrow}\limits^{{(jie,ji,0.671)}} \) 姐: J replaces 姐 (older sister; jie3) with p = 0.671.
(7) M \( \mathop{\longrightarrow}\limits^{{(mei,mi,0.593)}} \) 妹: M replaces 妹 (younger sister; mei4) with p = 0.593.
(8) S \( \mathop{\longrightarrow}\limits^{{(si,si,0.587)}} \) 死: S replaces 死 (die; si3) with p = 0.587.
(9) T \( \mathop{\longrightarrow}\limits^{{(ti,ti,0.465)}} \) 踢: T replaces 踢 (kick; ti1) with p = 0.465.
(10) K \( \mathop{\longrightarrow}\limits^{{(kuai,ki,0.447)}} \) 快: K replaces 快 (quick; kuai4) with p = 0.447.
3. Number to Chinese phonetic mappings
(11) 9 \( \mathop{\longrightarrow}\limits^{{(jiu,jiu,0.541)}} \) 酒: 9 replaces 酒 (wine; jiu3) with p = 0.541.
(12) 8 \( \mathop{\longrightarrow}\limits^{{(bu,ba,0.519)}} \) 不: 8 replaces 不 (no; bu4) with p = 0.519.
(13) 7 \( \mathop{\longrightarrow}\limits^{{(chi,qi,0.454)}} \) 吃: 7 replaces 吃 (eat; chi1) with p = 0.454.
(14) 4 \( \mathop{\longrightarrow}\limits^{{(si,si,0.449)}} \) 死: 4 replaces 死 (die; si3) with p = 0.449.
(15) 5 \( \mathop{\longrightarrow}\limits^{{(wu,wu,0.297)}} \) 呜: 5 replaces 呜 (crying sound; wu1) with p = 0.297.
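The mappings listed in this appendix can be read as a small lookup table from chat term to standard character. The following Python sketch is only an illustration of that reading and is not the XSCM described in the paper: the names `PHONETIC_MAPPINGS` and `normalize` and the `threshold` parameter are hypothetical, and the sketch ignores context entirely, so it would wrongly rewrite a digit such as 8 even where it is meant as a number.

```python
from typing import Dict, Tuple

# Chat term -> (standard character, mapping probability p), copied from Appendix 1.
PHONETIC_MAPPINGS: Dict[str, Tuple[str, float]] = {
    # Chinese to Chinese
    "偶": ("我", 0.685), "介": ("这", 0.560), "素": ("是", 0.491),
    "银": ("人", 0.457), "米": ("没", 0.452),
    # Letter to Chinese
    "J": ("姐", 0.671), "M": ("妹", 0.593), "S": ("死", 0.587),
    "T": ("踢", 0.465), "K": ("快", 0.447),
    # Number to Chinese
    "9": ("酒", 0.541), "8": ("不", 0.519), "7": ("吃", 0.454),
    "4": ("死", 0.449), "5": ("呜", 0.297),
}


def normalize(text: str, threshold: float = 0.5) -> str:
    """Replace each character whose mapping probability reaches `threshold`.

    Hypothetical helper: a context-free, character-by-character lookup.
    Characters without a listed mapping are copied through unchanged.
    """
    out = []
    for ch in text:
        target, p = PHONETIC_MAPPINGS.get(ch, (ch, 0.0))
        out.append(target if p >= threshold else ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize("偶素学生"))                  # 素 stays: p = 0.491 < 0.5
    print(normalize("偶素学生", threshold=0.4))   # both 偶 and 素 are replaced
```

Lowering `threshold` replaces more chat terms at the cost of more false replacements, which is the kind of trade-off a context-aware model is meant to resolve.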
Cite this article
Wong, KF., Xia, Y. Normalization of Chinese chat language. Lang Resources & Evaluation 42, 219–242 (2008). https://doi.org/10.1007/s10579-008-9067-7
DOI: https://doi.org/10.1007/s10579-008-9067-7