Normalization of Chinese chat language

Language Resources and Evaluation

Abstract

Real-time communication platforms such as ICQ, MSN and online chat rooms are more popular than ever on the Internet. They also carry real risks, since criminals and terrorists can use them for illegal activities, which makes accurate detection of chat language and its translation into standard language significant for security. The language used on these platforms differs significantly from the standard language. This language, referred to as chat language, is comparatively informal, anomalous and dynamic. Such features render conventional language resources such as dictionaries, and processing tools such as parsers, ineffective. In this paper, we present the NIL corpus, a chat language text collection annotated to facilitate training and testing of chat language processing algorithms. We analyse the NIL corpus to study the linguistic characteristics and contextual behaviour of chat language. We first observe that the majority of chat terms, i.e. informal words in a chat text, are formed by phonetic mapping. We then propose the eXtended Source Channel Model (XSCM) for the normalization of chat language, the process of converting messages expressed in chat language into their standard language counterparts. Experimental results indicate that XSCM outperforms its Source Channel Model (SCM) counterparts in chat term recognition and normalization accuracy, and that its performance is more consistent over time.
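
As a point of reference, the conventional source channel (noisy channel) formulation, in the style of Brown et al. (1990), can be sketched as follows. The symbols are assumptions introduced here for illustration: \(C\) denotes an observed chat language message and \(T\) a candidate standard language message.

\[ \hat{T} = \arg\max_{T} P(T \mid C) = \arg\max_{T} P(C \mid T)\, P(T) \]

The second equality follows from Bayes' rule, since \(P(C)\) does not depend on \(T\). Here \(P(T)\) plays the role of a language model over standard text and \(P(C \mid T)\) that of a channel model scoring how likely \(T\) is to be written as \(C\). The abstract does not state how XSCM extends this form.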

Figs. 1–6

Notes

  1. Unless stated otherwise, "NIL corpus" and "chat language corpus" both refer to NIL corpus 2.0 hereafter.

References

  • Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.

  • Cheng, C. (2004). Network language: Advance or degeneration of Chinese language? http://www.tech.163.com/special/w/wlyy.html.

  • Gao, W., Wong, K.-F., & Lam, W. (2004). Phoneme-based transliteration of foreign names for OOV problem. In Proceedings of International Joint Conference on Natural Language Processing (IJCNLP’04), Sanya, China, 22–24 March, pp. 110–119.

  • Gianforte, G. (2003). From call center to contact center: How to successfully blend phone, email, web and chat to deliver great service and slash costs. RightNow Technologies.

  • Graf, D., Chen, K., Kong, J., & Maeda, K. (2005). Chinese gigaword (2nd ed.). LDC Catalog Number LDC2005T14.

  • Heard-White, M., Saunders, G., & Pincas, A. (2004). Report into the use of CHAT in education. Final report for project of Effective use of CHAT in Online Learning. Institute of Education, University of London.

  • Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), 400–401.

  • Li, H., He, W., & Yuan, B. (2003). A kind of Chinese text strings’ similarity and its application in speech recognition. Journal of Chinese Information Processing, 17(1), 60–64.

  • Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

  • McCullagh, D. (2004). Security officials to spy on chat rooms. News provided by CNET Networks, 24 November, 2004.

  • Metcalf, A. (2002). Predicting new words: The secrets of their success. Houghton Mifflin.

  • Xia, Y., & Wong, K.-F. (2006). Anomaly detecting within dynamic Chinese chat text. In Proceedings of NEW TEXT Workshop at the 11th Conference for European Chapter of the Association for Computational Linguistics (EACL’06), Trento, Italy, 3–7 April, pp. 48–55.

  • Xia, Y., Wong, K.-F., & Gao, W. (2005). NIL is not nothing: Recognition of Chinese network informal language expressions. In Proceedings of 4th SIGHAN Workshop at International Joint Conference on Natural Language Processing (IJCNLP’05), Jeju Island, Republic of Korea, 11–13 October, pp. 95–102.

  • Xia, Y., Wong, K.-F., & Li, W. (2006a). Constructing a Chinese chat text corpus with a two-stage incremental annotation approach. In Proceedings of The 5th International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 24–26 May.

  • Xia, Y., Wong, K.-F., & Li, W. (2006b). A phonetic based approach to Chinese chat term normalization. In Proceedings of COLING/ACL Joint Conference, Sydney, Australia, 17–21 July, Vol. 2, pp. 993–1000.

  • Zhang, Z., Yu, H., Xiong, D., & Liu, Q. (2003). HMM-based Chinese lexical analyzer ICTCLAS. In The 2nd SIGHAN Workshop Affiliated with ACL’2003, Sapporo, Japan, 11–12 July, pp. 184–187.

Acknowledgement

The research described in this paper was partially supported by The Chinese University of Hong Kong under the Direct Grant Scheme projects (No. 2050330 and 2050417) and the Strategic Grant Scheme project (No. 4410001), and by NSFC (No. 60703051). We would also like to thank the reviewers for their valuable advice on this paper.

Author information

Corresponding author

Correspondence to Yunqing Xia.

Additional information

This is an extension of the paper presented at COLING/ACL 2006 (Xia et al. 2006b).

Appendix 1: Some categorized examples of phonetic mappings

  1. Chinese to Chinese phonetic mappings

    • (1) 偶 \( \mathop{\longrightarrow}\limits^{{(wo,ou,0.685)}} \) 我: 偶 (even; ou3) replaces 我 (me; wo3) with p = 0.685.

    • (2) 介 \( \mathop{\longrightarrow}\limits^{{(zhe,jie,0.56)}} \) 这: 介 (interrupt; jie4) replaces 这 (this; zhe4) with p = 0.560.

    • (3) 素 \( \mathop{\longrightarrow}\limits^{{(shi,su,0.491)}} \) 是: 素 (white; su4) replaces 是 (is; shi4) with p = 0.491.

    • (4) 银 \( \mathop{\longrightarrow}\limits^{{(ren,yin,0.457)}} \) 人: 银 (silver; yin2) replaces 人 (human; ren2) with p = 0.457.

    • (5) 米 \( \mathop{\longrightarrow}\limits^{{(mei,mi,0.452)}} \) 没: 米 (rice; mi3) replaces 没 (have not; mei2) with p = 0.452.

  2. Letter to Chinese phonetic mappings

    • (6) J \( \mathop{\longrightarrow}\limits^{{(jie,ji,0.671)}} \) 姐: J replaces 姐 (older sister; jie3) with p = 0.671.

    • (7) M \( \mathop{\longrightarrow}\limits^{{(mei,mi,0.593)}} \) 妹: M replaces 妹 (younger sister; mei4) with p = 0.593.

    • (8) S \( \mathop{\longrightarrow}\limits^{{(si,si,0.587)}} \) 死: S replaces 死 (die; si3) with p = 0.587.

    • (9) T \( \mathop{\longrightarrow}\limits^{{(ti,ti,0.465)}} \) 踢: T replaces 踢 (kick; ti1) with p = 0.465.

    • (10) K \( \mathop{\longrightarrow}\limits^{{(kuai,ki,0.447)}} \) 快: K replaces 快 (quick; kuai4) with p = 0.447.

  3. Number to Chinese phonetic mappings

    • (11) 9 \( \mathop{\longrightarrow}\limits^{{(jiu,jiu,0.541)}} \) 酒: 9 replaces 酒 (wine; jiu3) with p = 0.541.

    • (12) 8 \( \mathop{\longrightarrow}\limits^{{(bu,ba,0.519)}} \) 不: 8 replaces 不 (no; bu4) with p = 0.519.

    • (13) 7 \( \mathop{\longrightarrow}\limits^{{(chi,qi,0.454)}} \) 吃: 7 replaces 吃 (eat; chi1) with p = 0.454.

    • (14) 4 \( \mathop{\longrightarrow}\limits^{{(si,si,0.449)}} \) 死: 4 replaces 死 (die; si3) with p = 0.449.

    • (15) 5 \( \mathop{\longrightarrow}\limits^{{(wu,wu,0.297)}} \) 呜: 5 replaces 呜 (crying sound; wu1) with p = 0.297.
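
The fifteen mappings listed above can be held in a small lookup table. The following Python sketch is illustrative only and is not the XSCM model: the names `PHONETIC_MAPPINGS`, `candidates` and `normalize_naive` are hypothetical, the value p is treated simply as a score to threshold on, and replacement is done character by character without any context.

```python
from typing import Dict, List, Tuple

# Chat term -> [(standard character, p)], copied from the Appendix 1 examples.
# Treating p as a thresholdable score is an assumption made for this sketch.
PHONETIC_MAPPINGS: Dict[str, List[Tuple[str, float]]] = {
    # Chinese to Chinese
    "偶": [("我", 0.685)], "介": [("这", 0.560)], "素": [("是", 0.491)],
    "银": [("人", 0.457)], "米": [("没", 0.452)],
    # Letter to Chinese
    "J": [("姐", 0.671)], "M": [("妹", 0.593)], "S": [("死", 0.587)],
    "T": [("踢", 0.465)], "K": [("快", 0.447)],
    # Number to Chinese
    "9": [("酒", 0.541)], "8": [("不", 0.519)], "7": [("吃", 0.454)],
    "4": [("死", 0.449)], "5": [("呜", 0.297)],
}


def candidates(chat_term: str, threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Return (standard character, p) pairs with p >= threshold, best first."""
    matches = PHONETIC_MAPPINGS.get(chat_term, [])
    kept = [(std, p) for std, p in matches if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def normalize_naive(text: str, threshold: float = 0.5) -> str:
    """Replace each character whose best mapping has p >= threshold.

    Context is ignored, so legitimate uses of a character (for example
    米 meaning rice) would be rewritten whenever the threshold allows it.
    """
    out = []
    for ch in text:
        best = candidates(ch, threshold)
        out.append(best[0][0] if best else ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_naive("偶8知道"))                 # both 偶 and 8 clear the 0.5 threshold
    print(normalize_naive("偶8知道", threshold=0.6))  # only 偶 clears the 0.6 threshold
```

A context-free lookup of this kind cannot tell a chat term from an ordinary use of the same character, which is one reason a model that also scores the surrounding standard text is needed for actual normalization.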

About this article

Cite this article

Wong, KF., Xia, Y. Normalization of Chinese chat language. Lang Resources & Evaluation 42, 219–242 (2008). https://doi.org/10.1007/s10579-008-9067-7

  • DOI: https://doi.org/10.1007/s10579-008-9067-7
