DOI: 10.1145/2556195.2556228
research-article

Chinese-English mixed text normalization

Published: 24 February 2014

ABSTRACT

With the expansion of globalization, multilingualism has become a common social phenomenon. More than one language may occur within a single conversation. This phenomenon is also prevalent in China: many informal Chinese texts contain English words, especially in emails, social media, and other user-generated content. Because most existing natural language processing algorithms were designed for monolingual input, they analyze mixed multilingual text poorly. It is therefore critical to preprocess mixed text before applying other tasks. In this paper, we first analyze the mixed usage of Chinese and English in Chinese microblogs. We then detail the proposed two-stage method for normalizing mixed text. We use a noisy channel approach to translate in-vocabulary words into Chinese. To better incorporate users' historical information, we introduce a novel user-aware neural network language model. For out-of-vocabulary words (such as pronunciations and informal expressions), we use a graph-based unsupervised method to categorize them. Experimental results on a manually annotated microblog dataset demonstrate the effectiveness of the proposed method. We also evaluate three natural language parsers with and without the proposed method as a preprocessing step. The results show that the proposed method can significantly benefit other NLP tasks that process mixed text.
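The noisy channel step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the abstract does not give the formulation, so the sketch assumes the textbook form, picking the Chinese candidate c that maximizes P(e | c) · P(c | context) for an English word e. A toy bigram table stands in for the user-aware neural network language model, and the tables `CHANNEL` and `BIGRAM_LM`, their probabilities, and the function `normalize_word` are all hypothetical, made up for illustration.

```python
import math

# Hypothetical toy tables; every value is invented for illustration only.
# CHANNEL[c][e] approximates P(e | c): how likely Chinese word c surfaces as English word e.
CHANNEL = {
    "会议": {"meeting": 0.7},
    "见面": {"meeting": 0.3},
}

# BIGRAM_LM[(prev, c)] approximates P(c | prev). A bigram table is a stand-in here;
# the paper describes a user-aware neural network language model instead.
BIGRAM_LM = {
    ("开", "会议"): 0.05,
    ("开", "见面"): 0.001,
}


def normalize_word(english_word, previous_word, floor=1e-9):
    """Return the Chinese candidate maximizing log P(e | c) + log P(c | prev).

    Returns None when no candidate can produce the English word.
    """
    best, best_score = None, -math.inf
    for candidate, emissions in CHANNEL.items():
        p_channel = emissions.get(english_word, 0.0)
        if p_channel == 0.0:
            continue  # this candidate never surfaces as the given English word
        # Unseen bigrams fall back to a small floor probability.
        p_lm = BIGRAM_LM.get((previous_word, candidate), floor)
        score = math.log(p_channel) + math.log(p_lm)
        if score > best_score:
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    print(normalize_word("meeting", "开"))
```

Scoring in log space avoids underflow when the same idea is extended to whole sentences, where many small probabilities are multiplied together.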

    • Published in

      WSDM '14: Proceedings of the 7th ACM international conference on Web search and data mining
      February 2014
      712 pages
      ISBN: 9781450323512
      DOI: 10.1145/2556195

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      WSDM '14 Paper Acceptance Rate: 64 of 355 submissions, 18%
      Overall Acceptance Rate: 498 of 2,863 submissions, 17%
