Abstract
In this work, we describe a novel method for normalizing Chinese informal words to their standard equivalents. We formulate the task as an information extraction problem, using answers from Q&A communities as the source corpus. We propose several LSTM-based models for the extraction task. To evaluate and compare the performance of the proposed models, we develop a standard dataset containing factoids generated by real-world users in daily life. Since our method does not use any linguistic features, it is also applicable to other languages.
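As an illustration of what treating normalization as information extraction can look like, the sketch below assumes that some tagger (for example, an LSTM) labels each character of a Q&A answer with B, I, or O, and that the standard equivalent is read off from the labelled span. The abstract does not specify the output encoding, so the B/I/O scheme, the helper name `decode_spans`, and the hand-written tags are all assumptions made for this example, not the authors' implementation.

```python
from typing import List


def decode_spans(chars: List[str], tags: List[str]) -> List[str]:
    """Collect character runs labelled B/I into candidate strings.

    Hypothetical helper: the B/I/O scheme is an assumption, not taken
    from the paper.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    spans: List[str] = []
    current: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            # A new span starts; close any span that is still open.
            if current:
                spans.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            # Continue the open span.
            current.append(ch)
        else:
            # "O" (or a stray "I" with no open span) closes the span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


# Illustrative answer to a question about the informal word "神马".
# The tags are written by hand here; in practice a trained tagger
# would predict them.
answer = list("神马就是什么的意思")
tags = ["O", "O", "O", "O", "B", "I", "O", "O", "O"]
candidates = decode_spans(answer, tags)
print(candidates)
```

Decoding over characters rather than words avoids a dependency on a Chinese word segmenter, which is in keeping with the claim that the method uses no linguistic features.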
Notes
- 5. Our dataset is available at www.github.com/tiantian002/.
Acknowledgments
This work was supported by the 111 Project of China under Grant No. B08004, the National Natural Science Foundation of China (61273217, 61300080, 61671078), and the Ph.D. Programs Foundation of the Ministry of Education of China (20130005110004).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Tian, T., Xu, W. (2017). Chinese Lexical Normalization Based on Information Extraction: An Experimental Study. In: Lintas, A., Rovetta, S., Verschure, P., Villa, A. (eds) Artificial Neural Networks and Machine Learning – ICANN 2017. ICANN 2017. Lecture Notes in Computer Science, vol 10614. Springer, Cham. https://doi.org/10.1007/978-3-319-68612-7_25
DOI: https://doi.org/10.1007/978-3-319-68612-7_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68611-0
Online ISBN: 978-3-319-68612-7
eBook Packages: Computer Science, Computer Science (R0)