Chinese Lexical Normalization Based on Information Extraction: An Experimental Study

Tian, Tian; Xu, WeiRan

doi:10.1007/978-3-319-68612-7_25


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10614)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

In this work, we describe a novel method for normalizing Chinese informal words to their standard equivalents. We frame the task as an information extraction problem, using Q&A community answers as the source corpus. We propose several LSTM-based models for the extraction task. To evaluate and compare the performance of the proposed models, we developed a standard dataset containing factoids generated by real-world users in daily life. Since our method does not use any linguistic features, it is also applicable to other languages.
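The abstract does not say how the extraction step is posed. One common way to treat such a problem, assumed here purely for illustration, is character-level sequence tagging: a tagger labels each character of a Q&A answer as beginning (B), inside (I), or outside (O) the formal equivalent, and the tagged span is then read off. The sketch below shows only that final decoding step. The function name `extract_spans`, the B/I/O scheme, the example sentence, and the hand-written tags (standing in for the output of an LSTM tagger) are all assumptions and are not taken from the paper or its dataset.

```python
from typing import List


def extract_spans(chars: List[str], tags: List[str]) -> List[str]:
    """Collect character spans marked B/I; O marks everything else.

    Hypothetical helper: the B/I/O scheme is an assumption, not the
    paper's documented output format.
    """
    spans: List[str] = []
    current: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            # A new span starts; close any span still open.
            if current:
                spans.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            # Continue the span that is currently open.
            current.append(ch)
        else:
            # "O" (or a stray "I" with no open span) closes the span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


# Illustrative answer text explaining the informal word "神马":
# "神马就是什么的意思" ("神马 simply means 什么").
answer = list("神马就是什么的意思")
# Hand-written tags standing in for a tagger's predictions.
tags = ["O", "O", "O", "O", "B", "I", "O", "O", "O"]
print(extract_spans(answer, tags))
```

Because the decoding works on characters and tags alone, it needs no word segmentation or other linguistic features, which is in keeping with the language-independence claim in the abstract.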

Notes

1. www.twitter.com.
2. www.weibo.com.
3. www.baidu.com.
4. www.zhidao.baidu.com.
5. Our dataset is available at www.github.com/tiantian002/.
6. www.ltp-cloud.com.
7. www.catalog.ldc.upenn.edu/LDC2009T14.

Acknowledgments

This work was supported by the 111 Project of China under Grant No. B08004, the National Natural Science Foundation of China (61273217, 61300080, 61671078), and the Ph.D. Programs Foundation of the Ministry of Education of China (20130005110004).

Author information

Authors and Affiliations

Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Tian Tian & WeiRan Xu

Corresponding author

Correspondence to Tian Tian.

Editor information

Editors and Affiliations

University of Lausanne, Lausanne, Switzerland
Alessandra Lintas
University of Genoa, Genoa, Italy
Stefano Rovetta
Universitat Pompeu Fabra, Barcelona, Spain
Paul F.M.J. Verschure
University of Lausanne, Lausanne, Switzerland
Alessandro E.P. Villa

Cite this paper

Tian, T., Xu, W. (2017). Chinese Lexical Normalization Based on Information Extraction: An Experimental Study. In: Lintas, A., Rovetta, S., Verschure, P., Villa, A. (eds) Artificial Neural Networks and Machine Learning – ICANN 2017. ICANN 2017. Lecture Notes in Computer Science, vol 10614. Springer, Cham. https://doi.org/10.1007/978-3-319-68612-7_25

DOI: https://doi.org/10.1007/978-3-319-68612-7_25
Published: 25 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68611-0
Online ISBN: 978-3-319-68612-7
eBook Packages: Computer Science, Computer Science (R0)
