Abstract
Mining ancient Chinese corpora is less convenient than mining modern Chinese ones: no complete dictionary of ancient Chinese words exists, so tokenizers perform poorly. Detecting new words in ancient Chinese texts is therefore important. In this paper, we improve the Apriori algorithm and use it to produce candidate character sequences, and we use a long short-term memory (LSTM) neural network to identify word boundaries. Furthermore, we design a word confidence feature to measure the confidence score of new words. The experimental results demonstrate that the improved Apriori-like algorithm greatly improves the recall of valid candidate character sequences, and that the average accuracy of our method on new word detection rises to 89.7%.
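To make the candidate-generation idea concrete, the following is a minimal sketch of plain Apriori-style mining of frequent contiguous character sequences. It is not the improved algorithm proposed in the paper, whose details the abstract does not give; the function name `frequent_sequences`, the parameters `min_support` and `max_len`, and the toy corpus are all assumptions made for illustration. The sketch relies only on the standard Apriori property that a sequence can be frequent only if its shorter sub-sequences are frequent.

```python
from collections import Counter


def frequent_sequences(lines, min_support=2, max_len=4):
    """Return {sequence: count} for contiguous character sequences that
    occur at least min_support times, grown one character per level.

    Hypothetical helper for illustration; not the paper's improved algorithm.
    """
    frequent = {}
    # Level 1: count single characters and keep the frequent ones.
    counts = Counter(ch for line in lines for ch in line)
    current = {s: c for s, c in counts.items() if c >= min_support}
    length = 1
    while current and length < max_len:
        frequent.update(current)
        length += 1
        counts = Counter()
        for line in lines:
            for i in range(len(line) - length + 1):
                gram = line[i:i + length]
                # Apriori pruning: count a sequence only if both of its
                # (length - 1)-character sub-sequences were frequent.
                if gram[:-1] in current and gram[1:] in current:
                    counts[gram] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
    frequent.update(current)
    return frequent


# Toy corpus (an assumption, not data from the paper).
corpus = ["天下太平", "天下大乱", "平天下"]
candidates = frequent_sequences(corpus, min_support=2, max_len=4)
print(candidates)
```

In a pipeline like the one the abstract describes, sequences such as these would only be candidates; deciding which of them are real words is the job of the boundary model and the confidence score.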
Acknowledgement
This work is supported in part by the National Basic Research (973) Program of China (No. 2013CB329606). The authors would like to thank Xinyu Wu, Chunzi Wu, Chang Liu and Zhao Tang for their help in tagging, and Bin Wu for his advice on this paper.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Xie, T., Wu, B., Wang, B. (2017). New Word Detection in Ancient Chinese Literature. In: Chen, L., Jensen, C., Shahabi, C., Yang, X., Lian, X. (eds) Web and Big Data. APWeb-WAIM 2017. Lecture Notes in Computer Science, vol 10367. Springer, Cham. https://doi.org/10.1007/978-3-319-63564-4_21
DOI: https://doi.org/10.1007/978-3-319-63564-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63563-7
Online ISBN: 978-3-319-63564-4
eBook Packages: Computer Science (R0)