High Order N-gram Model Construction and Application Based on Natural Annotation

Wang, Qibo; Rao, Gaoqi; Xun, Endong

doi:10.1007/978-3-030-38189-9_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11831)

Included in the following conference series:

Workshop on Chinese Lexical Semantics

Abstract

N-gram language models play an important role in NLP tasks. This paper proposes two language models based on language boundaries to address the challenge of very large language data: an intra-sentence boundary model and an inter-sentence boundary model. We developed a training tool on the Hadoop platform using the MapReduce programming model, and used a prefix tree to compress and store the model. We applied the model to boundary identification in syntactic parsing and obtained good results.
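
The pipeline outlined in the abstract (MapReduce-style n-gram counting followed by prefix-tree storage) can be illustrated with a minimal in-memory sketch. This is not the authors' Hadoop tool: the map and reduce steps are simulated in a single process, and the boundary markers `<s>` and `</s>`, the toy corpus, and all function and class names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import chain

# Hypothetical sentence-boundary markers (not taken from the paper).
BOS, EOS = "<s>", "</s>"


def map_ngrams(sentence, n):
    """Map step: emit (ngram, 1) for every n-gram of one tokenized sentence."""
    tokens = [BOS] + sentence + [EOS]
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n]), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct n-gram."""
    counts = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return counts


class TrieNode:
    """One token on a path; n-grams sharing a prefix share its nodes."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


class NgramTrie:
    """Prefix tree that stores n-gram counts at the end of each path."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, ngram, count):
        node = self.root
        for tok in ngram:
            node = node.children.setdefault(tok, TrieNode())
        node.count += count

    def count(self, ngram):
        node = self.root
        for tok in ngram:
            node = node.children.get(tok)
            if node is None:
                return 0  # unseen n-gram
        return node.count


# Toy corpus of pre-tokenized sentences (assumed input format).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]

# Simulate map -> shuffle -> reduce in one process, then load the trie.
counts = reduce_counts(chain.from_iterable(map_ngrams(s, 3) for s in corpus))
trie = NgramTrie()
for ngram, c in counts.items():
    trie.insert(ngram, c)
```

Because the two trigrams `("the", "cat", "sat")` and `("the", "cat", "ran")` share the prefix `the cat`, the trie stores that prefix once, which is the source of the compression the abstract mentions. On a real cluster the map and reduce functions would run as distributed tasks rather than as local generators.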

Acknowledgements

This work was supported by the Research Project of the National Language Committee (YBI135-90), the MOE Key Research Center Project (16JJD740004), and the Beijing Language and Culture University Research Project (19YJ130005).

Author information

Authors and Affiliations

College of Information Science, Beijing Language and Culture University, Beijing, China
Qibo Wang & Endong Xun
Research Institute of International Chinese Language Education, Beijing Language and Culture University, Beijing, China
Gaoqi Rao

Corresponding authors

Correspondence to Gaoqi Rao or Endong Xun.

Editor information

Editors and Affiliations

National Taiwan Normal University, Taipei, Taiwan
Jia-Fei Hong
Beijing Information Science and Technology University, Beijing, China
Yangsen Zhang
Beijing Language and Culture University, Beijing, China
Pengyuan Liu

About this paper

Cite this paper

Wang, Q., Rao, G., Xun, E. (2020). High Order N-gram Model Construction and Application Based on Natural Annotation. In: Hong, JF., Zhang, Y., Liu, P. (eds) Chinese Lexical Semantics. CLSW 2019. Lecture Notes in Computer Science, vol 11831. Springer, Cham. https://doi.org/10.1007/978-3-030-38189-9_34

DOI: https://doi.org/10.1007/978-3-030-38189-9_34
Published: 04 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38188-2
Online ISBN: 978-3-030-38189-9
eBook Packages: Computer Science, Computer Science (R0)
