High Order N-gram Model Construction and Application Based on Natural Annotation

  • Conference paper
  • In: Chinese Lexical Semantics (CLSW 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11831)

Abstract

Language models based on n-grams play an important role in NLP tasks. In this paper, we propose two language models based on language boundaries to address the challenge of very large language data: an intra-sentence boundary model and an inter-sentence boundary model. We developed a training tool on the Hadoop platform using MapReduce programming, and used a prefix tree to compress and store the model. We applied our model to boundary identification in syntactic parsing and achieved good results.
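The abstract does not give implementation details, so the following is only a minimal single-machine sketch of the general ideas it names: counting n-grams in a map/reduce style and storing the counts in a prefix tree. It assumes punctuation marks delimit segments and that each character is one token. The boundary symbols `<b>` and `</b>`, the punctuation set, and all function names are hypothetical choices made for illustration; this is not the authors' Hadoop tool or their boundary models.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical boundary markers; the paper's actual symbols are not stated.
BOS, EOS = "<b>", "</b>"

def split_on_punctuation(text):
    """Split text at punctuation (assumed boundary set) into character lists."""
    segments = re.split(r"[，。！？；、,.!?;]", text)
    return [list(s.strip()) for s in segments if s.strip()]

def map_ngrams(tokens, max_n):
    """Map step: emit (ngram, 1) for every order from 1 to max_n."""
    padded = [BOS] + tokens + [EOS]
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield tuple(padded[i:i + n]), 1

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each n-gram key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def count_ngrams(text, max_n):
    """Run the map and reduce steps over every segment of the text."""
    segments = split_on_punctuation(text)
    pairs = chain.from_iterable(map_ngrams(seg, max_n) for seg in segments)
    return reduce_counts(pairs)

class TrieNode:
    """Prefix-tree node; n-grams sharing a prefix share the same path."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def build_trie(counts):
    """Insert each counted n-gram into the prefix tree."""
    root = TrieNode()
    for ngram, c in counts.items():
        node = root
        for tok in ngram:
            node = node.children.setdefault(tok, TrieNode())
        node.count = c
    return root

def lookup(root, ngram):
    """Return the stored count of an n-gram, or 0 if it was never seen."""
    node = root
    for tok in ngram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count

counts = count_ngrams("我爱北京，我爱中国。", 2)
trie = build_trie(counts)
```

In this sketch, `lookup(trie, ("我", "爱"))` finds the bigram in both segments, while a bigram such as `("京", "我")` is never counted because the comma separates the two characters into different segments. On a real cluster the map and reduce steps would run as distributed jobs rather than as in-process generators.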

Acknowledgements

This work is supported by the Research Project of the National Language Committee (YBI135-90), the MOE Key Research Center Project (16JJD740004), and the Beijing Language and Culture University Research Project (19YJ130005).

Author information

Corresponding authors

Correspondence to Gaoqi Rao or Endong Xun.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Q., Rao, G., Xun, E. (2020). High Order N-gram Model Construction and Application Based on Natural Annotation. In: Hong, JF., Zhang, Y., Liu, P. (eds) Chinese Lexical Semantics. CLSW 2019. Lecture Notes in Computer Science, vol 11831. Springer, Cham. https://doi.org/10.1007/978-3-030-38189-9_34

  • DOI: https://doi.org/10.1007/978-3-030-38189-9_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-38188-2

  • Online ISBN: 978-3-030-38189-9

  • eBook Packages: Computer Science, Computer Science (R0)
