Tibetan Multi-word Expressions Identification Framework Based on News Corpora

  • Conference paper
Natural Language Understanding and Intelligent Applications (ICCPOL 2016, NLPCC 2016)

Abstract

This paper presents a framework for identifying Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao’s N-gram statistical algorithm and the Statistical Substring Reduction algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word coupling degree, and Tibetan syllable inside-word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results at different granularities of context analysis. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, which indicates that the framework also works well on larger corpora. Overall, the experimental results show acceptable performance for Tibetan MWE identification.
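The first phase and the P@N metric can be illustrated with a minimal sketch. It is not the authors' implementation: a plain counter stands in for Nagao's N-gram algorithm, a naive quadratic scan stands in for the linear-time Statistical Substring Reduction algorithm, and the reduction rule assumed here (drop an n-gram when a longer n-gram containing it has the same frequency) is only one common variant. The function names, the `max_n` and `min_freq` parameters, and the Latin placeholder tokens are all hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, max_n=4, min_freq=2):
    """Count word-based n-grams (2 <= n <= max_n) over segmented sentences.

    Assumption: each sentence is already a list of word tokens.
    Only n-grams occurring at least min_freq times are returned.
    """
    counts = Counter()
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def is_contiguous_sub(short, long):
    """Return True if tuple `short` occurs contiguously inside tuple `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def substring_reduction(freqs):
    """Naive substring reduction (assumed equal-frequency variant).

    An n-gram is dropped when some longer n-gram contains it and has
    exactly the same frequency, since the shorter one then never occurs
    outside the longer one.
    """
    kept = {}
    for gram, c in freqs.items():
        redundant = any(
            len(other) > len(gram) and c == other_c and is_contiguous_sub(gram, other)
            for other, other_c in freqs.items()
        )
        if not redundant:
            kept[gram] = c
    return kept


def precision_at_n(ranked_candidates, gold, n):
    """P@N: fraction of the top-n ranked candidates that are true MWEs."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)


# Placeholder tokens stand in for segmented Tibetan words.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
candidates = substring_reduction(count_ngrams(sentences, max_n=3, min_freq=2))
```

In this toy input, the bigram `("b", "c")` only ever appears inside the trigram `("a", "b", "c")`, so the reduction step discards it, while `("a", "b")` survives because it also occurs in a third sentence.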

References

  1. Smadja, F.: Retrieving collocations from text: Xtract. Comput. Linguist. 19(1), 143–177 (1993)

  2. Dagan, I., Church, K.: Termight: identifying and translating technical terminology. In: Proceedings of 4th Conference on Applied Natural Language Processing, Stuttgart, Germany, pp. 34–40 (1994)

  3. Daille, B.: Combined approach for terminology extraction: lexical statistics and linguistic filtering. Technical paper 5, UCREL, Lancaster University (1995)

  4. McEnery, T., Langé, J.-M., Oakes, M., Véronis, J.: The exploitation of multilingual annotated corpora for term extraction. In: Garside, R., Leech, G., McEnery, A. (eds.) Corpus Annotation – Linguistic Information from Computer Text Corpora, pp. 220–230. Longman, London (1997)

  5. Michiels, A., Dufour, N.: DEFI, a tool for automatic multi-word unit recognition, meaning assignment and translation selection. In: Proceedings of 1st International Conference on Language Resources & Evaluation, Granada, Spain, pp. 1179–1186 (1998)

  6. Maynard, D., Ananiadou, S.: TRUCKS: a model for automatic multiword term recognition. J. Nat. Lang. Process. 8(1), 101–126 (2000)

  7. Merkel, M., Andersson, M.: Knowledge-lite extraction of multi-word units with language filters and entropy thresholds. In: Proceedings of 2000 Conference User-Oriented Content-Based Text and Image Handling (RIAO 2000), Paris, France, pp. 737–746 (2000)

  8. Piao, S.S., McEnery, T.: Multi-word unit alignment in English-Chinese parallel corpora. In: Proceedings of Corpus Linguistics 2001, Lancaster, UK, pp. 466–475 (2001)

  9. Sag, I.A., Baldwin, T., Bond, F., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: LinGO Working Paper No. 2001-03, Stanford University, CA (2001)

  10. Baldwin, T., Bannard, C., Tanaka, T., Widdows, D.: An empirical model of multiword expression decomposability. In: Proceedings of ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 89–96 (2003)

  11. Dias, G.: Multiword unit hybrid extraction. In: Proceedings of Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, Sapporo, Japan, pp. 41–48 (2003)

  12. Nivre, J., Nilsson, J.: Multiword units in syntactic parsing. In: Proceedings of LREC-2004 Workshop on Methodologies & Evaluation of Multiword Units in Real-world Applications, Lisbon, Portugal, pp. 37–46 (2004)

  13. Pereira, R., Crocker, P., Dias, G.: A parallel multikey quicksort algorithm for mining multiword units. In: Proceedings of LREC-2004 Workshop on Methodologies & Evaluation of Multiword Units in Real-world Applications, Lisbon, Portugal, pp. 17–23 (2004)

  14. Piao, S.S., Rayson, P., Archer, D., Wilson, A., McEnery, T.: Extracting multiword expressions with a semantic tagger. In: Proceedings of Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, Sapporo, Japan, pp. 49–56 (2003)

  15. Piao, S.S., Rayson, P., Archer, D., McEnery, T.: Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Comput. Speech Lang. 19(4), 378–397 (2005)

  16. Rayson, P., Archer, D., Piao, S.S., McEnery, T.: The UCREL semantic analysis system. In: Proceedings of Workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in Association with LREC 2004, Lisbon, Portugal, pp. 7–12 (2004)

  17. Piao, S.S., Sun, G., Rayson, P., Yuan, Q.: Automatic extraction of Chinese multiword expressions with a statistical tool. In: Proceedings of 44th Annual Meeting of the Association for Computational Linguistics (2006)

  18. Jiang, D.: On syntactic chunks and formal markers of Tibetan. Minor. Lang. China (3), 30–39 (2003a)

  19. Jiang, D., Long, C.: The markers of non-finite VP of Tibetan and its automatic recognizing strategies. In: Proceedings of 20th International Conference on Computer Processing of Oriental Languages (ICCPOL 2003) (2003b)

  20. Huang, X., Sun, H., Jiang, D., Zhang, J., Tang, L.: The types and formal markers of nominal chunks in contemporary Tibetan. In: Proceedings of 8th Joint Conference on Computational Linguistics (JSCL 2005) (2005)

  21. Nuo, M., Liu, H., Ma, L., Wu, J., Ding, Z.: Construction of Chinese-Tibetan multi-word equivalence pair dictionary. J. Chin. Inf. Process. 26(3), 98–103 (2012)

  22. Nuo, M., Liu, H., Zhao, W., Ma, L., Wu, J., Ding, Z.: Tibetan base noun phrase identification framework based on Chinese-Tibetan sentence aligned corpus. In: Proceedings of 26th International Conference on Computational Linguistics Conference, pp. 2141–2157 (2012)

  23. Lü, X., Zhang, L., Hu, J.: Statistical substring reduction in linear time. In: Su, K.-Y., Tsujii, J., Lee, J.-H., Kwong, O.Y. (eds.) IJCNLP 2004. LNCS (LNAI), vol. 3248, pp. 320–327. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30211-7_34

  24. Nagao, M., Mori, S.: A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In: COLING-1994 (1994)

Acknowledgements

We thank the reviewers for their critical and constructive comments and suggestions, which helped us improve the quality of the paper. The research is partially supported by the National Science Foundation (No. 61303165) and the Informatization Project of the Chinese Academy of Sciences (No. XXH12504-1-10).

Author information

Corresponding author

Correspondence to Minghua Nuo.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Nuo, M., Lun, C., Liu, H. (2016). Tibetan Multi-word Expressions Identification Framework Based on News Corpora. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_2

  • DOI: https://doi.org/10.1007/978-3-319-50496-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50495-7

  • Online ISBN: 978-3-319-50496-4

  • eBook Packages: Computer Science, Computer Science (R0)
