Abstract
This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework has two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao’s N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by combining context analysis with language model-based analysis. The framework uses three strategies: context analysis, two-word Coupling Degree, and Tibetan syllable inside word probability. In the experiments, we evaluate the effectiveness of the three strategies on a small test set and compare results for context analysis at different granularities. On the small test corpus, an F-score above 75% is achieved when words are segmented in pre-processing. On a larger corpus, P@N (with N = 800) exceeds 85%, which indicates that the identification framework also works well on larger corpora. Overall, the experimental results show acceptable performance for Tibetan MWE identification.
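The first phase and the P@N measure can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: n-grams are counted with a plain dictionary rather than Nagao's suffix-sorting method, substring reduction is taken in its simplest form (drop an n-gram when a longer n-gram containing it has the same frequency), and the function names, the `min_freq` threshold, and the n-gram length limits are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, min_n=2, max_n=4):
    """Count word-based n-grams in pre-segmented sentences (lists of tokens).

    Assumption: simple counting stands in for Nagao's N-gram algorithm.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Keep frequent n-grams, dropping those subsumed by a longer one.

    An n-gram is treated as redundant when a longer n-gram that contains it
    occurs with exactly the same frequency (assumed reduction criterion).
    """
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    redundant = set()
    for gram, freq in frequent.items():
        # Compare each n-gram with its two (n-1)-word sub-grams.
        for sub in (gram[:-1], gram[1:]):
            if frequent.get(sub) == freq:
                redundant.add(sub)
    return {g: c for g, c in frequent.items() if g not in redundant}


def precision_at_n(ranked_candidates, gold, n):
    """Fraction of the top-n ranked candidates that are true MWEs."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in gold) / len(top)
```

In use, `sentences` would hold Tibetan word sequences produced by the segmenter, and the surviving n-grams would be passed as candidates to the second-phase strategies, which this sketch does not cover.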
References
Smadja, F.: Retrieving collocations from text: Xtract. Comput. Linguist. 19(1), 143–177 (1993)
Dagan, I., Church, K.: Termight: identifying and translating technical terminology. In: Proceedings of 4th Conference on Applied Natural Language Processing, Stuttgart, Germany, pp. 34–40 (1994)
Daille, B.: Combined approach for terminology extraction: lexical statistics and linguistic filtering. Technical paper 5, UCREL, Lancaster University (1995)
McEnery, T., Langé, J.-M., Oakes, M., Véronis, J.: The exploitation of multilingual annotated corpora for term extraction. In: Garside, R., Leech, G., McEnery, A. (eds.) Corpus Annotation – Linguistic Information from Computer Text Corpora, pp. 220–230. Longman, London (1997)
Michiels, A., Dufour, N.: DEFI, a tool for automatic multi-word unit recognition, meaning assignment and translation selection. In: Proceedings of 1st International Conference on Language Resources & Evaluation, Granada, Spain, pp. 1179–1186 (1998)
Diana, M., Sophia, A.: Trucks: a model for automatic multiword term recognition. J. Nat. Lang. Process. 8(1), 101–126 (2000)
Merkel, M., Andersson, M.: Knowledge-lite extraction of multi-word units with language filters and entropy thresholds. In: Proceedings of 2000 Conference User-Oriented Content-Based Text and Image Handling (RIAO 2000), Paris, France, pp. 737–746 (2000)
Piao, S.S., McEnery, T.: Multi-word unit alignment in English-Chinese parallel corpora. In: Proceedings of Corpus Linguistics 2001, Lancaster, UK, pp. 466–475 (2001)
Sag, I.A., Baldwin, T., Bond, F., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: LinGO Working Paper No. 2001-03, Stanford University, CA (2001)
Baldwin, T., Bannard, C., Tanaka, T., Widdows, D.: An empirical model of multiword expression decomposability. In: Proceedings of ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 89–96 (2003)
Dias, G.: Multiword unit hybrid extraction. In: Proceedings of Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, Sapporo, Japan, pp. 41–48 (2003)
Nivre, J., Nilsson, J.: Multiword units in syntactic parsing. In: Proceedings of LREC-2004 Workshop on Methodologies & Evaluation of Multiword Units in Real-world Applications, Lisbon, Portugal, pp. 37–46 (2004)
Pereira, R., Crocker, P., Dias, G.: A parallel multikey quicksort algorithm for mining multiword units. In: Proceedings of LREC-2004 Workshop on Methodologies & Evaluation of Multiword Units in Real-world Applications, Lisbon, Portugal, pp. 17–23 (2004)
Piao, S.S., Rayson, P., Archer, D., Wilson, A., McEnery, T.: Extracting multiword expressions with a semantic tagger. In: Proceedings of Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, Sapporo, Japan, pp. 49–56 (2003)
Piao, S.S., Rayson, P., Archer, D., McEnery, T.: Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Comput. Speech Lang. 19(4), 378–397 (2005)
Rayson, P., Archer, D., Piao, S.S., McEnery, T.: The UCREL semantic analysis system. In: Proceedings of Workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in Association with LREC 2004, Lisbon, Portugal, pp. 7–12 (2004)
Piao, S.S., Sun, G., Rayson, P., Yuan, Q.: Automatic extraction of Chinese multiword expressions with a statistical tool. In: Proceedings of 44th Annual Meeting of the Association for Computational Linguistics (2006)
Jiang, D.: On syntactic chunks and formal markers of Tibetan. Minor. Lang. China (3), 30–39 (2003a)
Jiang, D., Long, C.: The markers of non-finite VP of Tibetan and its automatic recognizing strategies. In: Proceedings of 20th International Conference on Computer Processing of Oriental Languages (ICCPOL 2003) (2003b)
Huang, X., Sun, H., Jiang, D., Zhang, J., Tang, L.: The types and formal markers of nominal chunks in contemporary Tibetan. In: Proceedings of 8th Joint Conference on Computational Linguistics (JSCL 2005) (2005)
Nuo, M., Liu, H., Ma, L., Wu, J., Ding, Z.: Construction of Chinese-Tibetan multi-word equivalence pair dictionary. J. Chin. Inf. Process. 26(3), 98–103 (2012)
Nuo, M., Liu, H., Zhao, W., Ma, L., Wu, J., Ding, Z.: Tibetan base noun phrase identification framework based on Chinese-Tibetan sentence aligned corpus. In: Proceedings of 26th International Conference on Computational Linguistics, pp. 2141–2157 (2012)
Lü, X., Zhang, L., Hu, J.: Statistical substring reduction in linear time. In: Su, K.-Y., Tsujii, J., Lee, J.-H., Kwong, O.Y. (eds.) IJCNLP 2004. LNCS (LNAI), vol. 3248, pp. 320–327. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30211-7_34
Nagao, M., Mori, S.: A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In: COLING-1994 (1994)
Acknowledgements
We thank the reviewers for their critical and constructive comments and suggestions, which helped us improve the quality of the paper. The research is partially supported by the National Science Foundation (No. 61303165) and the Informatization Project of the Chinese Academy of Sciences (No. XXH12504-1-10).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Nuo, M., Lun, C., Liu, H. (2016). Tibetan Multi-word Expressions Identification Framework Based on News Corpora. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_2
DOI: https://doi.org/10.1007/978-3-319-50496-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)