Part-of-speech tagging of Modern Hebrew text

Published online by Cambridge University Press:  01 April 2008

ROY BAR-HAIM
Affiliation:
Dept. of Computer Science, Bar-Ilan University, Ramat-Gan 52900, Israel e-mail: barhair@cs.biu.ac.il
KHALIL SIMA'AN
Affiliation:
Institute for Logic, Language and Computation, Universiteit van Amsterdam, Amsterdam, The Netherlands e-mail: simaan@science.uva.nl
YOAD WINTER
Affiliation:
Dept. of Computer Science, Technion, Haifa 32000, Israel, and Netherlands Institute for Advanced Study, Meijboomlaan 1, 2242 PR Wassenaar, The Netherlands e-mail: winter@cs.technion.ac.il

Abstract

Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization models, we propose a novel, linguistically motivated, intermediate tokenization model that outperforms the two initial architectures for Hebrew. Our study is based on the well-known hidden Markov models (HMMs). We start out from a manually devised morphological analyzer and a very small annotated corpus, and describe how to adapt an HMM-based POS tagger for both tokenization architectures. We present an effective technique for smoothing the lexical probabilities using an untagged corpus, and a novel transformation for casting the segment-level tagger in terms of a standard, word-level HMM implementation. The results obtained using our model are on par with the best published results on Modern Standard Arabic, despite the much smaller annotated corpus available for Modern Hebrew.
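
The contrast between word-level and segment-level tokenization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy lexicon, the transliterated word "bcl", the tag names PREP and NOUN, and both function names are assumptions made up for illustration. The lexicon stands in for a morphological analyzer that returns every candidate analysis of a word.

```python
from typing import Dict, List, Tuple

# One analysis is a sequence of (segment, POS tag) pairs.
Analysis = List[Tuple[str, str]]

# Hypothetical toy lexicon standing in for a morphological analyzer.
# The transliteration and tag names are illustrative assumptions.
TOY_ANALYSES: Dict[str, List[Analysis]] = {
    "bcl": [
        [("bcl", "NOUN")],                # one segment: "onion"
        [("b", "PREP"), ("cl", "NOUN")],  # two segments: "in (a) shadow"
    ],
}


def word_level_tags(word: str) -> List[str]:
    """Word-level view: the token is the whole word, and each candidate
    is a single complex tag that encodes segmentation and POS together."""
    return ["+".join(tag for _, tag in analysis)
            for analysis in TOY_ANALYSES[word]]


def segment_level_paths(word: str) -> List[List[str]]:
    """Segment-level view: each candidate is a different token sequence,
    so the tagger input must carry all alternative segmentations."""
    return [[segment for segment, _ in analysis]
            for analysis in TOY_ANALYSES[word]]


if __name__ == "__main__":
    print(word_level_tags("bcl"))      # ['NOUN', 'PREP+NOUN']
    print(segment_level_paths("bcl"))  # [['bcl'], ['b', 'cl']]
```

In the word-level view the ambiguity lives in the tag set, which grows with every combination of segment tags. In the segment-level view the tags stay simple, but the input is no longer a single token sequence.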

Type
Papers
Copyright
Copyright © Cambridge University Press 2007
