Abstract
Stemming is useful for various natural language processing tasks, such as document indexing and text classification, so identifying the correct root of a given word is important. For Hebrew this is not a trivial task, owing to the complexity of Hebrew morphology and orthography. Many Hebrew words are ambiguous in the sense that each can be derived from several possible roots. In a specific context, however, a given word has only one correct root, or no root at all. We have developed a variety of features for finding the correct root of an ambiguous Hebrew word. These features fall into three groups: root-based features, conjugation-based features, and statistical features. Several common machine learning methods were tested in order to find a successful combination of the features. The best result was achieved by Naïve Bayes, with about 87% accuracy.
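To make the idea of combining feature groups with Naïve Bayes concrete, the following is a minimal sketch and not the chapter's implementation. It assumes each candidate root of a word is described by one boolean feature per group; the feature names (`root_in_lexicon`, `pattern_matches`, `root_is_frequent`), the toy training rows, the candidate labels, and the 0.5 threshold used to return "no root" are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.n = len(labels)
        self.label_counts = Counter(labels)
        # value_counts[(label, feature)][value] -> number of training rows
        self.value_counts = defaultdict(Counter)
        self.values = defaultdict(set)  # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.values[feature].add(value)
        return self

    def log_score(self, row, label):
        """Unnormalised log posterior: log P(label) + sum of log P(value | label)."""
        score = math.log(self.label_counts[label] / self.n)
        for feature, value in row.items():
            count = self.value_counts[(label, feature)][value]
            total = self.label_counts[label] + len(self.values[feature])
            score += math.log((count + 1) / total)
        return score

    def prob_correct(self, row):
        """Posterior probability that a candidate root is the correct one."""
        pos = math.exp(self.log_score(row, True))
        neg = math.exp(self.log_score(row, False))
        return pos / (pos + neg)


def feats(in_lexicon, pattern, frequent):
    """Hypothetical features: one root-based, one conjugation-based, one statistical."""
    return {
        "root_in_lexicon": in_lexicon,
        "pattern_matches": pattern,
        "root_is_frequent": frequent,
    }


def pick_root(model, candidates, threshold=0.5):
    """Return the most probable candidate root, or None if none is convincing.

    The threshold is an assumption of this sketch, standing in for the
    'no root at all' outcome mentioned in the abstract.
    """
    best = max(candidates, key=lambda name: model.prob_correct(candidates[name]))
    return best if model.prob_correct(candidates[best]) >= threshold else None


# Invented toy data: (features of a candidate root, whether it was correct).
TRAIN = [
    (feats(True, True, True), True),
    (feats(True, True, False), True),
    (feats(True, False, True), True),
    (feats(False, False, False), False),
    (feats(True, False, False), False),
    (feats(False, True, False), False),
]

model = NaiveBayes().fit([row for row, _ in TRAIN], [label for _, label in TRAIN])

# Two hypothetical candidate roots for one ambiguous word.
candidates = {
    "candidate_root_1": feats(True, True, True),
    "candidate_root_2": feats(False, True, False),
}
chosen = pick_root(model, candidates)
```

Scoring each candidate root separately and keeping the highest posterior is one simple way to turn a binary classifier into a disambiguator; the chapter may frame the learning problem differently.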
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
HaCohen-Kerner, Y., Erlich, O.T. (2014). Identifying the Correct Root of an Ambiguous Hebrew Word. In: Dershowitz, N., Nissan, E. (eds) Language, Culture, Computation. Computational Linguistics and Linguistics. Lecture Notes in Computer Science, vol 8003. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45327-4_3
DOI: https://doi.org/10.1007/978-3-642-45327-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45326-7
Online ISBN: 978-3-642-45327-4
eBook Packages: Computer Science (R0)