Identifying the Correct Root of an Ambiguous Hebrew Word

HaCohen-Kerner, Yaakov; Erlich, Ofir Tzvi

doi:10.1007/978-3-642-45327-4_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8003)

Abstract

Stemming is useful for various natural language processing tasks, such as document indexing and text classification, so identifying the correct root of a given word is important. For Hebrew this is not a trivial task, owing to the complexity of Hebrew morphology and orthography. Many Hebrew words are ambiguous in the sense that each can be derived from several possible roots. In a specific context, however, a given word has only one correct root, or no root at all. We have developed a variety of features for finding the correct root of an ambiguous Hebrew word. These features fall into three distinct groups: root-based features, conjugation-based features, and statistical features. Several common machine learning methods were tested in order to find a successful integration of the features. The best result was achieved by Naïve Bayes, with about 87% accuracy.
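
The general idea of scoring candidate roots with Naïve Bayes can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not list the actual features, so the three feature names (one per feature group), the toy training rows, the placeholder root names, and the `min_prob` cut-off for "no root at all" are all invented assumptions.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Tiny Naive Bayes over discrete features with add-one smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(feature, label)][value] = number of training rows
        self.value_counts = defaultdict(Counter)
        # Distinct values seen in training, per feature (used for smoothing)
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(feature, label)][value] += 1
                self.values[feature].add(value)
        return self

    def log_score(self, row, label):
        """Unnormalised log P(label) plus the sum of log P(value | label)."""
        score = math.log(self.label_counts[label] / self.total)
        for feature, value in row.items():
            count = self.value_counts[(feature, label)][value]
            denom = self.label_counts[label] + len(self.values[feature])
            score += math.log((count + 1) / denom)
        return score

    def prob_correct(self, row):
        """Posterior probability that a candidate root is the correct one."""
        logs = {label: self.log_score(row, label) for label in self.label_counts}
        peak = max(logs.values())
        weights = {label: math.exp(v - peak) for label, v in logs.items()}
        return weights[True] / sum(weights.values())


def choose_root(model, candidates, min_prob=0.5):
    """Return the best-scoring candidate root, or None if none is convincing.

    min_prob is a hypothetical threshold standing in for the "no root" case.
    """
    best_root, best_prob = None, 0.0
    for root, features in candidates.items():
        prob = model.prob_correct(features)
        if prob > best_prob:
            best_root, best_prob = root, prob
    return best_root if best_prob >= min_prob else None


# Invented toy data: one row per (word, candidate root) pair.
# Label True means the candidate was the correct root in context.
TRAIN = [
    ({"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "high"}, True),
    ({"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "low"}, True),
    ({"root_in_lexicon": True, "pattern_matches": False, "root_frequency": "high"}, True),
    ({"root_in_lexicon": True, "pattern_matches": False, "root_frequency": "low"}, False),
    ({"root_in_lexicon": False, "pattern_matches": True, "root_frequency": "low"}, False),
    ({"root_in_lexicon": False, "pattern_matches": False, "root_frequency": "low"}, False),
]

rows, labels = zip(*TRAIN)
model = CategoricalNaiveBayes().fit(rows, labels)

# Two hypothetical candidate roots for one ambiguous word.
CANDIDATES = {
    "root_A": {"root_in_lexicon": True, "pattern_matches": True, "root_frequency": "high"},
    "root_B": {"root_in_lexicon": False, "pattern_matches": True, "root_frequency": "low"},
}

print(choose_root(model, CANDIDATES))
```

On this toy data the first candidate receives a posterior of about 0.9 and the second about 0.2, so `choose_root` selects `root_A`. If only the second candidate were offered, the function would return `None`. These numbers say nothing about the accuracy reported in the chapter.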

Author information

Authors and Affiliations

Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi St., P.O.B. 16031, 9116001, Jerusalem, Israel
Yaakov HaCohen-Kerner & Ofir Tzvi Erlich

Editor information

Editors and Affiliations

School of Computer Science, Tel Aviv University, Tel Aviv, Israel
Nachum Dershowitz
Goldsmiths College, Department of Computing, University of London, 25–27 St. James, New Cross, SE14 6NW, London, UK
Ephraim Nissan

About this chapter

Cite this chapter

HaCohen-Kerner, Y., Erlich, O.T. (2014). Identifying the Correct Root of an Ambiguous Hebrew Word. In: Dershowitz, N., Nissan, E. (eds) Language, Culture, Computation. Computational Linguistics and Linguistics. Lecture Notes in Computer Science, vol 8003. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45327-4_3

DOI: https://doi.org/10.1007/978-3-642-45327-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45326-7
Online ISBN: 978-3-642-45327-4
eBook Packages: Computer Science, Computer Science (R0)
