Abstract
Electronic Document Management (EDM) technology is being widely adopted because it allows documents to be routed and retrieved efficiently. Optical Character Recognition (OCR) is an important front end for such technology. Excellent OCR now exists for Latin-based languages, but few systems read Arabic, which limits the penetration of EDM into Arabic-speaking countries. Developing an OCR system for Arabic requires a database of Arabic words. Such a database has many uses in addition to training and testing a recognition system. This paper provides a comprehensive study and analysis of Arabic words and explains how such a database was constructed. Unlike earlier studies, it describes a database built from a large collection of Arabic words (6 million). It also considers connected segments, or Pieces of Arabic Words (PAWs), as well as Naked Pieces of Arabic Words (NPAWs), that is, PAWs without diacritics. Background information on the Arabic language is also presented.
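The notion of PAWs and NPAWs can be illustrated with a minimal sketch. It is not the authors' procedure; it only assumes that a PAW ends after any letter that does not join to the following letter, and that "diacritics" means the Unicode combining marks (harakat). If letter dots are also meant, a further mapping to dotless letter skeletons would be needed, which is not shown. The set `NON_FORWARD_JOINING` and the function names are hypothetical and cover only the basic Arabic letters.

```python
import unicodedata

# Assumption: basic Arabic letters that never connect to the letter after them
# (hamza, the alef forms, waw forms, teh marbuta, dal, thal, reh, zain).
NON_FORWARD_JOINING = set(
    "\u0621\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)


def is_mark(ch):
    """True for nonspacing combining marks such as fatha, kasra, or sukun."""
    return unicodedata.category(ch) == "Mn"


def split_paws(word):
    """Split a word into connected pieces (PAWs), keeping marks with their letter."""
    paws = []
    current = ""
    break_before_next = False
    for ch in word:
        # Start a new piece only at a base letter, so trailing marks stay attached.
        if break_before_next and not is_mark(ch) and current:
            paws.append(current)
            current = ""
        current += ch
        if not is_mark(ch):
            break_before_next = ch in NON_FORWARD_JOINING
    if current:
        paws.append(current)
    return paws


def naked_paws(word):
    """Return the PAWs of a word with combining marks removed (NPAWs, as assumed here)."""
    return ["".join(c for c in paw if not is_mark(c)) for paw in split_paws(word)]


# "madrasa" (school) written with vowel marks: meem, dal, reh, seen, teh marbuta.
vocalised = "\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629"
print(split_paws(vocalised))   # three pieces, marks retained
print(naked_paws(vocalised))   # the same three pieces without marks
```

Under these assumptions the example word splits into three pieces, because dal and reh do not connect to the letter that follows them.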
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
AbdelRaouf, A., Higgins, C.A., Khalil, M. (2008). A Database for Arabic Printed Character Recognition. In: Campilho, A., Kamel, M. (eds) Image Analysis and Recognition. ICIAR 2008. Lecture Notes in Computer Science, vol 5112. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69812-8_56
DOI: https://doi.org/10.1007/978-3-540-69812-8_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69811-1
Online ISBN: 978-3-540-69812-8
eBook Packages: Computer Science, Computer Science (R0)