Abstract
Traditionally, a corpus is a large, structured set of texts that is stored and processed electronically. Corpora have become very important in the study of languages and have opened areas of linguistic research that were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications: access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language, which limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics, and provides a comprehensive study and analysis of it. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments, or pieces of Arabic words (PAWs), as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), that is, PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset have been carried out and are presented. MMAC was also tested using commercial OCR software and is publicly and freely available.
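To make the notion of a PAW concrete, the following minimal Python sketch splits a word into its connected segments. It is an illustration only and not the procedure used to build MMAC: the set of non-joining letters (`NON_JOINING`), the separate handling of the isolated hamza, and the function name `split_paws` are assumptions introduced here. The sketch also ignores tatweel, ligatures and non-Arabic characters.

```python
import unicodedata

# Assumption: letters that do not join to the FOLLOWING letter,
# so a connected segment ends after any of them.
NON_JOINING = {
    "\u0627",  # alef
    "\u0623",  # alef with hamza above
    "\u0625",  # alef with hamza below
    "\u0622",  # alef with madda above
    "\u062F",  # dal
    "\u0630",  # thal
    "\u0631",  # reh
    "\u0632",  # zain
    "\u0648",  # waw
    "\u0624",  # waw with hamza above
}

# Assumption: the isolated hamza joins to neither neighbour,
# so it is treated as a segment of its own.
HAMZA = "\u0621"


def split_paws(word):
    """Split one Arabic word into connected segments (hypothetical helper)."""
    paws = []
    current = ""
    for ch in word:
        if unicodedata.combining(ch):
            # Combining marks stay attached to the letter they follow.
            if current:
                current += ch
            elif paws:
                paws[-1] += ch
            else:
                current = ch
            continue
        if ch == HAMZA:
            # Close the open segment, then emit hamza by itself.
            if current:
                paws.append(current)
                current = ""
            paws.append(ch)
            continue
        current += ch
        if ch in NON_JOINING:
            # A non-joining letter ends the current segment.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws
```

Under these assumptions, `split_paws("\u0643\u062A\u0627\u0628")` (kaf, teh, alef, beh) yields two segments, because the alef does not join to the beh that follows it.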
Cite this article
AbdelRaouf, A., Higgins, C.A., Pridmore, T. et al. Building a multi-modal Arabic corpus (MMAC). IJDAR 13, 285–302 (2010). https://doi.org/10.1007/s10032-010-0128-2
DOI: https://doi.org/10.1007/s10032-010-0128-2