RASAM – A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 Workshops (ICDAR 2021)

Abstract

Arabic scripts raise numerous issues in text recognition and layout analysis. To overcome these, several datasets and methods have been proposed in recent years. However, these focus on common scripts and layouts, and many Arabic writings and written traditions remain under-resourced. We therefore propose a new dataset comprising 300 images representative of the handwritten production of the Arabic Maghrebi scripts. This dataset is the result of a collaborative effort undertaken in the first quarter of 2021, and it offers several levels of annotation and transcription. The article sheds light on the specificities of these writings and manuscripts and highlights the challenges they pose for recognition. The collaborative tools used to create the dataset are assessed, and the dataset itself is evaluated with state-of-the-art methods in layout analysis. The word-based text recognition method tested on these writings achieves a CER of 4.8% on average. The pipeline described provides practical feedback on the rapid creation of data and the training of effective HTR systems for Arabic scripts and non-Latin scripts in general.

This work was carried out with the financial support of the French Ministry of Higher Education, Research and Innovation. It is in line with the scientific focus on digital humanities defined by the Research Consortium Middle-East and Muslim Worlds (GIS MOMM). We would also like to thank all the transcribers and people who took part in the hackathon and ensured its successful completion.

Notes

  1. The history and origins of these scripts have been the subject of an important and still open scholarly debate [3, 4]. The most recent works, in particular those of U. Bongianino, have foregrounded the different itineraries (from books to Qurans, from al-Andalus to the Maghreb) followed by these writings between the 10th and the 13th centuries [4].

  2. Characteristics are taken from U. Bongianino [4]; theoretical realizations are taken from the article by N. Van den Boogert, upon which U. Bongianino draws [13].

  3. https://github.com/calfa-co/rasam-dataset.

  4. The BULAC holds the second largest collection of Arabic manuscripts in France (2,458 identified documentary units). The BULAC collections contain a substantial proportion of manuscripts copied in Maghrebi script. 150 Arabic manuscripts are available online on the website of the BINA digital library.

  5. MS.ARA.1977, Collections patrimoniales numérisées de la BULAC.

  6. Muḥammad b. Mubārak al-Barāšī is also the copyist of the second text. No copyist is mentioned for the third text; the paleographical characteristics of the pages lead us to assume that it is the work of another hand.

  7. MS.ARA.609, Collections patrimoniales numérisées de la BULAC.

  8. See the bibliographic record on CALAMES.

  9. https://vision.calfa.fr.

  10. The numbering class is not kept in v1.1 of the dataset, for which we observe a 9% gain on average in the identification of the catchword and table classes.

  11. With better polygons (dataset v1.1), the CER decreases more quickly (16.6 for batch 1, then 15.87, 13.67, 11.52, and finally 6.67 for the last batch).
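The character error rate (CER) quoted in the abstract and in note 11 is conventionally the Levenshtein edit distance between the predicted and reference transcriptions, divided by the length of the reference. The page does not state the exact formula or text normalization used, so the following is only a minimal sketch under that assumption; the function names `levenshtein` and `cer` are illustrative and are not taken from the paper or its repository.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or keep a match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length.
    Assumption: characters are Unicode code points, with no normalization."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five in the reference.
    print(cer("kitab", "kitib"))
```

Because Python strings are compared by code point, Arabic text with combining diacritics may yield different values depending on whether and how the transcriptions are Unicode-normalized beforehand; any such normalization would be an additional choice not documented on this page.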

References

  1. Abdelhaleem, A., Droby, A., Asi, A., Kassis, M., Asam, R.A., El-sanaa, J.: WAHD: a database for writer identification of Arabic historical documents. In: 2017 1st International Workshop on Arabic Script Analysis and Recognition, pp. 64–68 (2017)

  2. Adam, K., Baig, A., Al-Maadeed, S., Bouridane, A., El-Menshawy, S.: KERTAS: dataset for automatic dating of ancient Arabic manuscripts. Int. J. Doc. Anal. Recogn. (IJDAR) 21(4), 283–290 (2018). https://doi.org/10.1007/s10032-018-0312-3

  3. Ben Azzouza, N.: Les corans de l’occident musulman médiéval : état des recherches et nouvelles perspectives. Perspectives 2, 104–130 (2017)

  4. Bongianino, U.: The origins and developments of Maghribī round scripts, Arabic Paleography in the Islamic West (4th/10th-6th/12th centuries). Ph.D. thesis, University of Oxford (2017)

  5. Camps, J.B., Vidal-Gorène, C., Vernet, M.: Handling heavily abbreviated manuscripts: HTR engines vs text normalisation approaches (2021). Accepted for IWCP workshop of ICDAR 2021

  6. Clausner, C., Antonacopoulos, A., Mcgregor, N., Wilson-Nunn, D.: ICFHR 2018 competition on recognition of historical Arabic scientific manuscripts - RASM2018. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476 (2018)

  7. Clérice, T.: Evaluating deep learning methods for word segmentation of Scripta Continua texts in old French and Latin. J. Data Min. Digit. Humanit. 2020 (2020). https://jdmdh.episciences.org/6264

  8. Diem, M., Kleber, F., Sablatnig, R., Gatos, B.: cBAD: ICDAR2019 competition on baseline detection. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1494–1498 (2019)

  9. Kassis, M., Abdalhaleem, A., Droby, A., Alaasam, R., El-Sana, J.: VML-HD: the historical Arabic documents dataset for recognition systems. In: 2017 1st International Workshop on Arabic Script Analysis and Recognition, pp. 11–14 (2017)

  10. Kiessling, B., Ezra, D.S.B., Miller, M.T.: BADAM: a public dataset for baseline detection in Arabic-script manuscripts. In: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing. HIP 2019, pp. 13–18. Association for Computing Machinery (2019)

  11. Milo, T., Martínez, A.G.: A new strategy for Arabic OCR: archigraphemes, letter blocks, script grammar, and shape synthesis. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. DATeCH2019, pp. 93–96. Association for Computing Machinery, New York (2019)

  12. Pantke, W., Dennhardt, M., Fecker, D., Märgner, V., Fingscheidt, T.: An historical handwritten Arabic dataset for segmentation-free word spotting - HADARA80P. In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 15–20 (2014)

  13. Van Den Boogert, N.: Some notes on Maghribi script. Manuscripts Middle East 4, 30–43 (1989)

  14. Vidal-Gorène, C., Dupin, B., Decours-Perez, A., Riccioli, T.: A modular and automated annotation platform for handwritings: evaluation on under-resourced languages (2021). Accepted for ICDAR 2021 Main Conference

Author information

Corresponding author

Correspondence to Chahan Vidal-Gorène.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Vidal-Gorène, C., Lucas, N., Salah, C., Decours-Perez, A., Dupin, B. (2021). RASAM – A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12916. Springer, Cham. https://doi.org/10.1007/978-3-030-86198-8_19

  • DOI: https://doi.org/10.1007/978-3-030-86198-8_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86197-1

  • Online ISBN: 978-3-030-86198-8

  • eBook Packages: Computer Science (R0)