Automatic identification of noise in degraded historical documents

  • Original Paper
Signal, Image and Video Processing

Abstract

The classification of degradation in historical document images plays a pivotal role in their preservation and restoration. This paper introduces a novel approach for noise classification using classical machine learning techniques, specifically Multi-Layer Perceptrons (MLPs). We assembled a comprehensive dataset of historical documents from a range of public sources, from which global and local statistical features were extracted for MLP training and validation. Through extensive experimentation, we determined the optimal MLP architecture and evaluated its performance. The model was rigorously tested through both unblind and blind testing scenarios. Unblind testing, utilizing images from the same collections as the training set, achieved a robust accuracy of 97.22%. Blind testing, performed with the distinct PHIBD dataset, demonstrated a 90% accuracy, outperforming current state-of-the-art deep learning models. These results affirm the model’s robustness and its potential for practical application in historical document analysis.
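
The abstract does not say which global and local statistics are extracted, so the following is only a minimal sketch under assumed choices: the global features are the mean and standard deviation of all pixel intensities, and the local features are the mean and spread of per-block standard deviations over non-overlapping blocks. The function names, the block size, and the toy image are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def global_features(img):
    """Mean and standard deviation over all pixels of a grayscale image.

    `img` is a list of rows of intensities (assumed representation).
    """
    pixels = [p for row in img for p in row]
    return [mean(pixels), pstdev(pixels)]

def local_features(img, block=2):
    """Mean and spread of the standard deviations of non-overlapping blocks.

    The block size is an illustrative assumption, not the paper's setting.
    """
    h, w = len(img), len(img[0])
    block_stds = []
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            patch = [img[y][x]
                     for y in range(top, top + block)
                     for x in range(left, left + block)]
            block_stds.append(pstdev(patch))
    return [mean(block_stds), pstdev(block_stds)]

def feature_vector(img, block=2):
    """Concatenate global and local statistics into one feature vector."""
    return global_features(img) + local_features(img, block)

# Toy 4x4 "image": three flat blocks and one noisy block (bottom right).
toy = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 0, 10],
    [0, 0, 10, 0],
]
vec = feature_vector(toy)
```

A vector like `vec` is the kind of input an MLP classifier (for instance scikit-learn's `MLPClassifier`) could be trained on, one vector per document image with a degradation label; the architecture the authors selected is not given in this preview.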

Data availability

No datasets were generated or analysed during the current study.

Notes

  1. https://gallica.bnf.fr/

  2. https://www.bibliotheque.nat.tn/

  3. https://www.bibalex.org/Manuscriptscenter/ar/home/index.aspx

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

All authors contributed to the data collection and analysis. The study conception and design were performed by A.K. and I.B. The first draft of the manuscript was written by A.K., A.H., and C.F., and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Abderrahmane Kefali.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kefali, A., Bouacha, I., Haddad, A.A. et al. Automatic identification of noise in degraded historical documents. SIViP 19, 95 (2025). https://doi.org/10.1007/s11760-024-03725-w

  • DOI: https://doi.org/10.1007/s11760-024-03725-w
