Abstract
The classification of degradation in historical document images plays a pivotal role in their preservation and restoration. This paper introduces a novel approach to noise classification based on classical machine learning, specifically Multi-Layer Perceptrons (MLPs). We assembled a dataset of historical documents from a range of public sources and extracted global and local statistical features from it for MLP training and validation. Through extensive experimentation, we determined the best-performing MLP architecture and evaluated it in both unblind and blind testing scenarios. Unblind testing, which used images from the same collections as the training set, achieved an accuracy of 97.22%. Blind testing on the separate PHIBD dataset reached 90% accuracy, outperforming current state-of-the-art deep learning models. These results support the model’s robustness and its potential for practical application in historical document analysis.
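The abstract does not list the specific statistical features used. As a minimal sketch only, the following illustrates what "global and local statistical features" might look like for a grayscale image stored as a list of rows. The function names, the choice of mean and standard deviation, and the block size are all assumptions made for illustration, not the authors' feature set.

```python
from statistics import mean, pstdev

# Hypothetical feature set: the paper's actual features are not given in the abstract.

def global_features(img):
    """Mean and population standard deviation over every pixel of the image."""
    pixels = [p for row in img for p in row]
    return [mean(pixels), pstdev(pixels)]

def local_features(img, block=2):
    """Mean and maximum of the standard deviations of non-overlapping blocks."""
    height, width = len(img), len(img[0])
    block_stds = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            patch = [img[y][x]
                     for y in range(top, top + block)
                     for x in range(left, left + block)]
            block_stds.append(pstdev(patch))
    return [mean(block_stds), max(block_stds)]

def feature_vector(img, block=2):
    """Concatenate global and local statistics into one input vector."""
    return global_features(img) + local_features(img, block)

# A 4x4 image with one bright, block-aligned region: the global spread is
# non-zero while every 2x2 block is internally uniform.
example = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
print(feature_vector(example))
```

A vector of this kind, computed per image, could then serve as the input to an MLP classifier (for instance scikit-learn's `MLPClassifier`, which is an assumption here rather than the tool named by the authors).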
Data availability
No datasets were generated or analysed during the current study.
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
All authors contributed to the data collection and analysis. The study conception and design were performed by A.K. and I.B. The first draft of the manuscript was written by A.K., A.H., and C.F., and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Kefali, A., Bouacha, I., Haddad, A.A. et al. Automatic identification of noise in degraded historical documents. SIViP 19, 95 (2025). https://doi.org/10.1007/s11760-024-03725-w
DOI: https://doi.org/10.1007/s11760-024-03725-w