SPEdu: A Toolbox for Processing Digitized Historical Documents

Gomes Rocha, Fabio; Rodriguez, Guillermo

doi:10.1007/978-3-030-60887-3_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12469)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Historical-educational documentary sources have gained considerable attention in educational contexts. However, access to some sources is hampered by serious problems such as inadequate infrastructure, poor preservation, and a lack of qualified personnel. In addition, a large share of the documents has not been digitized, which makes research difficult. As a consequence, sources of information need to be transcribed, digitized, and cataloged before large volumes of data can be analyzed. To address this issue, we present SPEdu, a tool for digitizing the sources of information required by research on the History of Education. The workflow of SPEdu is divided into three steps. First, SPEdu acquires images from an information source. Second, the tool preprocesses the images and extracts features from them. Finally, a supervised machine learning module classifies the images as text or non-text. To assess the viability of SPEdu, we used the Official Gazette of the State of Sergipe. For the third step, we evaluated the performance of several classification algorithms: J48, Logistic Regression, Multi-layered Perceptron (MLP), Naive Bayes, Random Forest, and Random Tree. The results show that Random Forest outperformed the other techniques, with an average accuracy of 95%.
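The three-step workflow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy pixel grids stand in for acquired page images, the two features (ink density and ink/background transitions per row) are hypothetical choices, and a nearest-centroid classifier is swapped in for the Random Forest evaluated in the paper so that the example runs with the standard library alone.

```python
from statistics import mean

# Step 1 (acquisition, assumed): toy grayscale blocks, 0 = ink, 255 = paper.
# Real input would be regions cut from scanned gazette pages.
TRAINING_BLOCKS = [
    ([[255, 0, 255, 0, 255, 0], [0, 255, 0, 255, 0, 255]], "text"),
    ([[255, 0, 0, 255, 0, 255], [255, 255, 0, 255, 0, 255]], "text"),
    ([[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 255]], "non_text"),
    ([[255] * 6, [255, 255, 255, 255, 255, 0]], "non_text"),
]
TEST_BLOCKS = [
    ([[0, 255, 255, 0, 255, 0], [255, 0, 255, 0, 0, 255]], "text"),
    ([[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], "non_text"),
]


def binarize(block, threshold=128):
    """Step 2a (preprocessing): map each pixel to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in block]


def extract_features(binary):
    """Step 2b: return (ink density, mean ink/background transitions per row)."""
    pixels = [px for row in binary for px in row]
    density = sum(pixels) / len(pixels)
    transitions = mean(
        sum(1 for a, b in zip(row, row[1:]) if a != b) for row in binary
    )
    return (density, transitions)


def train_centroids(samples):
    """Step 3a: average the feature vectors of each label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }


def classify(features, centroids):
    """Step 3b: pick the label whose centroid is nearest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda label: sum(
            (f - c) ** 2 for f, c in zip(features, centroids[label])
        ),
    )


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


centroids = train_centroids(
    [(extract_features(binarize(block)), label) for block, label in TRAINING_BLOCKS]
)
predictions = [
    classify(extract_features(binarize(block)), centroids)
    for block, _ in TEST_BLOCKS
]
test_accuracy = accuracy(predictions, [label for _, label in TEST_BLOCKS])
print(predictions, test_accuracy)
```

In a realistic setting the fixed threshold would likely be replaced by an adaptive binarization method, and the classifier by one of the algorithms the abstract names; the structure of acquire, preprocess and extract, then classify would stay the same.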

Author information

Authors and Affiliations

Universidade Tiradentes, Aracaju, Sergipe, Brazil
Fabio Gomes Rocha
Instituto de Tecnologia e Pesquisa - ITP, Aracaju, Sergipe, Brazil
Fabio Gomes Rocha
ISISTAN (UNICEN-CONICET) Research Institute, Tandil, Buenos Aires, Argentina
Guillermo Rodriguez

Corresponding author

Correspondence to Guillermo Rodriguez.

Editor information

Editors and Affiliations

Facultad de Ingeniería, Universidad Panamericana, Mexico City, Mexico
Lourdes Martínez-Villaseñor
Universidad Autónoma Metropolitana, Mexico City, Mexico
Oscar Herrera-Alcántara
Facultad de Ingeniería, Universidad Panamericana, Mexico City, Mexico
Hiram Ponce
Universidad Autónoma del Estado de Hidalgo, Hidalgo, Mexico
Félix A. Castro-Espinoza

About this paper

Cite this paper

Gomes Rocha, F., Rodriguez, G. (2020). SPEdu: A Toolbox for Processing Digitized Historical Documents. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_32

DOI: https://doi.org/10.1007/978-3-030-60887-3_32
Published: 07 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60886-6
Online ISBN: 978-3-030-60887-3
eBook Packages: Computer Science, Computer Science (R0)
