Abstract
Historical-educational documentary sources have gained considerable attention in educational contexts. However, some sources suffer from serious problems such as inadequate infrastructure, poor preservation, and a lack of qualified personnel. In addition, a large share of the documents has not been digitized, which makes research difficult. As a consequence, sources of information need to be transcribed, digitized, and cataloged before large volumes of data can be analyzed. To address this need, we present SPEdu, a tool for digitizing the sources of information required by research on the History of Education. The workflow of SPEdu is divided into three steps. First, SPEdu acquires images from an information source. Second, the tool preprocesses the images and extracts features from them. Finally, a supervised machine learning module classifies each image as text or non-text. To assess the viability of SPEdu, we used the Official Gazette of the State of Sergipe. For the third step, we evaluated the performance of several classification algorithms: J48, Logistic Regression, Multilayer Perceptron (MLP), Naive Bayes, Random Forest, and Random Tree. The results show that Random Forest outperformed the other techniques, with an average accuracy of 95%.
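The three-step workflow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the image blocks are synthetic, the two features (ink density and ink/background transitions per row) are hypothetical, the fixed binarization threshold is arbitrary, and a nearest-centroid classifier stands in for the Random Forest evaluated in the paper so that the example needs only the Python standard library.

```python
import random

DARK, LIGHT = 40, 220  # assumed grayscale levels for ink and paper


def make_text_block(rng, size=32):
    """Step 1 stand-in: synthetic 'text' (two noisy ink rows, two blank rows, repeated)."""
    return [
        [DARK if (r % 4 < 2 and rng.random() < 0.5) else LIGHT for _ in range(size)]
        for r in range(size)
    ]


def make_picture_block(rng, size=32):
    """Step 1 stand-in: synthetic 'non-text' (one solid dark rectangle on paper)."""
    top, bottom = rng.randint(2, 10), rng.randint(20, 30)
    left, right = rng.randint(2, 10), rng.randint(20, 30)
    return [
        [DARK if (top <= r < bottom and left <= c < right) else LIGHT
         for c in range(size)]
        for r in range(size)
    ]


def binarize(block, threshold=128):
    """Step 2a: map grayscale pixels to 0/1, where 1 marks dark 'ink'."""
    return [[1 if px < threshold else 0 for px in row] for row in block]


def extract_features(block):
    """Step 2b: return (ink density, mean ink/background transitions per row)."""
    binary = binarize(block)
    n_rows, n_cols = len(binary), len(binary[0])
    density = sum(map(sum, binary)) / (n_rows * n_cols)
    transitions = sum(
        sum(1 for a, b in zip(row, row[1:]) if a != b) for row in binary
    ) / n_rows
    return (density, transitions)


def train_centroids(samples):
    """Step 3a: average the feature vectors of each label (nearest-centroid model)."""
    sums, counts = {}, {}
    for feats, label in samples:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in acc) for lab, acc in sums.items()}


def classify(feats, centroids):
    """Step 3b: pick the label whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda lab: sum((f - c) ** 2 for f, c in zip(feats, centroids[lab])),
    )


def labelled(rng, n):
    """Build n text and n non-text samples as (features, label) pairs."""
    data = []
    for _ in range(n):
        data.append((extract_features(make_text_block(rng)), "text"))
        data.append((extract_features(make_picture_block(rng)), "non-text"))
    return data


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = labelled(rng, 20), labelled(rng, 20)
centroids = train_centroids(train)
accuracy = sum(classify(f, centroids) == y for f, y in test) / len(test)
print(f"accuracy on synthetic blocks: {accuracy:.2f}")
```

The synthetic classes are far easier to separate than scanned gazette pages, so the accuracy printed here says nothing about the 95% reported for Random Forest on the real data. In practice the classifier would be replaced by a library implementation of the algorithms the paper compares.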
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Gomes Rocha, F., Rodriguez, G. (2020). SPEdu: A Toolbox for Processing Digitized Historical Documents. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_32
DOI: https://doi.org/10.1007/978-3-030-60887-3_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60886-6
Online ISBN: 978-3-030-60887-3
eBook Packages: Computer Science, Computer Science (R0)