SPEdu: A Toolbox for Processing Digitized Historical Documents

  • Conference paper
Advances in Computational Intelligence (MICAI 2020)

Abstract

Historical-educational documentary sources have gained considerable attention in educational contexts. However, some sources suffer from serious problems such as inadequate infrastructure, poor preservation, and a lack of qualified personnel. In addition, a large share of the documents has not been digitized, which makes research difficult. As a consequence, there is a need to transcribe, digitize, and catalog information sources so that large volumes of data can be analyzed. To address this need, we present SPEdu, a tool for digitizing the information sources required by research on the History of Education. The SPEdu workflow is divided into three steps. First, SPEdu acquires images from an information source. Second, the tool preprocesses the images and extracts features from them. Finally, a supervised machine learning module classifies images as text or non-text. To assess the viability of SPEdu, we used the Official Gazette of the State of Sergipe. For the third step, we evaluated the performance of several classification algorithms: J48, Logistic Regression, Multilayer Perceptron (MLP), Naive Bayes, Random Forest, and Random Tree. The results show that Random Forest outperformed the other techniques, with an average accuracy of 95%.
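The three-step workflow can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the acquisition step is replaced by synthetic gray-level blocks (`synthetic_block` is hypothetical), preprocessing is assumed to be Otsu binarization, the two features (ink ratio and ink/background transitions per row) are hypothetical choices, and the classifier is a small ensemble of bagged decision stumps, a much simpler stand-in for the Random Forest evaluated in the paper.

```python
import random
from collections import Counter

def otsu_threshold(pixels):
    """Gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total, sum_all = len(pixels), sum(pixels)
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def extract_features(block):
    """Binarize a 2D gray-level block; return [ink ratio, transitions per row]."""
    flat = [p for row in block for p in row]
    t = otsu_threshold(flat)
    binary = [[1 if p <= t else 0 for p in row] for row in block]
    ink = sum(map(sum, binary)) / len(flat)
    changes = sum(sum(a != b for a, b in zip(r, r[1:])) for r in binary)
    return [ink, changes / len(binary)]

def synthetic_block(kind, rng, size=16):
    """Hypothetical stand-in for image acquisition."""
    if kind == "text":
        # Thin dark strokes separated by blank lines: many transitions per row.
        return [[rng.choice((30, 220)) if r % 3 else 220 for _ in range(size)]
                for r in range(size)]
    # Non-text: one dark rectangle on a light background: few transitions.
    lo, hi = rng.randint(2, 5), rng.randint(10, 13)
    return [[40 if lo <= r < hi and lo <= c < hi else 210 for c in range(size)]
            for r in range(size)]

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y):
    """One-split tree: the (feature, threshold) with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        values = sorted({x[f] for x in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighboring feature values
            left = majority([l for x, l in zip(X, y) if x[f] <= thr])
            right = majority([l for x, l in zip(X, y) if x[f] > thr])
            errors = sum((left if x[f] <= thr else right) != l
                         for x, l in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, thr, left, right)
    return best[1:]

def train_forest(X, y, n_trees=15, seed=0):
    """Bagging: each stump is trained on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    return majority([left if x[f] <= thr else right
                     for f, thr, left, right in forest])

rng = random.Random(42)
labels = ["text", "non-text"] * 40
X = [extract_features(synthetic_block(kind, rng)) for kind in labels]
forest = train_forest(X[:60], labels[:60])
hits = sum(predict(forest, x) == l for x, l in zip(X[60:], labels[60:]))
accuracy = hits / len(labels[60:])
print(f"held-out accuracy on synthetic blocks: {accuracy:.2f}")
```

The accuracy printed here only describes the synthetic blocks and says nothing about the 95% figure reported for the Official Gazette data. In practice one would replace `synthetic_block` with real scanned regions and swap the stump ensemble for a full Random Forest implementation.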

Author information

Corresponding author

Correspondence to Guillermo Rodriguez.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Gomes Rocha, F., Rodriguez, G. (2020). SPEdu: A Toolbox for Processing Digitized Historical Documents. In: Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_32

  • DOI: https://doi.org/10.1007/978-3-030-60887-3_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60886-6

  • Online ISBN: 978-3-030-60887-3

  • eBook Packages: Computer Science, Computer Science (R0)
