Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive

Multimedia Tools and Applications

Abstract

Automated document categorization, retrieval, and recommendation have been studied extensively and have improved rapidly. These tasks are central to information retrieval and data extraction, and the organization of documents for archival storage is gradually becoming fully automated. With the rapid increase in the volume of scholarly articles, however, their categorization and indexing remain both a challenge and a real need. Automation is also needed for proper indexing and retrieval of the thousands of older scholarly articles that libraries hold as print versions. In this paper, we propose a simple and robust method for generating text handles from scanned images of scholarly articles so that they can be managed efficiently in digital archives. We also propose a Delaunay triangulation-based feature set for the associated categorization work. The proposed work is based mainly on tracking the locality of emphasized (italic) words. We primarily consider the articles’ title and reference pages when extracting the crucial information from which handles are found. Italics are detected using Principal Component Analysis (PCA), applied to a selected subset of object boundary pixels representing the vertical (column) edges. We show how efficiently the proposed method generates text handles for indexing scholarly articles.
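The abstract does not give the exact edge-selection step or decision rule, so the following is only a minimal sketch of the general idea of PCA-based slant estimation, not the authors' implementation. It assumes the vertical-edge boundary pixels of a word have already been extracted as (x, y) coordinates with y increasing upward. The function names and the 8-degree threshold are hypothetical choices made for illustration.

```python
import numpy as np


def principal_axis_angle(points):
    """Signed angle in degrees between the dominant direction of a set of
    2-D points and the vertical axis (0 means perfectly upright)."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # 2x2 covariance matrix of the x and y coordinates.
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last column of
    # eigvecs is the first principal component.
    _, eigvecs = np.linalg.eigh(cov)
    dx, dy = eigvecs[:, -1]
    # An eigenvector's sign is arbitrary; orient it to point upward.
    if dy < 0:
        dx, dy = -dx, -dy
    return float(np.degrees(np.arctan2(dx, dy)))


def looks_italic(points, threshold_deg=8.0):
    """Hypothetical decision rule: flag the stroke as italic when its
    principal axis leans more than threshold_deg away from vertical."""
    return abs(principal_axis_angle(points)) > threshold_deg


# Synthetic edge pixels: one upright stroke and one slanted by 15 degrees.
ys = np.arange(20.0)
upright = np.column_stack([np.full_like(ys, 5.0), ys])
slanted = np.column_stack([ys * np.tan(np.radians(15.0)), ys])

print(looks_italic(upright), looks_italic(slanted))
```

In a real pipeline the points would come from the boundary pixels of binarized word images, and the threshold would need tuning on labelled upright and italic samples. With image coordinates, where y grows downward, the sign of the angle flips but its magnitude does not.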

Funding

No funding was received for conducting this study.

Author information

Corresponding author

Correspondence to Sanjoy Pratihar.

Ethics declarations

Conflict of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ajij, M., Roy, D.S. & Pratihar, S. Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive. Multimed Tools Appl 82, 22373–22404 (2023). https://doi.org/10.1007/s11042-022-13974-x

  • DOI: https://doi.org/10.1007/s11042-022-13974-x
