Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive

Ajij, Md.; Roy, Diptendu Sinha; Pratihar, Sanjoy

doi:10.1007/s11042-022-13974-x

Published: 10 November 2022

Volume 82, pages 22373–22404 (2023)

Multimedia Tools and Applications

Abstract

Automated document categorization, retrieval, and recommendation have been studied extensively and have improved rapidly. These tasks are closely tied to information retrieval and data extraction, and the organization of documents for archival storage is gradually becoming fully automated. As the volume of scholarly articles grows rapidly, categorizing and indexing them remains both a challenge and a real need. Automation is also needed to index and retrieve the older scholarly articles that libraries hold, in the thousands, as print copies. In this paper, we propose a simple and robust method for generating text handles from scanned images of scholarly articles so that they can be managed efficiently in digital archives. We also propose a Delaunay triangulation based feature set for the associated categorization task. The approach rests mainly on tracking the locations of emphasized (italic) words. We primarily use the articles’ title and reference pages to extract the information needed to find handles. Italics are detected using Principal Component Analysis (PCA), applied to a selected subset of object boundary pixels that represent the vertical (column) edges. We show that the proposed method generates text handles for indexing scholarly articles efficiently.
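To make the PCA step in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binarized image of a single character stroke, takes the pixels on the stroke's left boundary as the "column edge" subset, and reads the slant from the first principal axis of their coordinates. The helper names (`make_stroke`, `left_edge_pixels`, `slant_angle_deg`, `looks_italic`) and the 8 degree threshold are hypothetical choices made for illustration; the abstract does not specify how the pixel subset is selected or how the decision is made.

```python
import numpy as np

def make_stroke(height=40, shear=0.0, width=3, base=2):
    """Synthetic test image: one stroke that shifts right by `shear` pixels
    per row going up (non-negative shear only). Illustrative helper."""
    cols = base + width + int(np.ceil(abs(shear) * height)) + 2
    img = np.zeros((height, cols), dtype=bool)
    for y in range(height):
        x0 = base + int(round(shear * (height - 1 - y)))
        img[y, x0:x0 + width] = True
    return img

def left_edge_pixels(binary):
    """(x, y) coordinates of foreground pixels whose left neighbour is background."""
    fg = np.asarray(binary, dtype=bool)
    left = np.zeros_like(fg)
    left[:, 1:] = fg[:, :-1]
    ys, xs = np.nonzero(fg & ~left)
    return np.column_stack([xs, ys]).astype(float)

def slant_angle_deg(points):
    """Angle between the first principal axis and the vertical, in degrees.
    Positive means the stroke leans right going up (image y grows downward)."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axis = eigvecs[:, np.argmax(eigvals)]
    if axis[1] > 0:          # orient the axis so it points up the page
        axis = -axis
    return float(np.degrees(np.arctan2(axis[0], -axis[1])))

def looks_italic(points, threshold_deg=8.0):
    """Hypothetical decision rule: call the stroke italic above a fixed slant."""
    return slant_angle_deg(points) > threshold_deg

if __name__ == "__main__":
    upright = left_edge_pixels(make_stroke(shear=0.0))
    slanted = left_edge_pixels(make_stroke(shear=0.25))
    print(slant_angle_deg(upright), looks_italic(upright))
    print(slant_angle_deg(slanted), looks_italic(slanted))
```

PCA is applied here to one stroke at a time because the first principal axis of a whole word's edge pixels would follow the text line rather than the stroke direction. A word-level decision would need some aggregation over strokes, which this sketch leaves out.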

Funding

No funding was received for conducting this study.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology Meghalaya, Shillong, India
Md. Ajij & Diptendu Sinha Roy
Department of Computer Science and Engineering, Indian Institute of Information Technology Kalyani, Kalyani, India
Sanjoy Pratihar

Corresponding author

Correspondence to Sanjoy Pratihar.

Ethics declarations

Conflict of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ajij, M., Roy, D.S. & Pratihar, S. Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive. Multimed Tools Appl 82, 22373–22404 (2023). https://doi.org/10.1007/s11042-022-13974-x

Received: 28 August 2020
Revised: 29 April 2022
Accepted: 12 September 2022
Published: 10 November 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s11042-022-13974-x
