Abstract
Automated document categorization, retrieval, and recommendation have been studied extensively and have improved rapidly. These tasks are closely tied to information retrieval and data extraction, and the organization of documents for archival storage is gradually becoming fully automated. Categorizing and indexing scholarly articles nevertheless remains a challenge, and a real need, as the volume of such articles grows rapidly. Automation is also needed to index and retrieve the thousands of older scholarly articles that libraries hold only as print versions. In this paper, we propose a simple and robust method for generating text handles from scanned images of scholarly articles so that they can be managed efficiently in digital archives. We also propose a Delaunay triangulation based feature set for the associated categorization task. The proposed work is based mainly on tracking the locality of emphasized (italic) words. To find handles, we primarily extract crucial information from the articles’ titles and reference pages. Italics are detected using Principal Component Analysis (PCA), applied to a selected subset of object boundary pixels that represent vertical (column) edges. We show how efficiently the proposed method generates text handles for indexing scholarly articles.
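As an illustration of the PCA step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the vertical-edge boundary pixels of a word have already been selected and are given as (column, row) pairs in image coordinates, with rows increasing downward. The function names and the 8-degree threshold are hypothetical choices made for this example only.

```python
import math


def stroke_slant_degrees(edge_pixels):
    """Estimate stroke slant from (col, row) edge pixels using 2-D PCA.

    Returns the angle between the first principal axis and the vertical,
    in degrees within [-90, 90). Positive values mean the stroke leans to
    the right (its top is further right than its bottom), as in italics.
    """
    n = len(edge_pixels)
    if n < 2:
        raise ValueError("need at least two edge pixels")

    mean_x = sum(x for x, _ in edge_pixels) / n
    mean_y = sum(y for _, y in edge_pixels) / n

    # Entries of the 2x2 covariance matrix of the pixel coordinates.
    sxx = sum((x - mean_x) ** 2 for x, _ in edge_pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in edge_pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in edge_pixels) / n

    if sxx == 0 and syy == 0:
        return 0.0  # all pixels coincide; no direction can be estimated

    # Closed-form orientation of the first principal axis of a 2x2
    # covariance matrix, measured from the column (x) axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)

    # Convert to an angle measured from the vertical and fold it
    # into the range [-90, 90).
    return (math.degrees(theta) % 180.0) - 90.0


def is_italic(edge_pixels, min_slant_deg=8.0):
    """Hypothetical decision rule: flag a word whose slant exceeds a threshold."""
    return stroke_slant_degrees(edge_pixels) >= min_slant_deg


# Synthetic strokes: one upright, one leaning right by atan(0.25) radians.
upright = [(5.0, float(r)) for r in range(21)]
slanted = [(10.0 - 0.25 * r, float(r)) for r in range(21)]
print(stroke_slant_degrees(upright), is_italic(upright))
print(stroke_slant_degrees(slanted), is_italic(slanted))
```

Restricting the input to vertical-edge pixels, as the abstract describes, keeps horizontal strokes and serifs from pulling the principal axis away from the stroke direction. In practice the threshold would need to be tuned on real scans.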
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Ajij, M., Roy, D.S. & Pratihar, S. Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive. Multimed Tools Appl 82, 22373–22404 (2023). https://doi.org/10.1007/s11042-022-13974-x
DOI: https://doi.org/10.1007/s11042-022-13974-x