8 February 2015 Cross-reference identification within a PDF document
Sida Li, Liangcai Gao, Zhi Tang, Yinyan Yu
Proceedings Volume 9402, Document Recognition and Retrieval XXII; 940209 (2015) https://doi.org/10.1117/12.2076237
Event: SPIE/IS&T Electronic Imaging, 2015, San Francisco, California, United States
Abstract
Cross-references, such as footnotes, endnotes, figure/table captions, and references, are a common and useful type of page element that further explains the corresponding entities in the target document. In this paper, we focus on cross-reference identification in a PDF document and present a robust method, using the identification of footnotes and figure references as a case study. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within the document. A number of novel features within a PDF document, namely page layout, font information, and the lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across different kinds of documents, so that the identification process adapts to the document type. In addition, the method uses results from the matching process as feedback to the identification process to further improve accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in a PDF document.
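The extract-then-match flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the PDF text has already been extracted into a list of lines, and it uses only simple lexical cues. The regular expressions `CAPTION_RE` and `REFERENCE_RE` and the function `match_figure_references` are hypothetical names introduced here for illustration; the layout features, font features, clustering, and feedback steps mentioned in the abstract are not modeled.

```python
import re

# Hypothetical lexical patterns (assumptions, not taken from the paper).
# A caption line starts with "Figure N." / "Fig. N:" followed by caption text.
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)[.:]\s*(.+)$")
# An in-text reference is any mention of "Figure N" or "Fig. N".
REFERENCE_RE = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)\b")


def match_figure_references(lines):
    """Map each captioned figure number to its caption text and citing line indices."""
    # Step 1: extract captions, remembering the line each one was found on.
    captions = {}
    for idx, line in enumerate(lines):
        m = CAPTION_RE.match(line.strip())
        if m:
            captions[int(m.group(1))] = (idx, m.group(2))

    # Step 2: match in-text references against the extracted captions.
    matches = {
        num: {"caption": text, "cited_on": []}
        for num, (_, text) in captions.items()
    }
    for idx, line in enumerate(lines):
        for m in REFERENCE_RE.finditer(line):
            num = int(m.group(1))
            # Skip the caption line itself and references without a caption.
            if num in captions and captions[num][0] != idx:
                matches[num]["cited_on"].append(idx)
    return matches


if __name__ == "__main__":
    sample = [
        "The pipeline (see Fig. 1) has two stages.",
        "Figure 1. Overview of the method",
        "Results are in Figure 2.",
        "Figure 2: Accuracy by document type",
    ]
    for number, info in sorted(match_figure_references(sample).items()):
        print(number, info["caption"], info["cited_on"])
```

In this sketch a reference with no extracted caption is simply ignored; a fuller system could use such unmatched references as a signal that a caption was missed, which is the kind of matching feedback the abstract alludes to.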
© (2015) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Sida Li, Liangcai Gao, Zhi Tang, and Yinyan Yu "Cross-reference identification within a PDF document", Proc. SPIE 9402, Document Recognition and Retrieval XXII, 940209 (8 February 2015); https://doi.org/10.1117/12.2076237
CITATIONS
Cited by 1 scholarly publication.
KEYWORDS
Feature extraction
Visualization
Data storage
Computer science
Lithium
Raster graphics
Analytical research
