Abstract
Discovering significant meta-information from document collections is a critical factor for knowledge distribution and preservation. This paper presents a system that implements intelligent document processing techniques, combining strategies for the layout analysis of electronic documents with incremental first-order learning in order to automatically classify documents and their layout components according to their semantics. An in-depth analysis of specific layout components can then extract useful information that improves semantic-based document storage and retrieval. The viability of the proposed approach is confirmed by experiments run in the real-world application domain of scientific papers.
Keywords
- Basic Block
- Topological Relation
- Inductive Logic Programming
- Electronic Document
- Portable Document Format
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N. (2005). Semantic-Based Access to Digital Document Databases. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_39
DOI: https://doi.org/10.1007/11425274_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science (R0)