Document Layout Analysis for Semantic Information Extraction

Adrian, Weronika T.; Leone, Nicola; Manna, Marco; Marte, Cinzia

doi:10.1007/978-3-319-70169-1_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10640)

Included in the following conference series:

Conference of the Italian Association for Artificial Intelligence

Abstract

Using machines to automatically extract relevant information from unstructured and semi-structured sources has practical significance in today's life and business. In this context, although understanding the meaning of words is important, identifying self-consistent geometric and logical regions of interest (blocks, cells, columns and tables, as well as paragraphs, titles and captions, to mention only a few) is of paramount importance too. This complex process is known as document layout analysis. In this work, we discuss newly designed techniques that solve this problem effectively by combining syntactic and semantic aspects of documents. The techniques described here form the basis of KnowRex, a comprehensive system for ontology-driven Information Extraction.

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, University of Calabria, 87036, Rende, Italy
Weronika T. Adrian, Nicola Leone, Marco Manna & Cinzia Marte
AGH University of Science and Technology, Krakow, Poland
Weronika T. Adrian

Corresponding author

Correspondence to Weronika T. Adrian.

Editor information

Editors and Affiliations

University of Bari, Bari, Italy
Floriana Esposito
University of Rome Tor Vergata, Rome, Italy
Roberto Basili
University of Bari, Bari, Italy
Stefano Ferilli
University of Bari, Bari, Italy
Francesca A. Lisi

About this paper

Cite this paper

Adrian, W.T., Leone, N., Manna, M., Marte, C. (2017). Document Layout Analysis for Semantic Information Extraction. In: Esposito, F., Basili, R., Ferilli, S., Lisi, F. (eds) AI*IA 2017 Advances in Artificial Intelligence. AI*IA 2017. Lecture Notes in Computer Science, vol 10640. Springer, Cham. https://doi.org/10.1007/978-3-319-70169-1_20

DOI: https://doi.org/10.1007/978-3-319-70169-1_20
Published: 07 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70168-4
Online ISBN: 978-3-319-70169-1
eBook Packages: Computer Science, Computer Science (R0)
