Document Layout Analysis for Semantic Information Extraction

  • Conference paper
AI*IA 2017 Advances in Artificial Intelligence (AI*IA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10640)

Abstract

Using machines to automatically extract relevant information from unstructured and semi-structured sources has practical significance in today's life and business. In this context, although understanding the meaning of words is important, identifying self-consistent geometric and logical regions of interest (blocks, cells, columns and tables, as well as paragraphs, titles and captions, to mention only a few) is of paramount importance too. This complex process goes under the name of document layout analysis. In this work, we discuss newly designed techniques that solve this problem effectively by combining syntactic and semantic aspects of a document. The techniques described here form the basis of KnowRex, a comprehensive system for ontology-driven Information Extraction.

Author information

Corresponding author

Correspondence to Weronika T. Adrian.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Adrian, W.T., Leone, N., Manna, M., Marte, C. (2017). Document Layout Analysis for Semantic Information Extraction. In: Esposito, F., Basili, R., Ferilli, S., Lisi, F. (eds) AI*IA 2017 Advances in Artificial Intelligence. AI*IA 2017. Lecture Notes in Computer Science, vol 10640. Springer, Cham. https://doi.org/10.1007/978-3-319-70169-1_20

  • DOI: https://doi.org/10.1007/978-3-319-70169-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70168-4

  • Online ISBN: 978-3-319-70169-1

  • eBook Packages: Computer Science (R0)
