VBSRL: A Semantic Frame-Based Approach for Data Extraction from Unstructured Business Documents

  • Conference paper
  • In: Intelligent Computing
  • Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 283)

Abstract

Defining alternative processing techniques for business documents inevitably runs up against long-standing issues that stem from the unstructured nature of most business-related information. In particular, increasingly refined methods for automated data extraction have been investigated over the years. The latest frontier in this respect is Semantic Role Labeling (SRL), which extracts relevant information purely on the basis of the overall meaning of sentences. It does so by mapping specific situations described in the text onto more general scenarios (semantic frames). FrameNet originated as a semantic frame repository built by applying SRL techniques to large textual corpora, but adapting it to languages other than English has proven difficult. In this paper, we introduce a new implementation of SRL for information extraction, called Verb-Based SRL (VBSRL). VBSRL relies on a different conceptual theory from natural language understanding, one that is language-independent and gives verbs a central role in abstracting from real-life situations.
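The idea of mapping a specific situation onto a more general semantic frame, with the verb as the entry point, can be illustrated with a minimal Python sketch. It is not the VBSRL implementation: the lexicon `VERB_FRAMES`, the frame names `TRANSFER` and `COMMIT`, the role names, and the helper `label` are all hypothetical, and the verb arguments are assumed to have been extracted already by some upstream parser.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon: verb lemma -> (frame name, ordered role names).
# Neither the frame names nor the roles are taken from the paper or FrameNet.
VERB_FRAMES = {
    "pay": ("TRANSFER", ("agent", "object", "recipient")),
    "send": ("TRANSFER", ("agent", "object", "recipient")),
    "sign": ("COMMIT", ("agent", "object")),
}


@dataclass
class FrameInstance:
    """One concrete situation expressed as a general frame with filled roles."""
    frame: str
    verb: str
    roles: dict = field(default_factory=dict)


def label(verb, arguments):
    """Map a verb lemma and its pre-extracted arguments onto a frame.

    Returns None when the verb is not in the toy lexicon.
    """
    entry = VERB_FRAMES.get(verb)
    if entry is None:
        return None
    frame, role_names = entry
    # Pair each role name with the argument in the same position.
    roles = dict(zip(role_names, arguments))
    return FrameInstance(frame=frame, verb=verb, roles=roles)


# Two different verbs describing different situations share one frame.
paid = label("pay", ["Acme Srl", "invoice 42", "Beta SpA"])
sent = label("send", ["Acme Srl", "the contract", "Beta SpA"])
```

In this sketch, "pay" and "send" both resolve to the same hypothetical frame, so a downstream extractor could read the recipient from `roles["recipient"]` regardless of which verb the document used.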

Notes

  1. https://www.tableau.com/reports/gartner

  2. https://framenet.icsi.berkeley.edu/fndrupal/

  3. See https://framenet.icsi.berkeley.edu/fndrupal/framenets_in_other_languages for a complete summary of all ongoing projects.

  4. In the following, all references and examples written in Italian are reported in italics, with the corresponding English translation in regular type.

Corresponding author

Correspondence to Simone Scannapieco.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Scannapieco, S., Ponza, A., Tomazzoli, C. (2022). VBSRL: A Semantic Frame-Based Approach for Data Extraction from Unstructured Business Documents. In: Arai, K. (ed.) Intelligent Computing. Lecture Notes in Networks and Systems, vol 283. Springer, Cham. https://doi.org/10.1007/978-3-030-80119-9_68
