Semantic Multi-concept Annotation for Tabular Data in Financial Documents

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14762))

Abstract

Tables in financial documents provide structured data for various analyses, such as assessing a company's financial health. However, their heterogeneous structures complicate data extraction and narrow the scope of such analyses. Semantic annotation addresses this problem by standardizing the meaning of tabular data, making it fully structured and machine-readable. Although previous research has explored and enhanced semantic annotation, it mainly focuses on single or hierarchical concepts within a table cell, which is insufficient for annotating financial filings. Therefore, we present a more challenging task: annotating multiple non-hierarchical concepts in financial tables. This new task requires a model to identify the different concepts that together describe a table cell. We created a dataset of 10,000 samples and benchmarked seven language models through prompting and fine-tuning. The results demonstrate that the task is challenging even for large language models, leaving room for future research.
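To make the task concrete, the following is a minimal sketch of how a single table cell carrying several non-hierarchical concepts might be represented. The class name, field names, concept labels, and the helper `missing_concepts` are all hypothetical illustrations; they are not taken from the paper's dataset or from any taxonomy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CellAnnotation:
    """One table cell plus the flat set of concepts that jointly describe it.

    All field names here are illustrative assumptions, not the paper's schema.
    """
    row_header: str
    column_header: str
    value: str
    # A flat set: the concepts are peers, with no parent/child ordering.
    concepts: frozenset[str] = field(default_factory=frozenset)


def missing_concepts(predicted: set[str], gold: frozenset[str]) -> set[str]:
    """Return the gold concepts that a model's prediction failed to include."""
    return set(gold) - set(predicted)


# Hypothetical example cell: three independent concepts describe one number.
cell = CellAnnotation(
    row_header="Revenue",
    column_header="Cloud segment, FY2023",
    value="1,250",
    concepts=frozenset({"Revenue", "Segment: Cloud", "Period: FY2023"}),
)

# A model that only recognizes the line item misses the other two concepts.
print(missing_concepts({"Revenue"}, cell.concepts))
```

The point of the sketch is only that a single-concept annotator would stop at one label, whereas the task described above expects the full set for each cell.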

Notes

  1. Available upon request.

  2. https://www.cs.ox.ac.uk/isg/challenges/sem-tab/.

  3. https://www.sec.gov.

  4. https://www.xbrl.org.

  5. https://www.sec.gov/edgar/information-for-filers/standard-taxonomies.

  6. https://openai.com.

  7. https://replicate.com/meta/llama-2-70b.

  8. https://huggingface.co/RUCAIBox/mtl-data-to-text.

Acknowledgment

This paper is partially supported by the New Energy and Industrial Technology Development Organization (NEDO).

Author information

Corresponding author

Correspondence to Rungsiman Nararatwong.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nararatwong, R., Shi, Y., Kertkeidkachorn, N., Ichise, R. (2024). Semantic Multi-concept Annotation for Tabular Data in Financial Documents. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_35

  • DOI: https://doi.org/10.1007/978-3-031-70239-6_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70238-9

  • Online ISBN: 978-3-031-70239-6

  • eBook Packages: Computer Science, Computer Science (R0)
