Abstract
Tables in financial documents provide structured data for various analyses, such as assessing a company’s financial health. However, their heterogeneous structures complicate data extraction and narrow the scope of these analyses. Semantic annotation addresses this problem by standardizing the meanings of tabular data, making it fully structured and machine-readable. Although previous research has explored and enhanced semantic annotation, it mainly focuses on singular or hierarchical concepts within a table cell, which is insufficient for annotating financial filings. Therefore, we present a more challenging task: annotating multiple non-hierarchical concepts in financial tables. This new task requires a model to identify the different concepts that describe a table cell. We created a dataset of 10,000 samples and benchmarked seven language models through prompting and fine-tuning. The results demonstrate the challenges of the task, even for large language models, and offer opportunities for future research.
Notes
- 1. Available upon request.
Acknowledgment
This paper is partially supported by the New Energy and Industrial Technology Development Organization (NEDO).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nararatwong, R., Shi, Y., Kertkeidkachorn, N., Ichise, R. (2024). Semantic Multi-concept Annotation for Tabular Data in Financial Documents. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_35
DOI: https://doi.org/10.1007/978-3-031-70239-6_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70238-9
Online ISBN: 978-3-031-70239-6
eBook Packages: Computer Science, Computer Science (R0)