Abstract
Validating OCR-extracted financial entities against a financial organization’s codified data is a challenging semantic textual similarity task because the text is short, context is limited, abbreviations and acronyms are common, and OCR introduces errors. To study this problem, we built a synthetic dataset consisting of images that contain pseudo financial entities for OCR and model-generated short names with abbreviations and acronyms for validation. Using a fine-tuned BERT model with data augmentation, we achieved strong performance on this task and compared our approach against several baseline systems.
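To make the validation task concrete, here is a minimal Python sketch of a naive string-matching check. It is not the fine-tuned BERT model described above, and it is not claimed to be one of the baselines the paper compares against. The entity names, the function names, and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def is_acronym_of(short: str, full: str) -> bool:
    """Return True if `short` equals the initials of the words in `full`."""
    initials = "".join(word[0] for word in full.split())
    return short.replace(".", "").upper() == initials.upper()


def validation_score(ocr_text: str, codified_name: str) -> float:
    """Score in [0, 1]: 1.0 for an exact acronym match, else character similarity."""
    if is_acronym_of(ocr_text, codified_name):
        return 1.0
    return SequenceMatcher(None, ocr_text.lower(), codified_name.lower()).ratio()


def validate(ocr_text: str, codified_names: list[str], threshold: float = 0.8):
    """Return (best_match, score), or (None, score) if the best score is below threshold."""
    best = max(codified_names, key=lambda name: validation_score(ocr_text, name))
    score = validation_score(ocr_text, best)
    return (best, score) if score >= threshold else (None, score)


# Hypothetical codified names, assumed only for this example.
CODIFIED = ["First National Bank", "Acme Capital Partners"]

acronym_match = validate("FNB", CODIFIED)                   # acronym case
noisy_match = validate("F1rst Nationa1 Bank", CODIFIED)     # OCR confuses i/l with 1
no_match = validate("XYZ", CODIFIED)                        # unrelated string
```

A check like this covers clean acronyms and light OCR noise, but it has no notion of meaning. It cannot handle an abbreviation that drops or reorders words, which is the kind of case that motivates a learned similarity model.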
Notes
- List of colloquial names for universities and colleges in the United States from Wikipedia, https://rb.gy/bqoklh.
Acknowledgments
We thank Ryan A. Griffin for pulling the internal data, and Victor Lo and Alec Bethune for their valuable support. We also thank all reviewers for their insightful comments.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lin, C., Xu, Y., Su, H. (2023). Visual Named Entity Validation for Short Names in Financial Domain with Fine-Tuned BERT and Data Augmentation. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-031-28076-4_4
DOI: https://doi.org/10.1007/978-3-031-28076-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28075-7
Online ISBN: 978-3-031-28076-4
eBook Packages: Intelligent Technologies and Robotics (R0)