Visual Named Entity Validation for Short Names in Financial Domain with Fine-Tuned BERT and Data Augmentation

Lin, Chen; Xu, Yourong; Su, Hui

doi:10.1007/978-3-031-28076-4_4

Conference paper
First Online: 27 February 2023

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 651)

Abstract

Validating OCR-extracted financial entities against a financial organization's codified data is a challenging semantic textual similarity task because of limited context, short text, the presence of abbreviations and acronyms, and OCR errors. To study this problem, we built a synthetic dataset consisting of images that contain pseudo financial entities for OCR and model-generated short names with abbreviations and acronyms for validation. With fine-tuned BERT and data augmentation, we achieved strong performance on this task and compared our approach with several baseline systems.
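To make the task concrete, the following is a minimal sketch of a naive string-matching check for this kind of validation. It is not the fine-tuned BERT system described in the abstract, nor necessarily one of the paper's baselines. The function names, the 0.5 threshold, and the entity names in the usage note are all hypothetical and chosen only for illustration. The sketch accepts an OCR string if it is the acronym of the codified name, or if the two share enough character bigrams to tolerate small OCR errors.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count character n-grams of the lowercased, alphanumeric-only text."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def acronym(name: str) -> str:
    """Join the first character of each word, lowercased."""
    return "".join(word[0] for word in re.findall(r"[a-z0-9]+", name.lower()))


def validate(ocr_text: str, codified_name: str, threshold: float = 0.5) -> bool:
    """Return True if the OCR text plausibly refers to the codified entity.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    compact = re.sub(r"[^a-z0-9]", "", ocr_text.lower())
    # Rule 1: the OCR text is exactly the acronym of the codified name.
    if compact and compact == acronym(codified_name):
        return True
    # Rule 2: character-bigram overlap is high enough to absorb OCR noise.
    similarity = cosine(char_ngrams(ocr_text), char_ngrams(codified_name))
    return similarity >= threshold
```

With a made-up codified name such as "Acme Capital Fund", `validate("ACF", ...)` and `validate("Acme Capitol Fund", ...)` both return True, while `validate("Zenith Bank", ...)` returns False. A heuristic like this breaks down on irregular abbreviations and colloquial short names, which is the kind of case that motivates a learned similarity model.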

Notes

1. https://github.com/Belval/TextRecognitionDataGenerator
2. https://aws.amazon.com/textract/
3. List of colloquial names for universities and colleges in the United States from Wikipedia, https://rb.gy/bqoklh

References

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
Dolan, W.B., Brockett, C.: Automatically constructing a corpus of sentential paraphrases. In: Proceedings of the Third International Workshop on Paraphrasing (IWP2005) (2005)
Hamdi, A., Jean-Caurant, A., Sidère, N., Coustaty, M., Doucet, A.: Assessing and minimizing the impact of OCR quality on named entity recognition. In: Hall, M., Merčun, T., Risse, T., Duchateau, F. (eds.) TPDL 2020. LNCS, vol. 12246, pp. 87–101. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54956-5_7
Han, M., Zhang, X., Yuan, X., Jiang, J., Yun, W., Gao, C.: A survey on the techniques, applications, and performance of short text semantic similarity. Concurrency Comput. Pract. Exp. 33(5), e5971 (2021)
Hsu, C.-W., Chang, C.-C., Lin, C.-J., et al.: A practical guide to support vector classification (2003)
Lawrie, D., Mayfield, J., Etter, D.: Building OCR/NER test collections. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4639–4646. European Language Resources Association, May 2020
Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
Nakov, P., et al.: SemEval-2017 task 3: community question answering. arXiv preprint arXiv:1912.00730 (2019)
Wibisono Prakoso, D., Abdi, A., Amrit, C.: Short text similarity measurement methods: a review. Soft Computing, pp. 1–25 (2021)
Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning, vol. 242, pp. 29–48. Citeseer (2003)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP-IJCNLP (2019)
Rodriquez, K.J., Bryant, M., Blanke, T., Luszczynska, M.: Comparison of named entity recognition tools for raw OCR text. In: Konvens, pp. 410–414 (2012)
Wang, Z., Hamza, W., Florian, R.: Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814 (2017)
Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics, pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16
Wolf, T., et al.: HuggingFace's transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

Acknowledgments

We thank Ryan A. Griffin for pulling the internal data, and Victor Lo and Alec Bethune for their valuable support. We also thank all reviewers for their insightful comments.

Author information

Authors and Affiliations

Fidelity Investments, Boston, MA, 02210, USA
Chen Lin, Yourong Xu & Hui Su

Corresponding author

Correspondence to Chen Lin.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Lin, C., Xu, Y., Su, H. (2023). Visual Named Entity Validation for Short Names in Financial Domain with Fine-Tuned BERT and Data Augmentation. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-031-28076-4_4

DOI: https://doi.org/10.1007/978-3-031-28076-4_4
Published: 27 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28075-7
Online ISBN: 978-3-031-28076-4
eBook Packages: Intelligent Technologies and Robotics (R0)