Visual Named Entity Validation for Short Names in Financial Domain with Fine-Tuned BERT and Data Augmentation

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 651)

Abstract

Validating OCR-extracted financial entities against a financial organization’s codified data is a challenging semantic textual similarity task because of limited context, short text, the presence of abbreviations and acronyms, and OCR errors. To study this problem, we built a synthetic dataset consisting of images that contain pseudo financial entities for OCR and model-generated short names with abbreviations and acronyms for validation. With fine-tuned BERT and data augmentation, we achieved strong performance on this task and compared our approach with several baseline systems.
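To make the task concrete, the following is a minimal sketch of a naive string-matching check, not the fine-tuned BERT approach described above and not necessarily one of the paper's baselines. The function names, the example entity names, and the 0.8 threshold are all assumptions chosen for illustration. It scores an OCR-extracted string against a codified name by taking the better of a character-level similarity and an acronym-level similarity.

```python
from difflib import SequenceMatcher


def acronym(name: str) -> str:
    """Build an acronym from the first letter of each whitespace-separated word."""
    return "".join(word[0] for word in name.split()).upper()


def similarity(ocr_text: str, codified_name: str) -> float:
    """Return a score in [0, 1]: the better of two character-level comparisons.

    1. OCR text vs. the full codified name (tolerates small OCR errors).
    2. OCR text vs. the acronym of the codified name (handles short names).
    """
    a = ocr_text.strip().lower()
    b = codified_name.strip().lower()
    char_score = SequenceMatcher(None, a, b).ratio()
    acro_score = SequenceMatcher(
        None, a.replace(" ", ""), acronym(codified_name).lower()
    ).ratio()
    return max(char_score, acro_score)


def validate(ocr_text: str, codified_name: str, threshold: float = 0.8) -> bool:
    """Accept the pair when the similarity reaches the (assumed) threshold."""
    return similarity(ocr_text, codified_name) >= threshold


if __name__ == "__main__":
    # Hypothetical examples: an OCR digit/letter confusion, an acronym, a mismatch.
    print(validate("Acme Capita1 Partners", "Acme Capital Partners"))
    print(validate("ACP", "Acme Capital Partners"))
    print(validate("Zenith Bank", "Acme Capital Partners"))
```

A rule of this kind handles clean acronyms and single-character OCR errors, but it has no notion of meaning, so it is likely to fail on irregular abbreviations and colloquial short names. That gap is the motivation for treating validation as a learned semantic similarity problem.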

Notes

  1. https://github.com/Belval/TextRecognitionDataGenerator.

  2. https://aws.amazon.com/textract/.

  3. List of colloquial names for universities and colleges in the United States from Wikipedia, https://rb.gy/bqoklh.

Acknowledgments

We thank Ryan A. Griffin for pulling the internal data, and Victor Lo and Alec Bethune for their valuable support. We also thank all reviewers for their insightful comments.

Author information

Corresponding author

Correspondence to Chen Lin.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lin, C., Xu, Y., Su, H. (2023). Visual Named Entity Validation for Short Names in Financial Domain with Fine-Tuned BERT and Data Augmentation. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-031-28076-4_4
