Abstract
Recently, owing to the rapid development of deep learning, especially the Transformer, many Transformer-based methods have been studied and shown to be highly effective for table recognition. However, Transformer-based models usually struggle to process big tables because of the limitations of their global attention mechanism. In this paper, we propose a local attention mechanism to address these limitations. We also present an end-to-end local attention-based model for recognizing both the table structure and the table cell content from a table image. The proposed model consists of four main components: 1) an encoder for feature extraction; and 2) three decoders, one for each of the three sub-tasks of the table recognition problem. In the experiments, we evaluate the performance of the proposed model and the effectiveness of the local attention mechanism on two large-scale datasets: PubTabNet and FinTabNet. The experimental results show that the proposed model outperforms the state-of-the-art methods on both benchmark datasets. Furthermore, we demonstrate the effectiveness of the local attention mechanism for table recognition, especially for big tables.
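The abstract does not say how the proposed local attention mechanism defines the neighbourhood each position may attend to. As an illustration only, the sketch below assumes a simple sliding-window (banded) mask, the pattern popularized by Longformer: each query attends only to keys within a fixed `window` of its own position, optionally restricted to earlier positions as an autoregressive decoder would require. The function names, the window size, and the 1D-sequence setting are all assumptions for this example, not the authors' implementation.

```python
# Illustrative sketch (assumption): sliding-window local attention versus
# global attention. This is NOT the paper's exact formulation.
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_attention_mask(n: int, window: int, causal: bool = False) -> np.ndarray:
    """Boolean (n, n) mask: query i may attend to key j iff |i - j| <= window.

    With causal=True, keys after the query are also blocked, so query i
    attends only to positions i - window .. i.
    """
    idx = np.arange(n)
    offset = idx[:, None] - idx[None, :]  # offset[i, j] = i - j
    mask = np.abs(offset) <= window
    if causal:
        mask &= offset >= 0
    return mask


def scaled_dot_product_attention(q, k, v, mask=None):
    """Standard attention; mask=None gives global attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Blocked positions get -inf, i.e. zero weight after the softmax.
        # The diagonal is always allowed, so no row is entirely -inf.
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, window = 8, 4, 2
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

    _, w_global = scaled_dot_product_attention(q, k, v)
    _, w_local = scaled_dot_product_attention(q, k, v, local_attention_mask(n, window))

    # Global attention weights all n*n pairs; local attention at most n*(2*window+1).
    print("nonzero weights, global:", np.count_nonzero(w_global))
    print("nonzero weights, local: ", np.count_nonzero(w_local))
```

Because each row of the local mask allows at most 2·window + 1 keys, the number of attended pairs grows linearly with sequence length instead of quadratically, which is one plausible reason a local scheme copes better with the long output sequences of big tables. Note that this dense NumPy version still materializes the full n×n score matrix; an efficient implementation would compute only the band.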
Acknowledgments
This work was supported by the Cross-ministerial Strategic Innovation Promotion Program (SIP) Second Phase, “Big-data and AI-enabled Cyberspace Technologies,” by the New Energy and Industrial Technology Development Organization (NEDO), and partially supported by ROIS NII Open Collaborative Research 2023 (23S0102).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ly, N.T., Takasu, A. (2023). An End-to-End Local Attention Based Model for Table Recognition. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_2
DOI: https://doi.org/10.1007/978-3-031-41679-8_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science, Computer Science (R0)