
RCAM-Transformer: A Novel Approach to Table Reconstruction Using Row-Column Attention Mechanism

  • Conference paper
  • Document Analysis Systems (DAS 2024)

Abstract

Table reconstruction, a critical task in the field of table structure recognition (TSR), plays a vital role in various domains, such as data mining, machine learning, and information retrieval. While many existing TSR methods employ transformer-based models with generally impressive performance, a gap remains in transformer models specifically designed to handle the distinct attributes of table rows and columns. Moreover, there is a lack of robust table reconstruction strategies based on object detection models. To address these issues, we introduce the Row-Column Attention Mechanism (RCAM). Combined with a transformer model and integrated with partial global attention, it forms the RCAM-Transformer, a model tailored to effectively process the unique properties of tabular data. In addition, we have developed a novel table reconstruction strategy that leverages object detection models, improving the recognition and handling of tabular data. Our experiments, conducted on the PubTables-1M and FinTabNet datasets along with our self-constructed Annual Report TableSet, not only validate the effectiveness of the RCAM but also demonstrate the improved accuracy of table reconstruction achieved with the RCAM-Transformer. These outcomes highlight the potential of the RCAM-Transformer to advance table extraction in various fields.


Notes

  1. When the i-th patch and the j-th patch are in the same row, the values of i and j, when divided by w and rounded down, are identical. Conversely, when the i-th patch and the j-th patch are in the same column, dividing i and j by w yields congruent remainders.
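In code, this membership test reduces to an integer division and a modulo. The following is a minimal sketch only, assuming the patches are numbered row-major on an h × w grid; the function name and the NumPy formulation are illustrative, not taken from the paper.

```python
import numpy as np

def row_column_mask(h, w):
    """Boolean mask over h*w row-major patches: entry (i, j) is True
    iff patches i and j share a row (same floor(i / w)) or a column
    (same remainder i % w) -- the pairs a row-column attention
    mechanism would connect."""
    idx = np.arange(h * w)
    same_row = (idx[:, None] // w) == (idx[None, :] // w)
    same_col = (idx[:, None] % w) == (idx[None, :] % w)
    return same_row | same_col
```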


Author information

Correspondence to Zezhong Guo.


A Appendix

This appendix presents the details of table reconstruction. In general, we require the output of the object detection model, namely accurate coordinates and attributes of the cells, and then proceed to the post-processing steps for table reconstruction. Several details warrant further discussion.

First, the object detection model often outputs overlapping rows or columns, even when the confidence threshold of the output is raised. Addressing this requires a suitable IoU criterion to filter the overlaps; failing to do so leaves redundant table lines in the final reconstruction results (see the sketch after this section).

Second, the Hough line transform frequently encounters interference, such as small illustrations, colored backgrounds, or closely connected characters, all of which generate noise. This calls for more careful image morphological operations, followed by a filtering step that removes detected horizontal and vertical lines accounting for less than a certain proportion of the table's total width and height.

Third, blank rows may appear after the model's recognition results are aligned with the Hough line transform. Although these do not affect the table structure, they can influence downstream tasks. To remove blank rows, we therefore fix the lines confirmed by the Hough line transform and remove the horizontal lines below them. All examples shown in Fig. 7 have undergone this blank-row processing.

Finally, when merging spanning cells, headers or non-headers can be merged selectively according to the cell attributes. However, we cannot reliably merge areas without text, such as the top-left corners of Figs. 4d, 7a, 7c, 7e, 7g, 7i, and 7k: if the model were made to recognize a blank area as a spanning cell, this category of recognition would become unstable.
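The first two filtering steps can be made concrete with a minimal sketch, not the authors' implementation. Rows and columns are treated here as 1-D intervals along their own axis, a greedy non-maximum suppression drops overlapping detections, and Hough lines shorter than a fraction of the table's extent are discarded. The function names, the 1-D IoU formulation, and the 0.5 and 0.6 thresholds are illustrative assumptions.

```python
def iou_1d(a, b):
    """IoU of two 1-D intervals (start, end). Rows overlap along y and
    columns along x, so a 1-D IoU on the relevant axis covers both cases."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def suppress_overlapping(intervals, scores, iou_thresh=0.5):
    """Greedy NMS over detected rows (or columns): keep the most
    confident interval, drop later ones overlapping a kept one."""
    order = sorted(range(len(intervals)), key=lambda k: -scores[k])
    kept = []
    for k in order:
        if all(iou_1d(intervals[k], intervals[j]) < iou_thresh for j in kept):
            kept.append(k)
    return sorted(kept)

def filter_short_lines(lines, table_w, table_h, min_ratio=0.6):
    """Drop axis-aligned Hough lines (x1, y1, x2, y2) shorter than
    min_ratio of the table's width (horizontal lines) or height
    (vertical lines); such fragments are usually noise from
    illustrations, backgrounds, or text strokes."""
    kept = []
    for x1, y1, x2, y2 in lines:
        if y1 == y2 and abs(x2 - x1) >= min_ratio * table_w:
            kept.append((x1, y1, x2, y2))
        elif x1 == x2 and abs(y2 - y1) >= min_ratio * table_h:
            kept.append((x1, y1, x2, y2))
    return kept
```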

Fig. 7. Table reconstruction results.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, Z., Zhang, Y., Chen, S., Wei, C. (2024). RCAM-Transformer: A Novel Approach to Table Reconstruction Using Row-Column Attention Mechanism. In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham. https://doi.org/10.1007/978-3-031-70442-0_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-70442-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70441-3

  • Online ISBN: 978-3-031-70442-0

  • eBook Packages: Computer Science, Computer Science (R0)
