Abstract
Table reconstruction, a critical task in the field of table structure recognition (TSR), plays a vital role in various domains such as data mining, machine learning, and information retrieval. While many existing TSR methods employ transformer-based models with generally impressive performance, a gap remains in transformer models specifically designed to handle the distinct attributes of table rows and columns. Moreover, there is a lack of robust table reconstruction strategies based on object detection models. To address these issues, we introduce the Row-Column Attention Mechanism (RCAM). When combined with a transformer model and integrated with partial global attention, it forms the RCAM-Transformer, a model tailored to effectively process the unique properties of tabular data. In addition, we have developed a novel table reconstruction strategy that leverages object detection models, improving the recognition and handling of tabular data. Our experiments on the PubTables-1M and FinTabNet datasets, along with our self-constructed Annual Report TableSet, not only validated the effectiveness of the RCAM but also demonstrated the improved accuracy of table reconstruction achieved with our RCAM-Transformer. These outcomes highlight the potential of the RCAM-Transformer to advance table extraction in various fields.
Notes
1. When the i-th patch and the j-th patch are in the same row, \(\lfloor i/w \rfloor = \lfloor j/w \rfloor\); when they are in the same column, \(i \bmod w = j \bmod w\), where \(w\) is the number of patches per row of the image grid.
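To make this indexing rule concrete, the following is a minimal sketch, assuming a grid of h rows and w patches per row and ignoring the partial global attention that the full RCAM-Transformer additionally integrates. It builds the boolean mask under which patch i may attend to patch j only when the two patches share a row or a column; the function name and NumPy formulation are illustrative, not the paper's implementation.

```python
import numpy as np

def row_column_mask(h: int, w: int) -> np.ndarray:
    """Boolean mask over h*w patches: mask[i, j] is True when patch i
    and patch j share a row (equal i // w) or a column (equal i % w)."""
    idx = np.arange(h * w)
    rows = idx // w          # row index of each patch
    cols = idx % w           # column index of each patch
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    return same_row | same_col

# Example: a 3x4 patch grid; patch 0 attends to its row (1, 2, 3)
# and its column (4, 8), plus itself.
mask = row_column_mask(3, 4)
print(np.flatnonzero(mask[0]))  # [0 1 2 3 4 8]
```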
References
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Chi, Z., Huang, H., Xu, H.D., Yu, H., Yin, W., Mao, X.L.: Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019)
Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: table structure understanding with transformers. arXiv e-prints (2022)
Zhang, Z., Zhang, J., Du, J., Wang, F.: Split, embed and merge: an accurate table structure recognizer. Pattern Recogn. 126, 108565 (2022)
Lin, W., et al.: TSRFormer: table structure recognition with transformers. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 6473–6482 (2022)
Ly, N.T., Takasu, A., Nguyen, P., Takeda, H.: Rethinking image-based table recognition using weakly supervised methods. arXiv preprint arXiv:2303.07641 (2023)
Ly, N.T., Takasu, A.: An end-to-end multi-task learning model for image-based table recognition. arXiv preprint arXiv:2303.08648 (2023)
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 572–573 (2020)
Qiao, L., et al.: LGPMA: complicated table structure recognition with local and global pyramid mask alignment. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 99–114. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_7
Raja, S., Mondal, A., Jawahar, C.V.: Table structure recognition using top-down and bottom-up cues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12373, pp. 70–86. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58604-1_5
Long, R., et al.: Parsing table structures in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 944–952 (2021)
Hashmi, K.A., Stricker, D., Liwicki, M., Afzal, M.N., Afzal, M.Z.: Guided table structure recognition through anchor optimization. IEEE Access 9, 113521–113534 (2021)
Ye, J., et al.: PingAn-VCGroup’s solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML. arXiv preprint arXiv:2105.01848 (2021)
Ma, C., Lin, W., Sun, L., Huo, Q.: Robust table detection and structure recognition from heterogeneous document images. Pattern Recogn. 133, 109006 (2023)
Smock, B., Pesala, R., Abraham, R.: PubTables-1M: towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4634–4642 (2022)
Xing, H., et al.: LORE: logical location regression network for table structure recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 2992–3000 (2023)
Long, R., et al.: LORE++: logical location regression network for table structure recognition with pre-training. arXiv preprint arXiv:2401.01522 (2024)
Deng, Y., Kanervisto, A., Ling, J., Rush, A.M.: Image-to-markup generation with coarse-to-fine attention. In: International Conference on Machine Learning, pp. 980–989. PMLR (2017)
Lysak, M., Nassar, A., Livathinos, N., Auer, C., Staar, P.: Optimized table tokenization for table structure recognition. arXiv preprint arXiv:2305.03393 (2023)
Qasim, S.R., Mahmood, H., Shafait, F.: Rethinking table recognition using graph neural networks. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 142–147. IEEE (2019)
Li, Y., Huang, Z., Yan, J., Zhou, Y., Ye, F., Liu, X.: GFTE: graph-based financial table extraction. In: Del Bimbo, A., et al. (eds.) ICPR 2021. LNCS, vol. 12662, pp. 644–658. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68790-8_50
Xue, W., Yu, B., Wang, W., Tao, D., Li, Q.: TGRNet: a table graph reconstruction network for table structure recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1295–1304 (2021)
Liu, H., Li, X., Liu, B., Jiang, D., Liu, Y., Ren, B.: Neural collaborative graph machines for table structure recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4533–4542 (2022)
Dosovitskiy, A., et al.: An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15, 11–15 (1972)
Zheng, X., Burdick, D., Popa, L., Zhong, P., Wang, N.X.R.: Global table extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. In: Winter Conference for Applications in Computer Vision (WACV) (2021)
Smock, B., Pesala, R., Abraham, R.: Aligning benchmark datasets for table structure recognition (2023)
Smock, B., Pesala, R., Abraham, R.: GriTS: grid table similarity metric for table structure recognition. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds.) ICDAR 2023. LNCS, vol. 14191, pp. 535–549. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41734-4_33
A Appendix
This appendix presents details of table reconstruction. In general, we require the output of the object detection model, that is, accurate coordinates and attributes of the cells, and then proceed to the post-processing steps for table reconstruction. Several details are worth discussing further.

One such detail is that the object detection model often outputs overlapping rows or columns, even when the confidence threshold of the output is raised. Addressing this requires a reasonably designed IoU criterion to filter the overlaps; failing to do so would leave redundant table lines in the final reconstruction results (see the first sketch below).

Another detail concerns the results of the Hough Line Transform. Interference is often encountered, such as small illustrations, colored backgrounds, or closely connected characters, all of which generate noise. This calls for well-chosen image morphological operations, followed by filtering that removes detected horizontal and vertical lines spanning less than a certain proportion of the table's total width or height (see the second sketch below).

Moreover, blank rows may appear after the model's recognition results are aligned with the Hough Line Transform. Although they do not affect the table structure, they can influence downstream tasks. To remove blank rows, we therefore lock the lines confirmed by the Hough Line Transform and remove the redundant horizontal lines below them (see the third sketch below). All examples shown in Fig. 7 have undergone this blank-row processing.

Finally, during the merging of spanning cells, headers or non-headers can be merged selectively according to the cell attributes. However, we cannot effectively merge areas without text, such as the top-left corners of Figs. 4d, 7a, 7c, 7e, 7g, 7i, and 7k; if we let the model recognize a blank area as a spanning cell, this category of recognition becomes unstable.
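As a concrete illustration of the overlap filtering, the first sketch below implements a greedy, NMS-style filter over detected row bands: rows are visited in order of decreasing confidence, and a row is suppressed when its 1-D IoU with an already-kept row exceeds a threshold. The (y1, y2, score) representation and the 0.5 threshold are illustrative assumptions, not the paper's exact design; columns are handled symmetrically along the x-axis.

```python
def interval_iou(a, b):
    """1-D IoU of two intervals a = (lo, hi), b = (lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def filter_overlapping_rows(rows, iou_thresh=0.5):
    """Greedy NMS over detected rows; each row is (y1, y2, score).
    Keeps the higher-scoring row when two detections overlap too much."""
    kept = []
    for row in sorted(rows, key=lambda r: r[2], reverse=True):
        if all(interval_iou(row[:2], k[:2]) < iou_thresh for k in kept):
            kept.append(row)
    return sorted(kept, key=lambda r: r[0])  # top-to-bottom order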
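For the Hough Line Transform step, the second sketch shows one way to realize the described pipeline with OpenCV: morphological opening with long, thin kernels isolates ruling lines, and HoughLinesP is run with a minimum line length set to a proportion of the table's width or height. The kernel sizes, thresholds, and min_len_ratio value are illustrative assumptions rather than the paper's tuned settings.

```python
import cv2
import numpy as np

def detect_table_lines(gray, min_len_ratio=0.5):
    """Detect horizontal/vertical ruling lines in a grayscale table crop,
    suppressing noise from illustrations, backgrounds, and dense text.
    Lines shorter than min_len_ratio * table width/height are dropped."""
    h, w = gray.shape
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY_INV, 15, 10)

    # Morphological opening with long, thin kernels keeps only runs of
    # ink that look like ruling lines, erasing characters and specks.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (w // 20, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, h // 20))
    h_map = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    v_map = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

    def hough_lines(line_map, min_len):
        segs = cv2.HoughLinesP(line_map, 1, np.pi / 180, threshold=100,
                               minLineLength=min_len, maxLineGap=10)
        return [] if segs is None else [s[0] for s in segs]

    # Keep only lines spanning at least min_len_ratio of the table.
    return (hough_lines(h_map, int(min_len_ratio * w)),
            hough_lines(v_map, int(min_len_ratio * h)))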
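For blank-row removal, the third sketch shows one possible post-processing pass: walk the horizontal separators top to bottom and drop any separator that would close a row band containing no recognized text, merging the blank band into the row above. The data layout is assumed, and this simplified version does not model the paper's additional step of locking the lines confirmed by the Hough Line Transform.

```python
def remove_blank_rows(h_lines, text_boxes):
    """Drop horizontal separators that would create rows with no text.

    h_lines: y-coordinates of horizontal separators (top to bottom)
    text_boxes: (x1, y1, x2, y2) boxes of recognized cell text
    """
    def band_has_text(y_top, y_bot):
        # A box falls in the band if its vertical extent intersects it.
        return any(y1 < y_bot and y2 > y_top for _, y1, _, y2 in text_boxes)

    h_lines = sorted(h_lines)
    kept = [h_lines[0]]
    for y in h_lines[1:]:
        if band_has_text(kept[-1], y):
            kept.append(y)      # the band above y contains text: keep y
        else:
            kept[-1] = y        # blank band: merge it into the row above
    return kept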