DOI: 10.1145/3573128.3604901
research-article

WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation

Published: 22 August 2023

Abstract

Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detecting the table within the page, segmenting its structure into rows, columns, and cells, and extracting information from the cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods that take document images as input and generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating it on WEATHERGOV+ at each stage to identify the effects of error propagation and the weaknesses of current methods, such as OCR errors. With this research (dataset and code available; see footnote 1), we hope to encourage new research for the processing and management of inter- and intra-document collections.
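
The stage-wise evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the flat table representation, the field names `sky` and `temp_max`, the template generator, and the simplified overlap metric are all assumptions made for illustration. It shows how a single OCR error in one cell lowers both the recognition score and the downstream summary score.

```python
from collections import Counter
from typing import Dict

# Hypothetical flat table representation: field name -> cell text.
Table = Dict[str, str]


def cell_accuracy(predicted: Table, reference: Table) -> float:
    """Fraction of reference cells whose text was recognized exactly."""
    if not reference:
        return 1.0
    correct = sum(1 for key, value in reference.items()
                  if predicted.get(key) == value)
    return correct / len(reference)


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, a simplified stand-in for
    n-gram overlap metrics (no brevity penalty, unigrams only)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    remaining = Counter(reference.lower().split())
    matched = 0
    for token in cand_tokens:
        if remaining[token] > 0:
            matched += 1
            remaining[token] -= 1
    return matched / len(cand_tokens)


def template_summary(table: Table) -> str:
    """Toy template generator standing in for a table-to-text model."""
    sky = table.get("sky", "unknown")
    temp = table.get("temp_max", "unknown")
    return f"{sky} , with a high near {temp} ."


# Ground-truth table, and a recognized table with one OCR error (5 read as S).
ground_truth = {"sky": "sunny", "temp_max": "75"}
recognized = {"sky": "sunny", "temp_max": "7S"}
human_summary = "Sunny , with a high near 75 ."

# Stage 1: recognition quality.
recognition_score = cell_accuracy(recognized, ground_truth)

# Stage 2: summary quality from clean input versus recognized input.
clean_score = unigram_precision(template_summary(ground_truth), human_summary)
noisy_score = unigram_precision(template_summary(recognized), human_summary)

print(f"cell accuracy:            {recognition_score:.3f}")
print(f"summary score (clean):    {clean_score:.3f}")
print(f"summary score (from OCR): {noisy_score:.3f}")
```

Comparing the summary score on ground-truth tables with the score on recognized tables isolates how much of the final error comes from recognition rather than from generation.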


          Published In

          DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023
          August 2023
          187 pages
          ISBN: 9798400700279
          DOI: 10.1145/3573128
          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. data-to-text generation
          2. datasets
          3. document collection management
          4. table recognition
          5. table summarization

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          DocEng '23: ACM Symposium on Document Engineering 2023
          August 22 - 25, 2023
          Limerick, Ireland

          Acceptance Rates

          DocEng '23 Paper Acceptance Rate 9 of 27 submissions, 33%;
          Overall Acceptance Rate 194 of 564 submissions, 34%
