
WikiDT: Visual-Based Table Recognition and Question Answering Dataset

  • Conference paper
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14804)


Abstract

Companies and organizations grapple with the daily burden of document processing. As manual handling is tedious and error-prone, automating this process is a significant goal. In response to this demand, research on table extraction and information extraction from scanned documents is gaining increasing traction. These extractions are fulfilled by machine learning models that require large-scale and realistic datasets for development. However, despite the clear need, acquiring a high-quality and comprehensive dataset can be costly. In this work, we introduce WikiDT, a TableVQA dataset with hierarchical labels that enable model diagnosis and can potentially benefit research on sub-tasks, e.g., table recognition. The dataset boasts a massive collection of 70,919 images paired with a diverse set of 159,905 tables, providing an extensive corpus for tackling question-answering tasks. WikiDT is created by extending existing non-synthetic QA datasets through a fully automated process with verified heuristics and manual quality inspections, thereby minimizing labeling effort and human error. A novel focus and design goal of WikiDT is answering questions that require locating the target information fragment and performing in-depth reasoning, given web-style document images. We establish baseline performance on the TableVQA, table extraction, and table retrieval tasks with recent state-of-the-art models. The results illustrate that WikiDT is not yet solved by existing models that work moderately well on other VQA tasks, and that it also introduces new challenges for table extraction.


Notes

  1. The IoU in the subscript is in \(10^{-2}\), e.g., AP\(_{75}\) is the average precision under an IoU threshold of 0.75.

  2. https://github.com/puppeteer/puppeteer.

  3. https://aws.amazon.com/textract/.

  4. The annotations were obtained before this update and therefore contain no merged cells. https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-textract-updates-tables-check-detection/.

References

  1. Intelligent document processing market size and forecast. https://www.verifiedmarketresearch.com/product/intelligent-document-processing-market/. Accessed 30 Oct 2022

  2. Biten, A.F., Litman, R., Xie, Y., Appalaraju, S., Manmatha, R.: LaTr: layout-aware transformer for scene-text VQA. In: CVPR 2022 (2022)


  3. Biten, A.F., et al.: Scene text visual question answering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4291–4301 (2019)


  4. Borchmann, Ł., et al.: Due: end-to-end document understanding benchmark. In: Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)


  5. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)


  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13


  7. Chen, W., et al.: Tabfact: a large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164 (2019)

  8. Chi, Z., Huang, H., Xu, H.-D., Yu, H., Yin, W., Mao, X.-L.: Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019)

  9. Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)


  10. Gao, L., et al.: ICDAR 2019 competition on table detection and recognition (CTDAR). In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1510–1515. IEEE (2019)


  11. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453. IEEE (2013)


  12. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)


  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  14. Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: Tapas: weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349 (2020)

  15. Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2019)


  16. Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., Keutzer, K.: Densenet: implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014)

  17. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)


  18. Kafle, K., Price, B., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656 (2018)


  19. Kembhavi, A., Seo, M., Schwenk, D., Choi, J., Farhadi, A., Hajishirzi, H.: Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4999–5007 (2017)


  20. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: Tablebank: a benchmark dataset for table detection and recognition. arXiv preprint arXiv:1903.01949 (2019)

  21. Lin, C.-Y.: Rouge: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)


  22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  23. Manmadhan, S., Kovoor, B.C.: Visual question answering: a state-of-the-art review. Artif. Intell. Rev. 53(8), 5705–5745 (2020)


  24. Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.V.: Infographicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1697–1706 (2022)


  25. Mathew, M., Tito, R., Karatzas, D., Manmatha, R., Jawahar, C.V.: Document visual question answering challenge 2020 (2020)


  26. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 947–952. IEEE (2019)


  27. Monroe, W., Wang, Y.: Dependency parsing features for semantic parsing (2014)


  28. Paliwal, S.S., Vishwanath, D., Rahul, R., Sharma, M., Vig, L.: Tablenet: deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 128–133. IEEE (2019)


  29. Pan, F., Canim, M., Glass, M., Gliozzo, A., Fox, P.: CLTR: an end-to-end, transformer-based system for cell level table retrieval and table question answering. arXiv preprint arXiv:2106.04441 (2021)

  30. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)


  31. Pasupat, P., Liang, P.: Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305 (2015)

  32. Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: Cascadetabnet: an approach for end to end table detection and structure recognition from image-based documents (2020)


  33. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)


  34. Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M.: Okapi at TREC-3. In: Overview of the Third Text REtrieval Conference (TREC-3), pp. 109–126. NIST, Gaithersburg, MD (1995)


  35. Shi, H., Gao, S., Tian, Y., Chen, X., Zhao, J.: Learning bounded context-free-grammar via LSTM and the transformer: difference and the explanations. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8267–8276 (2022)


  36. Shi, H., Gu, Y., Zhou, Y., Zhao, B., Gao, S., Zhao, J.: Every preference changes differently: neural multi-interest preference model with temporal dynamics for recommendation. arXiv preprint arXiv:2207.06652 (2022)

  37. Shi, H., Zhang, Y.: Deep symbolic superoptimization without human knowledge. In: ICLR 2020 (2020)


  38. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  39. Singh, A., et al.: Towards VQA models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)


  40. Smock, B., Pesala, R., Abraham, R.: PubTables-1M: towards comprehensive table extraction from unstructured documents. arXiv preprint arXiv:2110.00061 (2021)

  41. Tanaka, R., Nishida, K., Yoshida, S.: Visualmrc: machine reading comprehension on document images. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13878–13888 (2021)


  42. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)


  43. Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)


  44. Wang, X., et al.: On the general value of evidence, and bilingual scene-text visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)


  45. Wei, J., et al.: Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022)

  46. Zhong, V., Xiong, C., Socher, R.: Seq2sql: generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017)

  47. Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683 (2019)


Acknowledgement

Research reported in this publication was supported by an Amazon Research Award, Fall 2022 CFP, and Amazon Post-Internship Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of Amazon.

Author information

Correspondence to Hui Shi.

Appendices

A Data Acquisition and Processing

1.1 A.1 Overview

WikiDT is created from Wikipedia, a public online encyclopedia created and updated by its users. We are only interested in those Wikipedia web pages with tables, for the purposes of table extraction and table-based question answering. Overall, the data acquisition process takes the Wikipedia URLs from existing datasets, namely WikiTableQuestions [31], TabFact [7], and WikiSQL [46], renders the web pages to images while annotating the tables, and then re-connects the rendered images and table annotations to the existing question-answer pairs.

The basic annotations in WikiDT are {image, question, answer} triplets, which are the basic input and output of end-to-end visual-based table question answering models. Aware of the challenges, WikiDT also provides auxiliary annotations, including: ground-truth table annotations (web tables); Textract table recognition and OCR annotations; QA target-table labels based on both the Textract tables and the web tables; and executable SQL queries for the majority of QA samples.

Although the task may seem straightforward, the creation of the dataset is fraught with numerous challenges. In particular, the web content has changed dramatically since the original WikiSQL dataset was created. Additionally, the data processing in the WikiSQL dataset is unknown and irreversible, which makes it hard to re-connect the tables to the question-answer pairs. We detail the approaches to overcoming these challenges below.

1.2 A.2 Image Rendering and Pagination

We leverage Puppeteer (Footnote 2) to render screenshots from the URLs. The images rendered in continuous-scroll mode generally have large heights that are extremely hard for popular visual networks to handle. Thus, the original full-page renderings are paginated into several sub-pages.

URLs. TabFact provides the URLs to its own tables and to those of WikiSQL, and WikiTableQuestions also provides the URLs to its tables. Although all the URLs point to Wikipedia pages, the TabFact URLs consist of the Wikipedia domain followed by the article title, whose content changes drastically over the years, while WikiTableQuestions provides URLs with archived version IDs, which point to the exact content at the time the dataset was created. So, as the first step, we search the Wikipedia editing history of each TabFact page and find the version closest to when the WikiSQL dataset was created. Then both the recovered TabFact pages and the WikiTableQuestions pages are rendered.

Renderer Setting. The width of the page is set to 1600 pixels for a better table layout (as compared in Fig. 9). Fewer than 0.5% of pages have minimum page-width requirements larger than 1600 pixels, mainly owing to extremely wide tables; these are rendered at their minimum page width.
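The rendering step can be reproduced with any headless-browser API. The following is a minimal sketch using pyppeteer, a Python port of the Node Puppeteer library used in the paper (Footnote 2); the output path, initial viewport height, and wait condition are illustrative assumptions.

```python
import asyncio
from pyppeteer import launch

async def render_fullpage(url: str, out_path: str, width: int = 1600):
    """Render a Wikipedia page as a full-page screenshot at a fixed width."""
    browser = await launch(headless=True)
    page = await browser.newPage()
    # Height is only the initial viewport; `fullPage` captures the whole scroll.
    await page.setViewport({'width': width, 'height': 1200})
    await page.goto(url, {'waitUntil': 'networkidle2'})
    await page.screenshot({'path': out_path, 'fullPage': True})
    await browser.close()

# Example with a hypothetical URL:
# asyncio.get_event_loop().run_until_complete(
#     render_fullpage('https://en.wikipedia.org/wiki/Example', 'fullpage.png'))
```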

Pagination. The extra-long full pages cause significant trouble for downstream tasks. For example, the typical width-to-height ratio of the input layer of common visual models is 1:1, as in VGG [38], ResNet [13], and other backbone models trained on ImageNet. Therefore, pagination is critical for the usability of the dataset. Fixed-height segmentation may divide a single table across multiple subpages. Thus we use a dynamic-height pagination strategy: first, blank lines are detected via a sliding window; concretely, if the color deviation of a W x H window is 0, we consider this window a blank line. W is set to 1200 to ignore the navigation bar on the left of the pages, and H is set to 10. Then a subset of blank lines is selected such that the width-to-height ratio of every segment, except the last one, is close to 1:1.
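A minimal sketch of this dynamic-height pagination, assuming the full-page screenshot is loaded as an RGB array. The window sizes mirror the W = 1200, H = 10 setting; taking the rightmost 1200 pixels to skip the left navigation bar and the greedy 1:1 cut rule are simplifying assumptions.

```python
import numpy as np
from PIL import Image

def find_blank_lines(img: np.ndarray, w: int = 1200, h: int = 10) -> list[int]:
    """Return y-offsets whose W x H window has zero color deviation (blank lines)."""
    blanks = []
    for y in range(0, img.shape[0] - h, h):
        window = img[y:y + h, -w:]      # rightmost 1200 px: skip the left nav bar
        if window.std() == 0:           # perfectly uniform color -> blank line
            blanks.append(y)
    return blanks

def paginate(img: np.ndarray, blanks: list[int]) -> list[tuple[int, int]]:
    """Greedily cut at blank lines so each segment is roughly square (width:height ~ 1:1)."""
    width = img.shape[1]
    segments, start = [], 0
    for y in blanks:
        if y - start >= width:          # segment height has reached the page width
            segments.append((start, y))
            start = y
    segments.append((start, img.shape[0]))  # the last segment may be shorter
    return segments

page = np.asarray(Image.open('fullpage.png').convert('RGB'))
cuts = paginate(page, find_blank_lines(page))
```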

1.3 A.3 Table Annotation

The table annotations are derived from HTML table tags, e.g., \(\langle \)table\(\rangle \) corresponds to a table and \(\langle \)tr\(\rangle \) to a table row. The pseudo-code for processing the HTML table annotations with Puppeteer is summarized below. The Puppeteer API returns a valid bounding box for visible elements and NULL for invisible ones; the annotation process keeps only visible tables and table elements. Additionally, merged table cells are annotated with non-empty row_span and col_span values.

Pseudo-code listings: figure a and figure b.
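A Python sketch of the same annotation loop, written with pyppeteer for illustration; `page` is assumed to be an already-loaded page object, and the CSS selectors and span-attribute extraction are assumptions consistent with the description above (visible elements only, row_span/col_span for merged cells).

```python
async def annotate_tables(page):
    """Collect bounding boxes for visible tables, rows, and cells on a rendered page."""
    annotations = []
    for table in await page.querySelectorAll('table'):
        table_box = await table.boundingBox()     # None for invisible elements
        if table_box is None:
            continue                              # keep only visible tables
        rows = []
        for tr in await table.querySelectorAll('tr'):
            cells = []
            for cell in await tr.querySelectorAll('th, td'):
                box = await cell.boundingBox()
                if box is None:
                    continue                      # keep only visible cells
                row_span = await page.evaluate('(el) => el.rowSpan', cell)
                col_span = await page.evaluate('(el) => el.colSpan', cell)
                cells.append({'box': box, 'row_span': row_span, 'col_span': col_span})
            rows.append(cells)
        annotations.append({'box': table_box, 'rows': rows})
    return annotations
```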

Web Table Filtering. Editors on Wikipedia can use the \(\langle \)table\(\rangle \) tag to express a broader range of content (e.g., a legend, as in Fig. 11). We remove such false tables with the criterion that a true table should have at least two rows and two columns. Moreover, nested tables are possible in web pages although they are rarely seen in other document formats. We notice that some web pages in our dataset contain nested tables (e.g., Fig. 7), where the outer tables are usually used to produce a certain layout. Therefore, we detect nested tables by their bounding boxes and keep only the innermost tables.
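A sketch of this filtering step under simple assumptions: each table annotation carries a bounding-box dict like the one collected above, and nesting is tested purely geometrically.

```python
def is_inside(inner: dict, outer: dict, tol: float = 1.0) -> bool:
    """True if the `inner` bounding box lies inside `outer` (with a small tolerance)."""
    return (inner['x'] >= outer['x'] - tol and
            inner['y'] >= outer['y'] - tol and
            inner['x'] + inner['width'] <= outer['x'] + outer['width'] + tol and
            inner['y'] + inner['height'] <= outer['y'] + outer['height'] + tol)

def filter_tables(tables: list[dict]) -> list[dict]:
    """Drop false tables (< 2 rows or < 2 columns) and outer tables of nested pairs."""
    kept = [t for t in tables
            if len(t['rows']) >= 2 and max(len(r) for r in t['rows']) >= 2]
    # A table is an "outer" table if some other kept table lies inside it.
    return [t for t in kept
            if not any(o is not t and is_inside(o['box'], t['box']) for o in kept)]
```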

Fig. 7. Nested tables: the outer table annotations are removed.

Fig. 8. Data acquisition for the WikiDT table extraction task.

Fig. 9. Page and table layout at different page widths.

Creating the Table Extraction Task. Figure 8 summarizes the data acquisition and processing for the WikiDT table extraction dataset. Following the convention in PubTables-1M, we divide table extraction into table detection, which predicts the table bounding boxes on a subpage, and table structure recognition, which predicts the inner structure of a table from the table crop. The table extraction data are labeled in Pascal VOC format.
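For reference, a minimal writer for a Pascal VOC style annotation file; the class names (e.g., `table`) and the exact fields kept here are illustrative assumptions rather than the released file layout.

```python
import xml.etree.ElementTree as ET

def write_voc(filename: str, width: int, height: int, boxes, out_xml: str):
    """Write object boxes [(label, xmin, ymin, xmax, ymax), ...] as a Pascal VOC XML file."""
    ann = ET.Element('annotation')
    ET.SubElement(ann, 'filename').text = filename
    size = ET.SubElement(ann, 'size')
    ET.SubElement(size, 'width').text = str(width)
    ET.SubElement(size, 'height').text = str(height)
    ET.SubElement(size, 'depth').text = '3'
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(ann, 'object')
        ET.SubElement(obj, 'name').text = label
        bnd = ET.SubElement(obj, 'bndbox')
        for tag, val in zip(('xmin', 'ymin', 'xmax', 'ymax'), (xmin, ymin, xmax, ymax)):
            ET.SubElement(bnd, tag).text = str(int(val))
    ET.ElementTree(ann).write(out_xml)

# e.g. write_voc('subpage_001.png', 1600, 1650,
#                [('table', 120, 300, 1480, 900)], 'subpage_001.xml')
```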

1.4 A.4 OCR and Textract Table Recognition

Realistically, in the Document VQA setting, ground-truth OCR annotations are unavailable. Instead, datasets usually provide OCR results from off-the-shelf systems. Following the same principle, OCR and table recognition outputs from a publicly available system are provided along with the images for the Document VQA task. We used the AWS Textract service to acquire the OCR and table extraction results on the subpages. In practice, we also found that the quality of the Textract results on the subpages is substantially better than on full pages (Footnotes 3 and 4).
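A minimal sketch of the Textract call for a single subpage image using boto3; the response parsing shown here only keeps the table and cell blocks and is simplified relative to the actual annotation pipeline.

```python
import boto3

def textract_tables(image_path: str):
    """Run Textract table analysis on one subpage image and keep TABLE/CELL blocks."""
    client = boto3.client('textract')
    with open(image_path, 'rb') as f:
        response = client.analyze_document(
            Document={'Bytes': f.read()},
            FeatureTypes=['TABLES'],          # the response also contains WORD/LINE OCR blocks
        )
    blocks = response['Blocks']
    tables = [b for b in blocks if b['BlockType'] == 'TABLE']
    cells = [b for b in blocks if b['BlockType'] == 'CELL']
    return tables, cells
```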

1.5 A.5 QA Table Processing

Inspecting the WikiSQL and WikiTableQuestions samples alongside the web pages, we make two observations: 1) the tables associated with the QA pairs are never the information tables shown at the upper right of the page, nor the reference tables at the end of the page; 2) the tables associated with the QA pairs never have only one content row. Therefore, we remove such tables from the QA task.

Furthermore, the table contents are normalized in the following ways. First, we remove JavaScript code that is extracted as text by Puppeteer, detected by regular-expression matching against keywords (e.g., mw-parser-output, navbar). Then, if any cell text is longer than 200 characters, it is replaced with a special token \(\langle {TLDR}\rangle \). Lastly, the string normalization method adopted in WikiTableQuestions is applied to each table cell text.
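A sketch of the cell-level normalization under stated assumptions: the keyword list is limited to the two mentioned above, cells matching them are dropped entirely, and the WikiTableQuestions normalizer is stood in for by a trivial placeholder.

```python
import re

JS_RESIDUE = re.compile(r'mw-parser-output|navbar', re.IGNORECASE)
MAX_CELL_LEN = 200

def normalize_cell(text: str, wtq_normalize=lambda s: s.strip().lower()) -> str:
    """Drop JavaScript residue, truncate very long cells, then apply WTQ-style normalization."""
    if JS_RESIDUE.search(text):
        return ''                      # JavaScript/CSS text leaked into the cell
    if len(text) > MAX_CELL_LEN:
        return '<TLDR>'                # special token for overly long cells
    return wtq_normalize(text)         # placeholder for the WikiTableQuestions normalizer
```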

Beyond table filtering and content normalization, table headers that contain more than one row are flattened. If the table headers are row headers and multi-level, they are transformed into the level1.level2... format, except for levels where the text of all the cells is the same, i.e., a uniform level value. If a table has only a column header, the table is transposed. If a table is 2D, only the row header is considered a header.
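A sketch of the multi-level header flattening under the stated rules; `header_rows` is assumed to be a list of header rows (top to bottom), and the uniform-level check drops any level whose text is identical across all columns.

```python
def flatten_headers(header_rows: list[list[str]]) -> list[str]:
    """Flatten multi-row headers into 'level1.level2...' names, skipping uniform levels."""
    n_cols = max(len(r) for r in header_rows)
    # Pad ragged rows, then drop levels whose text is the same for every column.
    padded = [r + [''] * (n_cols - len(r)) for r in header_rows]
    informative = [row for row in padded if len(set(row)) > 1]
    if not informative:                     # all levels uniform: keep the last one
        informative = [padded[-1]]
    return ['.'.join(filter(None, col)) for col in zip(*informative)]

# flatten_headers([['Medals', 'Medals', 'Year'], ['Gold', 'Silver', 'Year']])
# -> ['Medals.Gold', 'Medals.Silver', 'Year.Year']   (simplified; the real rule may differ)
```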

1.6 A.6 Table Retrieval Labels

Table retrieval is an essential step when there are many tables on a page. The table retrieval labels are generated automatically, primarily by retrieving the table with the highest recall, using the reference tables and answers from the original datasets (i.e., WikiTableQuestions and WikiSQL) as queries. The process is illustrated in Fig. 10. Note that the retrieval-label generation process does not provide any way to cheat on the table retrieval task: label generation is based on the reference table, which is not included in the WikiDT dataset and should be assumed unknown to the model, and on the answer, which is also unknown to the retrieval model. A table retrieval model built on WikiDT should use only the question as the query.

Fig. 10. Data acquisition and processing for the WikiDT table retrieval sub-task.

Algorithm 1 shows the table-retrieval label generation process. The reference tables and answers from WikiSQL and WikiTableQuestions serve as queries, and the candidate tables are regarded as keys. We record both the table retrieved by best recall against the reference table and the one retrieved by best recall against the reference answers. In most cases (74%), the two results agree. For the QA experiments, we use the best-recall-to-reference-table results, since the reference answers may not appear in the table owing to aggregation functions.

figure c
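As a companion to Algorithm 1, a minimal sketch of the recall-based matching it describes; representing each table as a set of normalized cell strings and breaking ties arbitrarily are assumptions made for illustration.

```python
def recall(reference: set[str], candidate: set[str]) -> float:
    """Fraction of reference items that appear in the candidate table."""
    return len(reference & candidate) / len(reference) if reference else 0.0

def retrieve_label(candidates: list[set[str]], reference_table: set[str],
                   reference_answers: set[str]) -> tuple[int, int]:
    """Indices of the candidates with best recall w.r.t. the reference table and the answers."""
    by_table = max(range(len(candidates)),
                   key=lambda i: recall(reference_table, candidates[i]))
    by_answer = max(range(len(candidates)),
                    key=lambda i: recall(reference_answers, candidates[i]))
    return by_table, by_answer   # the two agree on roughly 74% of the samples
```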
Fig. 11. False table (upper left) and true table (lower).

1.7 A.7 Answer Generation

Inspecting the table retrieval results from the web table annotations, we found that the contents of tables in the WikiSQL dataset differ substantially from the contents on the images, presumably due to web content updates and post-processing during the creation of the WikiSQL dataset. Therefore, we translated the SQL query from the WikiSQL dataset and executed it on the retrieved web table in WikiDT to generate the answer to each question, as shown in Fig. 12. The tables in WikiTableQuestions, on the other hand, do not have this issue, so their answers remain valid given the images and the questions.
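A sketch of this answer regeneration step: the retrieved web table is loaded into an in-memory SQLite database and the translated WikiSQL query is executed against it. Column-name translation and type handling are omitted, and the table name `t` and the example query are illustrative assumptions.

```python
import sqlite3
import pandas as pd

def execute_on_table(table: pd.DataFrame, sql: str) -> list:
    """Execute a translated WikiSQL query on the retrieved web table to obtain the answer."""
    with sqlite3.connect(':memory:') as conn:
        table.to_sql('t', conn, index=False)        # the query must reference table `t`
        return [row[0] for row in conn.execute(sql).fetchall()]

# Illustrative example:
# df = pd.DataFrame({'Player': ['A', 'B'], 'Goals': [3, 5]})
# execute_on_table(df, 'SELECT Player FROM t WHERE Goals = 5')  # -> ['B']
```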

Fig. 12. QA pair generation for the WikiSQL dataset from the acquired images and tables.

B Compare to Existing VQA Datasets

Table 2 compares the input modality and task formulation of notable VQA datasets, and concrete examples of VQA samples are shown in Fig. 13. It can be clearly seen that WikiDT provides a unique type of QA task and is highly challenging owing to the abundance of context information.

Fig. 13. Illustration of VQA datasets.

C Dataset Description and Analysis

1.1 C.1 Tasks and Datasets

WikiDT-Detection Dataset. Using the pagination results, we take the subpage images and the table bounding boxes, ignoring table structure and content, as a table detection dataset. Images with no tables are discarded.

WikiDT-Structure Dataset. Independently of the pagination, each table region is cropped from the full-page images to form the image inputs for the table structure recognition task. The bounding-box coordinates are translated from full-page coordinates to cropped-image coordinates. The following types of structural objects are annotated directly from the raw web table annotations: table body rows, header rows, table cells, and merged cells. Since HTML has no table column tag while columns are essential labels for many existing table extraction models (e.g., TableNet [28]), we compute the column bounding boxes and include them in the annotations as well.
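A sketch of how column boxes can be derived from the cell annotations, assuming each cell record carries a column index and a pixel box; merged cells spanning several columns are skipped here for simplicity.

```python
from collections import defaultdict

def column_boxes(cells: list[dict]) -> dict[int, tuple[float, float, float, float]]:
    """Union the boxes of all cells sharing a column index into one column bounding box."""
    cols = defaultdict(list)
    for c in cells:
        if c.get('col_span', 1) == 1:               # skip merged cells for simplicity
            cols[c['col_index']].append(c['box'])   # box = {'x', 'y', 'width', 'height'}
    boxes = {}
    for idx, bs in cols.items():
        x0 = min(b['x'] for b in bs)
        y0 = min(b['y'] for b in bs)
        x1 = max(b['x'] + b['width'] for b in bs)
        y1 = max(b['y'] + b['height'] for b in bs)
        boxes[idx] = (x0, y0, x1, y1)
    return boxes
```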

WikiDT-TableVQA Dataset. The question answering task is to predict the answer to a question given the full-page image in an end-to-end manner. Meanwhile, intermediate labels along the reasoning chain are available for breaking down and diagnosing models. The intermediate labels include table annotations, table retrieval labels, and, for most samples, SQL queries. The OCR results from Textract are highly accurate and can be leveraged directly to fit existing document VQA or scene-text VQA frameworks.

1.2 C.2 Data Analysis

Images and Subpages. Figure 14 shows the image height distribution before (full page) and after (subpage) the pagination step. Pagination greatly reduces the average image height and eliminates most extremely long pages. The outliers that still have large image heights may contain long tables that the pagination module cannot split. Figure 15 shows that the number of tables per image decreases after pagination.

Fig. 14. Image height distributions (x-axis in \(10^3\)).

Page Layout. Compared to existing table recognition datasets created from PDFs, the detection task in WikiDT is challenging owing to diverse table shapes and positions, as illustrated in Fig. 17. PubTables-1M images are single- or double-column PDFs, where the table positions and widths, especially in double-column documents, are highly similar. In contrast, table location and shape are much more arbitrary in WikiDT. Figure 16 plots the distribution of table bounding boxes for both datasets. PubTables-1M shows a few spikes indicating the few locations that are highly likely to be table borders, while the distribution in WikiDT is almost uniform, indicating diversity in table location. Moreover, WikiDT contains list-like text blocks that can easily be mis-detected as tables.

Fig. 15. Number of tables on a single image.

Fig. 16. Table location distribution.

Table 10. Textract table recognition accuracy.
Fig. 17. Table detection samples from PubTables-1M (upper) and WikiDT (lower).

Table Styles. The table structure recognition task features diverse table styles, as shown in Fig. 18.

Qualitative Annotation Results from Textract. AWS Textract gives overall high-quality table recognition results, but there are still incorrect recognition cases, as shown in Fig. 19. In the experiment section, we compare two approaches: one feeds the Textract table recognition results into the TAPAS model, and the other uses the OCR boxes. Although the TAPAS model has better overall performance, it relies on correct recognition results and cannot learn to rectify table recognition errors, like the one in Fig. 19 (4), where two columns are recognized as one.

Fig. 18. Various types of table headers.

Top Answer Items. Table 11 shows the most frequent numeric and text answers. Usually, the numeric answers are produced by aggregations, for instance, counting the number of rows that satisfy certain criteria or computing the difference between two values (see Fig. 22 for an example). Therefore, understanding the reasoning implied by the questions is critical for solving the QA task.

Table 11. Top answer items from WikiDT-TableQA/VQA task.

D Additional Table Recognition Results

Figures 19, 20 and 21 illustrate qualitative table detection and table recognition results. Generally, the DETR model and Textract generate reliable results, although, as seen in Table 8, table extraction performance still has room for improvement.

Fig. 19. Textract recognition results. Red boxes show OCR tokens, olive boxes show table structure. (Color figure online)

Fig. 20. DETR prediction compared with ground truth.

Fig. 21. Table extraction result comparison.

Fig. 22. TAPAS prediction examples (left: correct, right: wrong). Light blocks show the column selection result and the darker blocks show the cell selection.

E TableVQA Additional Results

Figure 22 shows cases where TAPAS makes correct and incorrect predictions. Notice that in the incorrect case, the question requires multi-step aggregation: the QA system needs to count the number of ships wrecked in Lake Huron and in Lake Erie separately, then compute the difference. However, by design, TAPAS can only solve questions with at most one aggregation.

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shi, H., Xie, Y., Goncalves, L., Gao, S., Zhao, J. (2024). WikiDT: Visual-Based Table Recognition and Question Answering Dataset. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14804. Springer, Cham. https://doi.org/10.1007/978-3-031-70533-5_24


  • DOI: https://doi.org/10.1007/978-3-031-70533-5_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70532-8

  • Online ISBN: 978-3-031-70533-5

  • eBook Packages: Computer Science (R0)
