TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Desai, Harsh; Kayal, Pratik; Singh, Mayank

doi:10.1007/978-3-030-86331-9_36

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Information Extraction (IE) from tables in scientific articles is challenging because of complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail even on simple table images. Finally, we experiment with an existing transformer-based baseline and report its performance scores. In contrast to static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.

Notes

1. https://github.com/camelot-dev/camelot
2. https://github.com/chezou/tabula-py
3. https://github.com/jsvine/pdfplumber
4. https://www.adobe.com/devnet/acrobat/overview.html
5. http://arxiv.org/
6. https://pubmed.ncbi.nlm.nih.gov/
7. https://www.overleaf.com/learn/latex/font_typefaces
8. https://github.com/emcconville/wand
9. We use ‘400’ pixels as an experimental number.
10. https://github.com/jitsi/jiwer

Acknowledgment

This work was supported by The Science and Engineering Research Board (SERB), under sanction number ECR/2018/000087.

Author information

Authors and Affiliations

Indian Institute of Technology, Gandhinagar, India
Harsh Desai, Pratik Kayal & Mayank Singh

Editor information

Editors and Affiliations

Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Lladós
Lehigh University, Bethlehem, PA, USA
Daniel Lopresti
Kyushu University, Fukuoka-shi, Japan
Seiichi Uchida

About this paper

Cite this paper

Desai, H., Kayal, P., Singh, M. (2021). TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_36

DOI: https://doi.org/10.1007/978-3-030-86331-9_36
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86330-2
Online ISBN: 978-3-030-86331-9
eBook Packages: Computer Science, Computer Science (R0)
