TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)

Abstract

Information Extraction (IE) from tables in scientific articles is challenging because of complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail even on simple table images. Finally, we report performance scores for an existing transformer-based baseline. In contrast to static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.
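
To illustrate the difference between the structure and content tasks described above, here is a minimal sketch in Python. It is not the authors' preprocessing: the function name `split_structure_and_content`, the `CELL` placeholder token, and the restriction to simple `tabular` bodies (rows ending in `\\`, cells separated by `&`, optional `\hline` rules) are all assumptions made for illustration.

```python
def split_structure_and_content(latex_table: str):
    """Split a simple LaTeX tabular body into structure tokens and cell text.

    Hypothetical helper for illustration only; it handles just "&", "\\\\"
    and "\\hline", and ignores multicolumn cells, nested braces, and so on.
    """
    structure, content = [], []
    # Rows are terminated by a double backslash.
    for row in latex_table.strip().split("\\\\"):
        row = row.strip()
        # Leading rule commands belong to the structure, not the content.
        while row.startswith("\\hline"):
            structure.append("\\hline")
            row = row[len("\\hline"):].strip()
        if not row:
            continue
        cells = [cell.strip() for cell in row.split("&")]
        for index, cell in enumerate(cells):
            if index > 0:
                structure.append("&")
            structure.append("CELL")  # placeholder standing in for the cell text
            content.append(cell)
        structure.append("\\\\")
    return structure, content


source = r"\hline Model & Acc \\ \hline A & 0.9 \\ \hline"
structure, content = split_structure_and_content(source)
print(" ".join(structure))
print(content)
```

Under these assumptions, the structure sequence keeps only the layout markup with a placeholder per cell, while the content list keeps only the cell text, which mirrors the two extraction targets at a toy scale.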

Notes

  1. https://github.com/camelot-dev/camelot

  2. https://github.com/chezou/tabula-py

  3. https://github.com/jsvine/pdfplumber

  4. https://www.adobe.com/devnet/acrobat/overview.html

  5. http://arxiv.org/

  6. https://pubmed.ncbi.nlm.nih.gov/

  7. https://www.overleaf.com/learn/latex/font_typefaces

  8. https://github.com/emcconville/wand

  9. We use ‘400’ pixels as an experimental number.

  10. https://github.com/jitsi/jiwer

Acknowledgment

This work was supported by The Science and Engineering Research Board (SERB), under sanction number ECR/2018/000087.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Desai, H., Kayal, P., Singh, M. (2021). TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_36

  • DOI: https://doi.org/10.1007/978-3-030-86331-9_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9

  • eBook Packages: Computer Science, Computer Science (R0)
