SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs

Darmanović, Filip; Hanbury, Allan; Zlabinger, Markus

doi:10.1007/978-3-031-41676-7_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14187)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Extracting figures and similar visual elements from PDFs of scientific publications is important but non-trivial, and progress is impeded by a lack of datasets for evaluation and machine learning. In this work, we describe and publish the SCI-3000 dataset, containing 3 000 PDFs of scientific publications (34 791 pages) with annotations of figures, tables, and corresponding captions, from the fields of computer science, biomedicine, chemistry, physics, and technology. We demonstrate the use of the dataset to benchmark two figure, table, and caption extraction approaches from recent literature: one rule-based and one deep learning-based.

Notes

1. DOI: 10.5281/zenodo.6564971
2. https://github.com/allenai/deepfigures-open, accessed on 15.09.2021
3. https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist, accessed on 15.09.2021
4. https://pypi.org/project/sci-annot-eval
5. https://doaj.org/
6. https://www.loc.gov/catdir/cpso/lcco/
7. https://poppler.freedesktop.org/, accessed on 24.04.2023
8. DOI: 10.5281/zenodo.7878627
9. DOI: 10.5281/zenodo.7878638
10. https://doi.org/10.34726/hss.2022.94800
11. https://github.com/allenai/pdffigures2, accessed on 15.05.2022
12. https://github.com/allenai/deepfigures-open, accessed on 15.05.2022

Author information

Authors and Affiliations

TU Wien, Vienna, Austria
Filip Darmanović, Allan Hanbury & Markus Zlabinger

Corresponding author

Correspondence to Filip Darmanović.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Darmanović, F., Hanbury, A., Zlabinger, M. (2023). SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_14

DOI: https://doi.org/10.1007/978-3-031-41676-7_14
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41675-0
Online ISBN: 978-3-031-41676-7
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition