Unsupervised document structure analysis of digital scientific articles

Klampfl, Stefan; Granitzer, Michael; Jack, Kris; Kern, Roman

doi:10.1007/s00799-014-0115-1

Published: 08 June 2014

International Journal on Digital Libraries, Volume 14, pages 83–99 (2014)

Stefan Klampfl¹, Michael Granitzer³, Kris Jack⁴ & Roman Kern¹,²

Abstract

Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on the resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
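
The first stage named in the abstract, extracting contiguous text blocks without a trained model, can be illustrated with a small sketch. This is not the authors' implementation: it assumes word bounding boxes are already available and groups them by plain single-linkage on a distance threshold. The names Box, gap, group_blocks and the parameter max_gap are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a word (hypothetical input format)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def gap(a: Box, b: Box) -> float:
    """Larger of the horizontal and vertical separations (0 on an axis where the boxes overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def group_blocks(boxes: list[Box], max_gap: float) -> list[list[Box]]:
    """Single-linkage grouping: boxes linked by gaps <= max_gap share a block."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: dict[int, list[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return list(groups.values())


if __name__ == "__main__":
    words = [Box(0, 0, 10, 5, "Text"), Box(12, 0, 20, 5, "mining"), Box(0, 50, 10, 55, "Next")]
    for block in group_blocks(words, max_gap=3.0):
        print([w.text for w in block])
```

In this sketch the threshold is fixed by hand; a system that relies only on the current document would more plausibly derive such a value from the document's own spacing statistics.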

Notes

http://www.mendeley.com
http://www.citeulike.org
http://citeseerx.ist.psu.edu
http://knowminer.at:8080/code-demo/index.html
https://www.knowminer.at/svn/opensource/projects/code/trunk
http://pdfbox.apache.org/
http://itextpdf.com/
http://opensource.intarsys.de/home/en/index.php?n=OpenSource.JPod
http://poppler.freedesktop.org
Consider a page with four text blocks arranged in two columns (two blocks in each column) and in the middle of the page there is another block spanning both columns. Then the top right block is before the middle block in the reading order, the middle block before the bottom left block, but the bottom left block before the top right block.
http://www.ncbi.nlm.nih.gov/pubmed/
http://wing.comp.nus.edu.sg/parsCit/
http://poppler.freedesktop.org/
https://github.com/timtadh/zhang-shasha
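
The note above on the page with two columns and a spanning middle block describes a reading-order relation that contains a cycle, so it cannot be sorted directly. The sketch below reproduces that situation under an assumed pairwise rule (a block precedes another if it lies entirely to its left, or, when the two overlap horizontally, if it lies above it). The rule, the coordinates, and the names Block, before, three_cycles and BLOCKS are illustrative assumptions, not taken from the article.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    name: str
    x0: float
    x1: float
    y0: float  # top edge; y grows downwards on the page
    y1: float  # bottom edge


def before(a: Block, b: Block) -> bool:
    """Assumed pairwise rule: left-of decides for disjoint columns, above decides otherwise."""
    overlap_x = a.x0 < b.x1 and b.x0 < a.x1
    if not overlap_x:
        return a.x1 <= b.x0
    return a.y1 <= b.y0


# Two columns with two blocks each, plus a middle block spanning both columns.
BLOCKS = [
    Block("TL", 0, 45, 0, 30),
    Block("TR", 55, 100, 0, 30),
    Block("M", 0, 100, 40, 60),
    Block("BL", 0, 45, 70, 100),
    Block("BR", 55, 100, 70, 100),
]


def three_cycles(blocks: list[Block]) -> list[tuple[str, str, str]]:
    """Return every ordered triple (a, b, c) with a before b, b before c, c before a."""
    return [
        (a.name, b.name, c.name)
        for a, b, c in permutations(blocks, 3)
        if before(a, b) and before(b, c) and before(c, a)
    ]


if __name__ == "__main__":
    for cycle in three_cycles(BLOCKS):
        print(" -> ".join(cycle))
```

Because the relation is cyclic for the top right, middle, and bottom left blocks, any procedure that turns such pairwise decisions into a linear reading order needs an additional tie-breaking step.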

Acknowledgments

The presented work was in part developed within the CODE project funded by the EU FP7 (Grant No. 296150) and the TEAM IAPP project (Grant No. 251514) within the FP7 People Programme. The Know-Center is funded within the Austrian COMET Program (Competence Centers for Excellent Technologies) under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

Author information

Authors and Affiliations

Know-Center GmbH, Inffeldgasse 13/VI, 8010 Graz, Austria
Stefan Klampfl & Roman Kern
Knowledge Technologies Institute, Graz University of Technology, Graz, Austria
Roman Kern
University of Passau, Passau, Germany
Michael Granitzer
Mendeley Ltd, London, UK
Kris Jack

Corresponding author

Correspondence to Stefan Klampfl.

About this article

Cite this article

Klampfl, S., Granitzer, M., Jack, K. et al. Unsupervised document structure analysis of digital scientific articles. Int J Digit Libr 14, 83–99 (2014). https://doi.org/10.1007/s00799-014-0115-1

Received: 30 October 2013
Revised: 06 May 2014
Accepted: 14 May 2014
Published: 08 June 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s00799-014-0115-1
