
Unsupervised document structure analysis of digital scientific articles

International Journal on Digital Libraries

Abstract

Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorise the blocks into different classes. Based on the resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.

Notes

  1. http://www.mendeley.com

  2. http://www.citeulike.org

  3. http://citeseerx.ist.psu.edu

  4. http://knowminer.at:8080/code-demo/index.html

  5. https://www.knowminer.at/svn/opensource/projects/code/trunk

  6. http://pdfbox.apache.org/

  7. http://itextpdf.com/

  8. http://opensource.intarsys.de/home/en/index.php?n=OpenSource.JPod

  9. http://poppler.freedesktop.org

  10. Consider a page with four text blocks arranged in two columns (two blocks in each column) and in the middle of the page there is another block spanning both columns. Then the top right block is before the middle block in the reading order, the middle block before the bottom left block, but the bottom left block before the top right block.

  11. http://www.ncbi.nlm.nih.gov/pubmed/

  12. http://wing.comp.nus.edu.sg/parsCit/

  13. http://poppler.freedesktop.org/

  14. https://github.com/timtadh/zhang-shasha
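
The cyclic reading order described in note 10 can be reproduced with a small sketch. The pairwise rule used here (blocks that overlap horizontally are ordered top to bottom, all other pairs left to right) is an assumption chosen for illustration and is not necessarily the rule used in the article. The `Block` class, the coordinates, and the function names are likewise hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    name: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge (y grows downwards)
    y1: float  # bottom edge


def overlaps_horizontally(a: Block, b: Block) -> bool:
    """True if the x-extents of the two blocks intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def before(a: Block, b: Block) -> bool:
    """Assumed pairwise rule (illustration only): top-to-bottom for blocks
    that overlap horizontally, left-to-right for all other pairs."""
    if overlaps_horizontally(a, b):
        return a.y1 <= b.y0  # a ends above the start of b
    return a.x1 <= b.x0      # a lies entirely to the left of b


def three_cycles(blocks):
    """Return every set of three blocks whose pairwise order forms a cycle."""
    return {
        frozenset((a.name, b.name, c.name))
        for a, b, c in permutations(blocks, 3)
        if before(a, b) and before(b, c) and before(c, a)
    }


# Layout from note 10: two columns with two blocks each, plus one block
# in the middle of the page that spans both columns.
top_left = Block("top_left", 0, 45, 0, 30)
top_right = Block("top_right", 55, 100, 0, 30)
middle = Block("middle", 0, 100, 40, 60)
bottom_left = Block("bottom_left", 0, 45, 70, 100)
bottom_right = Block("bottom_right", 55, 100, 70, 100)
page = [top_left, top_right, middle, bottom_left, bottom_right]

cycles = three_cycles(page)
print(cycles)
```

Under this assumed rule the only cycle found is the one named in the note: top right before middle, middle before bottom left, bottom left before top right. Because the relation contains a cycle, sorting the blocks with a pairwise comparator cannot produce a consistent order for such a page.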

Acknowledgments

The presented work was in part developed within the CODE project funded by the EU FP7 (Grant No. 296150) and the TEAM IAPP project (Grant No. 251514) within the FP7 People Programme. The Know-Center is funded within the Austrian COMET Program (Competence Centers for Excellent Technologies) under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth, and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

Author information

Corresponding author

Correspondence to Stefan Klampfl.

About this article

Cite this article

Klampfl, S., Granitzer, M., Jack, K. et al. Unsupervised document structure analysis of digital scientific articles. Int J Digit Libr 14, 83–99 (2014). https://doi.org/10.1007/s00799-014-0115-1

  • DOI: https://doi.org/10.1007/s00799-014-0115-1
