Detection and analysis of table of contents based on content association

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document. Besides, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be directly applied to a wide range of documents without the need to build or learn the models for individual documents. In addition, the associations of general text and page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve the TOC functional tagging. The applications of the proposed method in a large-scale digital library project are also discussed.
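The abstract does not spell out the matching algorithm, so the following is only a minimal sketch of the general idea of content association, under stated assumptions: a candidate TOC line is scored by how well its text matches the text found at the top of some page in the body, and the score is raised when the page number printed on the line agrees with that page. The function names (`normalize`, `split_entry`, `association_score`), the `page_weight` parameter, and the use of `difflib` string similarity are illustrative choices, not the authors' implementation.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse non-alphanumeric runs so OCR noise matters less."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def split_entry(line):
    """Split a candidate TOC line into (title, trailing page number or None)."""
    match = re.match(r"^(.*?)[\s.]*(\d+)\s*$", line)
    if match and match.group(1).strip():
        return match.group(1).strip(), int(match.group(2))
    return line.strip(), None


def association_score(line, pages, page_weight=0.3):
    """Return (best score, best page number) for one candidate TOC line.

    `pages` maps a printed page number to the text at the top of that page.
    The score mixes text similarity with a bonus for page-number agreement;
    the 0.3 weight is an arbitrary assumption for illustration.
    """
    title, number = split_entry(line)
    best_score, best_page = 0.0, None
    for page_no, heading in pages.items():
        text_sim = SequenceMatcher(
            None, normalize(title), normalize(heading)
        ).ratio()
        bonus = page_weight if number == page_no else 0.0
        score = (1 - page_weight) * text_sim + bonus
        if score > best_score:
            best_score, best_page = score, page_no
    return best_score, best_page


if __name__ == "__main__":
    # Hypothetical page headings from the body of a multi-page document.
    body_pages = {
        1: "Introduction",
        15: "Related Work on Table Detection",
        42: "Experimental Results",
    }
    for candidate in [
        "Related work on table detection ..... 15",
        "The method was evaluated on scanned books.",
    ]:
        print(candidate, "->", association_score(candidate, body_pages))
```

A line whose text and page number both agree with a body page scores close to 1, while ordinary running text scores low. A page could then be flagged as a TOC candidate when most of its lines score high, which needs no document-specific model.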



Author information

Corresponding author

Correspondence to Xiaofan Lin.


About this article

Cite this article

Lin, X., Xiong, Y. Detection and analysis of table of contents based on content association. IJDAR 8, 132–143 (2006). https://doi.org/10.1007/s10032-005-0149-4

  • DOI: https://doi.org/10.1007/s10032-005-0149-4

Keywords