Abstract
As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document. Besides, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be directly applied to a wide range of documents without the need to build or learn the models for individual documents. In addition, the associations of general text and page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve the TOC functional tagging. The applications of the proposed method in a large-scale digital library project are also discussed.
Cite this article
Lin, X., Xiong, Y. Detection and analysis of table of contents based on content association. IJDAR 8, 132–143 (2006). https://doi.org/10.1007/s10032-005-0149-4
DOI: https://doi.org/10.1007/s10032-005-0149-4