Abstract
This paper presents our research on extracting referential heading-entries from recognized table of contents (TOC) pages. The task faces two issues: the complexity of layouts (e.g., a referential heading-entry can span one or many lines and may include decorative text), and text errors introduced by OCR processing in the training data. Our approach uses several layout-based and content-based features to classify the textual lines of TOC pages in the datasets. We also propose synthesis rules that combine related, classified lines to identify referential heading-entries. The experiments are conducted on the ICDAR Book Structure Extraction datasets of 2009, 2011, and 2013. The experimental results show that the proposed approach is more efficient than previous methods of referential heading-entry extraction.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Nguyen, P.T., Nguyen, D.T. (2015). Extraction of Referential Heading-Entries in Recognized Table of Contents Pages. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Prokopova, Z., Silhavy, P. (eds) Intelligent Systems in Cybernetics and Automation Theory. CSOC 2015. Advances in Intelligent Systems and Computing, vol 348. Springer, Cham. https://doi.org/10.1007/978-3-319-18503-3_1
DOI: https://doi.org/10.1007/978-3-319-18503-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18502-6
Online ISBN: 978-3-319-18503-3
eBook Packages: Engineering, Engineering (R0)