Elsevier

Pattern Recognition

Volume 35, Issue 2, February 2002, Pages 485-503
Automatic generation of structured hyperdocuments from document images

https://doi.org/10.1016/S0031-3203(01)00026-7

Abstract

As document sharing through the World Wide Web has increased steadily, so has the need to create hyperdocuments, in formats such as HTML and SGML/XML, that make documents accessible and retrievable via the internet. Nevertheless, only a few studies have addressed the conversion of paper documents into hyperdocuments, and most of them have concentrated on the direct conversion of single-column document images that contain only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that the proposed methods generate HTML documents with the same visual layout as the document images, and produce a structured table of contents page in which hierarchically ordered section titles are hyperlinked to the contents.

Introduction

The growing popularity of the internet has steadily increased the demand for documents that are accessible and retrievable through the World Wide Web. Inevitably, this has given rise to the need for the automatic conversion of paper document images, as well as digital documents, into hyperdocuments.

For the conversion of electronic documents into hyperdocuments, many methods and commercial tools have been developed and are now used in real applications. For the conversion of paper documents, however, only a few studies have been conducted. Furthermore, these studies were primarily concerned with single-column document images containing only text and image objects [1], [2], [3]. The automatic conversion of complex and varied multi-column document images has not been addressed, even though the need to represent such documents as hyperdocuments continues to grow.

In this paper, we propose two methods that convert multi-column document images into HTML documents: one is implemented using the table structure and the other using the layer structure. We also suggest a method for generating a table of contents (ToC) page through a logical structure analysis (logical labeling) [4], [5], [6], [7]. Fig. 1 illustrates the overall process of the automatic generation of structured hyperdocuments from multi-column document images.

In the process, geometrical and logical structure analyses are performed on multi-column document images. Geometrical structure analysis classifies all objects in the input document images into image, table, and text objects. After character recognition, logical structure analysis assigns labels such as section title, caption, page number, header, and footer to the text objects. The proposed methods are then applied to the result of the structure analysis. In converting multi-column document images into hyperdocuments, it is desirable for the screen display of the hyperdocuments to be consistent with the layout of the paper document images, so as to preserve their logical flow and appearance. To this end, we use the table and layer structures in the conversion stage. Finally, to generate a structured table of contents page, only the section titles are extracted from the text objects and ordered hierarchically. The generated table of contents page conveys the logical flow of the input documents and provides hyperlinks to the corresponding contents.

Although HTML tags are much more limited in availability and representation than those of XML or SGML, HTML provides a convenient way to share and retrieve varied and complex document images through the internet once they have been automatically converted into their corresponding structured hyperdocuments.

This paper is organized as follows. In Section 2, we review previous work on the conversion of document images into HTML documents and on the logical structure analysis of document images. In Section 3, we describe the two proposed methods for converting multi-column document images. In Section 4, we describe how to generate a table of contents page by extracting section titles from the text objects. In Section 5, experimental results on various kinds of complex multi-column document images are analyzed to evaluate the performance of the proposed conversion methods. Finally, conclusions and further studies are given in Section 6.

Section snippets

Related works

To convert paper document images into hyperdocuments, document structure analysis (geometrical structure analysis and logical structure analysis) must be performed. Geometrical structure analysis generally classifies homogeneous regions in a document image into text, image, and table objects. Logical structure analysis labels the objects according to their logical roles and establishes the relationships among the objects (classified as texts, images, and tables).
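To make the idea of logical labeling concrete, the following is a minimal sketch in Python. It is not the labeling method of this paper or of the cited works: the `TextObject` type, the `label` function, the 8% margin, and every rule below are assumptions chosen only to illustrate how text objects from geometrical analysis might receive labels such as section title, caption, page number, header, and footer.

```python
import re
from dataclasses import dataclass


@dataclass
class TextObject:
    """A recognized text region (hypothetical type, for illustration only)."""
    text: str    # characters recognized in the region
    y: int       # top edge of the region, in pixels
    height: int  # region height, in pixels


def label(obj: TextObject, page_height: int, margin: float = 0.08) -> str:
    """Assign a logical label using toy rules (assumed, not from the paper)."""
    t = obj.text.strip()
    # A region consisting only of digits is treated as a page number.
    if re.fullmatch(r"\d+", t):
        return "page number"
    # Regions lying entirely inside the top or bottom margin band.
    if obj.y + obj.height <= page_height * margin:
        return "header"
    if obj.y >= page_height * (1 - margin):
        return "footer"
    # Text starting with "Fig. 3", "Figure 3", or "Table 3".
    if re.match(r"(Fig\.|Figure|Table)\s*\d+", t):
        return "caption"
    # Short text starting with a section number such as "3." or "3.1".
    if len(t) < 80 and re.match(r"\d+(\.\d+)*\.?\s+\S", t):
        return "section title"
    return "body"
```

For example, `label(TextObject("3. HTML conversion", 400, 20), 1000)` yields "section title" under these rules. A practical system would also need font size, spacing, and the relationships among neighboring objects, which this sketch ignores.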

In the generation of

HTML conversion of multi-column document images

We propose two methods for converting multi-column document images into hyperdocuments: one is based on the table structure and the other on the layer structure.
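The snippet above does not give the details of either method, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that structure analysis yields non-overlapping rectangular blocks inside the page; the `Block` type and the functions `render`, `to_table`, and `to_layers` are hypothetical names. `to_table` cuts the page into a grid at every block edge and emits one table cell per block with `rowspan` and `colspan`; `to_layers` places each block in an absolutely positioned `div`.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    kind: str     # "text", "image", or "table"
    x: int        # left edge, pixels
    y: int        # top edge, pixels
    w: int        # width, pixels
    h: int        # height, pixels
    content: str  # recognized text, or an image file name for "image"


def render(b: Block) -> str:
    """Inner HTML of one block (table blocks are shown as plain text here)."""
    if b.kind == "image":
        src = escape(b.content, quote=True)
        return f'<img src="{src}" width="{b.w}" height="{b.h}" alt="">'
    return escape(b.content)


def to_table(blocks, page_w, page_h):
    """Table-structure sketch: one grid cell span per block."""
    # Grid lines at the page border and at every block edge.
    xs = sorted({0, page_w} | {b.x for b in blocks} | {b.x + b.w for b in blocks})
    ys = sorted({0, page_h} | {b.y for b in blocks} | {b.y + b.h for b in blocks})
    owner = {}  # (row, col) -> index of the block covering that grid cell
    for i, b in enumerate(blocks):
        for r in range(ys.index(b.y), ys.index(b.y + b.h)):
            for c in range(xs.index(b.x), xs.index(b.x + b.w)):
                owner[(r, c)] = i
    rows, emitted = [], set()
    for r in range(len(ys) - 1):
        cells = []
        for c in range(len(xs) - 1):
            i = owner.get((r, c))
            if i is None:
                # Empty filler cell that preserves white space in the layout.
                cells.append(f'<td width="{xs[c + 1] - xs[c]}" '
                             f'height="{ys[r + 1] - ys[r]}"></td>')
            elif i not in emitted:
                # First visit is the block's top-left cell; span the rest.
                emitted.add(i)
                b = blocks[i]
                rs = ys.index(b.y + b.h) - ys.index(b.y)
                cs = xs.index(b.x + b.w) - xs.index(b.x)
                cells.append(f'<td rowspan="{rs}" colspan="{cs}" width="{b.w}" '
                             f'height="{b.h}" valign="top">{render(b)}</td>')
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return ('<table border="0" cellspacing="0" cellpadding="0">\n'
            + "\n".join(rows) + "\n</table>")


def to_layers(blocks, page_w, page_h):
    """Layer-structure sketch: one absolutely positioned div per block."""
    parts = [f'<div style="position:relative;width:{page_w}px;height:{page_h}px">']
    for b in blocks:
        style = (f"position:absolute;left:{b.x}px;top:{b.y}px;"
                 f"width:{b.w}px;height:{b.h}px")
        parts.append(f'<div style="{style}">{render(b)}</div>')
    parts.append("</div>")
    return "\n".join(parts)
```

In this sketch the table variant depends on the blocks tiling cleanly into a grid, while the layer variant places each block independently and so tolerates irregular layouts, at the cost of relying on positioning support in the browser.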

Generation of a structured hyperdocument based on logical structure analysis

In this section, we take a number of related papers as input and automatically generate a table of contents page by extracting section titles from the input images and arranging them according to their hierarchical relations.
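The arrangement step can be sketched as follows in Python. This is an illustration under assumptions, not the paper's procedure: it assumes that logical structure analysis has already produced the section titles in reading order together with a nesting level (1 for a top-level section), and that the converted HTML document contains anchors named `sec1`, `sec2`, and so on. The function names `build_tree` and `render_toc` and the anchor scheme are hypothetical.

```python
from html import escape


def build_tree(titles):
    """Nest (level, text) pairs, given in reading order, into a tree.

    Levels are assumed to be integers >= 1.
    """
    root = {"children": []}
    stack = [(0, root)]  # (level, node) pairs on the path to the current title
    for n, (level, text) in enumerate(titles, start=1):
        node = {"text": text, "anchor": f"sec{n}", "children": []}
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def render_toc(nodes, target):
    """Render the tree as nested HTML lists with hyperlinks into `target`."""
    if not nodes:
        return ""
    href = escape(target, quote=True)
    items = "".join(
        f'<li><a href="{href}#{n["anchor"]}">{escape(n["text"])}</a>'
        f'{render_toc(n["children"], target)}</li>'
        for n in nodes
    )
    return f"<ul>{items}</ul>"
```

A call such as `render_toc(build_tree([(1, "Introduction"), (2, "Background")]), "paper.html")` produces a nested list in which each title links to its anchor in `paper.html`.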

Since HTML tags are very limited in availability and representation, it is difficult to fully express the meaningful information of objects in a document image in HTML format. For that reason, the creation of a table of contents page is very worthwhile work, which provides

Experimental environment

The proposed methods were implemented on a Pentium MMX 166 MHz PC. Experiments on HTML conversion were carried out with 300 document images taken from magazines, newspapers, books, scientific and technical journals, manuals, and UWDB (the University of Washington database) [10]. Generation of a table of contents page was tested on images of technical papers collected from the proceedings of scientific conferences and from journals.

Experimental results

Fig. 15, Fig. 16, Fig. 17 show the converted example images

Conclusions and further research

In this paper, we proposed two methods, one using the table structure and the other using the layer structure, for converting multi-column document images into HTML documents. We also proposed a method for generating a structured table of contents page by extracting the section titles from input document images.

For the conversion of each paper image into its hyperdocument, the proposed conversion methods were tested on various kinds of complex multi-column document images. Experimental results

About the Author—JI-YEON LEE was born in Seoul, Korea, in 1975. She received the B.S. degree in Information Processing from Sangmyung University, Korea, in 1997 and received the M.S. degree in Computer Science and Engineering at Korea University, Seoul, Korea, in 2000.

She is currently working as a research engineer at Samsung Electronics, Co., Ltd. in Korea. Her research interests include multimedia, hyperdocument and document structure analysis.

References (10)

  • T. Tanaka, S. Tsuruoka, Table form document understanding using node classification method and HTML document...
  • T.G. Kieninger, A. Dengel, A paper-to-HTML table converting system, Proceedings of the Third IAPR Workshop on Document...
  • M. Worring et al., Content based internet access to paper documents, Int. J. Document Anal. Recognition (1999)
  • C. Faure, Preattentive reading and selective attention for document image analysis, Proceedings of the Fifth...
  • J.L. Fisher, Logical structure descriptions of segmented document images, Proceedings of the First International...
There are more references available in the full text version of this article.

About the Author—JEONG-SEON PARK received the B.S. and M.S. degrees in Computer Science from Chungbuk National University, Cheongju, Korea, in 1988 and 1992, respectively. She is currently working toward the Ph.D. degree in computer science and engineering at Korea University, Seoul, Korea.

From February 1994 to July 1996, she was a research engineer in the S/W R&D center at Hyundai Electronics, Co., Ltd. in Korea, and from August 1996 to March 1999 she worked as an advanced research engineer at Hyundai Information Technology, Co., Ltd. in Korea. She was the winner of the Annual Best Paper Award of the Korea Information Science Society in 1994. Her research interests include pattern recognition, image processing and computer vision.

About the Author—HYERAN BYUN received the B.S. and M.S. degrees in Mathematics from Yonsei University, Korea. She received her Ph.D. degree in Computer Science from Purdue University, West Lafayette, Indiana. She was an assistant professor at Hallym University, Chooncheon, Korea, from 1994 to 1995. Since 1995, she has been an associate professor of Computer Science at Yonsei University, Korea. Her research interests include multimedia, computer vision, image processing, and pattern recognition.

About the Author—JONGSUB MOON received the B.S. and M.S. degrees in Computer Science from Seoul National University, Korea, in 1981 and 1983, respectively. He received the Ph.D. degree in Computer Science from Illinois Institute of Technology, Illinois, U.S.A., in 1991. He worked at the Gold Star Tele-electric Research Institute as a researcher between 1981 and 1985. After receiving the Ph.D. degree, he joined the Department of Information Engineering of Korea University, Korea, as an assistant professor. He is now an associate professor in the Department of Electric and Information Engineering of Korea University, Korea. His research interests include neural networks, image processing, pattern matching and cognitive science.

About the Author—SEONG-WHAN LEE received his B.S. degree in Computer Science and Statistics from Seoul National University, Seoul, Korea, in 1984; and M.S. and Ph.D. degrees in Computer Science from KAIST in 1986 and 1989, respectively.

From February 1989 to February 1995, he was an Assistant Professor in the Department of Computer Science at Chungbuk National University, Cheongju, Korea. In March 1995, he joined the faculty of the Department of Computer Science and Engineering at Korea University, Seoul, Korea, as an Associate Professor, and he is now a Full Professor. Currently, Dr. Lee is the director of the National Creative Research Initiative Center for Artificial Vision Research (CAVR), supported by the Korean Ministry of Science and Technology.

Dr. Lee was the winner of the Annual Best Paper Award of the Korea Information Science Society in 1986. He obtained the First Outstanding Young Researcher Award at the 2nd International Conference on Document Analysis and Recognition in 1993, and the First Distinguished Research Professor Award from Chungbuk National University in 1994. He also obtained the Outstanding Research Award from the Korea Information Science Society in 1996.

He has been the Co-Editor-in-Chief of the International Journal on Document Analysis and Recognition since 1998 and the Associate Editor of the Pattern Recognition Journal, the International Journal of Pattern Recognition and Artificial Intelligence, and the International Journal of Computer Processing of Oriental Languages since 1997.

He was the Program Co-Chair of the 6th International Workshop on Frontiers in Handwriting Recognition, the 2nd International Conference on Multimodal Interface, the 17th International Conference on Computer Processing of Oriental Languages, the 5th International Conference on Document Analysis and Recognition, and the 7th International Conference on Neural Information Processing. He was the Workshop Co-Chair of the 3rd International Workshop on Document Analysis Systems and the 1st IEEE International Workshop on Biologically Motivated Computer Vision. He served on the program committees of several well-known international conferences.

He is a fellow of the International Association for Pattern Recognition, a senior member of the IEEE Computer Society and a life member of the Korea Information Science Society, the International Neural Network Society, and the Oriental Languages Computer Society.

His research interests include pattern recognition, computer vision and neural networks. He has more than 200 publications in these areas in international journals and conference proceedings, and has authored five books.

This research was supported by the Creative Research Initiatives of the Korean Ministry of Science and Technology. A preliminary version of this paper was presented at the 15th International Conference on Pattern Recognition, Barcelona, September 2000.
