Word Extraction from Table Regions in Document Images

Jeong, Chang-Bu; Park, Sang-Cheol; Son, Hwa-Jeong; Kim, Soo-Hyung

doi:10.1007/11599517_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3815)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

This paper describes a method for extracting words from table regions in document images. The proposed approach consists of two stages: cell detection and word extraction. In the cell detection stage, the table frame is first extracted by analyzing connected components, and intersection points are then detected by applying masks to the table frame. False intersections are corrected, and the locations of the cells within the table are determined. In the word extraction stage, the text region in each cell is located using the connected-component information obtained during cell detection and is segmented into text lines using projection profiles. Finally, the segmented lines are divided into words using gap clustering and special symbol detection. The method also correctly assigns character components that touch the table frame to their words. Experimental results show that more than 99% of words were successfully extracted from table regions.
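The word extraction stage summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm is used, so a two-cluster 1-D k-means over gap widths is assumed here, and the function names (`runs`, `segment_lines`, `gap_threshold`, `segment_words`) are hypothetical. The input is assumed to be a binary cell image given as a list of rows, with 1 for ink and 0 for background. Special symbol detection and the handling of characters touching the table frame are not covered.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed format)


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals where the profile is non-zero."""
    out, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Horizontal projection profile: consecutive rows containing ink form a text line."""
    return runs([sum(row) for row in img])


def gap_threshold(gaps: List[int]) -> float:
    """Two-cluster 1-D k-means over gap widths (an assumed clustering choice).

    Returns the midpoint between the two cluster centres. If all gaps are
    equal, the returned value equals that gap, so no gap exceeds it.
    """
    lo, hi = float(min(gaps)), float(max(gaps))
    for _ in range(20):
        mid = (lo + hi) / 2
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        if not large:  # only one distinct gap width
            break
        new_lo, new_hi = sum(small) / len(small), sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def segment_words(img: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Split one text line (rows top..bottom-1) into words by clustering column gaps."""
    width = len(img[0])
    profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    chars = runs(profile)  # column spans of character-like blobs
    if len(chars) < 2:
        return chars
    gaps = [b[0] - a[1] for a, b in zip(chars, chars[1:])]
    threshold = gap_threshold(gaps)
    words = [chars[0]]
    for gap, (start, end) in zip(gaps, chars[1:]):
        if gap > threshold:  # wide gap: start a new word
            words.append((start, end))
        else:  # narrow gap: extend the current word
            words[-1] = (words[-1][0], end)
    return words
```

A caller would first run `segment_lines` on a cell image and then pass each returned `(top, bottom)` pair to `segment_words`. Treating a line whose gaps are all equal as a single word is a design choice of this sketch, not something stated in the paper.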

Author information

Authors and Affiliations

Department of Internet Software, Honam University, 59-1 Sebong-dong, Gwangsan-gu, Gwangju, 506-714, Korea
Chang-Bu Jeong
Department of Computer Science, Chonnam National University, 300 YongBong-dong, Buk-Gu, Gwangju, 500-757, Korea
Sang-Cheol Park, Hwa-Jeong Son & Soo-Hyung Kim

Editor information

Editors and Affiliations

Virginia Tech, 24061, Blacksburg, VA
Edward A. Fox
University of Vienna, Vienna, Austria
Erich J. Neuhold
Department of Library Science, Chulalongkorn University, 10330, Bangkok, Thailand
Pimrumpai Premsmit
School of Engineering and Technology, Asian Institute of Technology, P.O. Box 4, 12120, Klong Luang, Pathum Thani, Thailand
Vilas Wuwongse

About this paper

Cite this paper

Jeong, CB., Park, SC., Son, HJ., Kim, SH. (2005). Word Extraction from Table Regions in Document Images. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_25

DOI: https://doi.org/10.1007/11599517_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30850-8
Online ISBN: 978-3-540-32291-7
eBook Packages: Computer Science, Computer Science (R0)
