Text Area Identification in Web Images

  • Conference paper
Methods and Applications of Artificial Intelligence (SETN 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3025)

Abstract

With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. Statistics show that a significant part of Web text information is encoded in Web images. Because Web images have special characteristics that sometimes distinguish them from other types of images, commercial OCR products often fail to recognize the text they contain. This paper proposes a novel Web image processing algorithm that locates text areas and prepares them for OCR so that recognition yields better results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Perantonis, S.J., Gatos, B., Maragos, V., Karkaletsis, V., Petasis, G. (2004). Text Area Identification in Web Images. In: Vouros, G.A., Panayiotopoulos, T. (eds) Methods and Applications of Artificial Intelligence. SETN 2004. Lecture Notes in Computer Science, vol 3025. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24674-9_10

  • DOI: https://doi.org/10.1007/978-3-540-24674-9_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21937-8

  • Online ISBN: 978-3-540-24674-9

  • eBook Packages: Springer Book Archive
