short-paper

Designing a Vision Transformer based Enhanced Text Extractor for Product Images

Authors:

Kunal Banerjee

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

Pages 208 - 212

https://doi.org/10.1145/3570991.3571010

Published: 04 January 2023

Abstract

Product images, such as those that appear on e-commerce sites, exhibit characteristics that are typically absent from natural images. The primary distinguishing characteristic is the presence of text (e.g., brand names, price, constituents) together with high local entropy (i.e., a large amount of visual information, in the form of both text and brightly coloured pictures, condensed into a small region). Extracting the text from these images can have multiple benefits, including catalogue enrichment, product matching, and offensive content identification. However, the images are sometimes so unclear and blurry that the text is difficult to recognise even for a human reader. The text is often written in non-standard fonts (at times each character in a word has a different colour and/or style), oriented at odd angles, or printed on curved surfaces; moreover, many of the words, such as brand names, do not appear in dictionaries. In this work, we present a vision transformer based text extractor that handles these challenges effectively for product images and considerably outperforms our earlier model. We further compare our new end-to-end text extraction solution with the text extraction cloud offerings of Google and Azure, and showcase its efficacy in terms of both accuracy and latency.
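
The abstract uses "local entropy" informally. One common way to put a number on the idea is the Shannon entropy of the intensity histogram inside a small window. The sketch below illustrates that reading under stated assumptions: the function names, the non-overlapping tiling, and the toy 4x4 grid are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy, in bits, of a flat sequence of intensity values."""
    counts = Counter(pixels)
    total = len(pixels)
    # Sum -p * log2(p) over every distinct intensity value in the sequence.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def local_entropy(image, window):
    """Entropy of each non-overlapping window x window tile of a 2-D grid.

    `image` is a list of equal-length rows of integer grey levels.
    Tiles that would run past the right or bottom edge are skipped.
    """
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(0, rows - window + 1, window):
        row_out = []
        for c in range(0, cols - window + 1, window):
            tile = [image[r + i][c + j]
                    for i in range(window)
                    for j in range(window)]
            row_out.append(shannon_entropy(tile))
        result.append(row_out)
    return result


# Hypothetical toy "image": a flat left half and a visually busy right half.
toy = [
    [0, 0, 10, 200],
    [0, 0, 90, 40],
    [0, 0, 10, 200],
    [0, 0, 90, 40],
]
entropy_map = local_entropy(toy, 2)
```

In this toy grid the flat tiles score zero bits and the busy tiles score two bits, which matches the intuition that densely packed text and graphics raise the local entropy of a region.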

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 2023

357 pages

ISBN: 9781450397971

DOI: 10.1145/3570991

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

CODS-COMAD 2023

CODS-COMAD 2023: 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)

January 4 - 7, 2023

Mumbai, India

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%
