research-article

Implementing the Tesseract Method for Information Extraction from Images

Authors:
Abrar Al Sayem, School of Data and Sciences, BRAC University, Bangladesh (ORCID: 0000-0003-4956-423X)
Asiful Islam Chowdhury, School of Data and Sciences, BRAC University, Bangladesh (ORCID: 0009-0007-1175-1642)
Shahriar Hossain Shojol, School of Data and Sciences, BRAC University, Bangladesh (ORCID: 0009-0005-8775-3754)
Md Humaion Kabir Mehedi, School of Data and Sciences, BRAC University, Bangladesh (ORCID: 0000-0002-5759-022X)
Annajiat Alim Rasel, School of Data and Sciences, BRAC University, Bangladesh (ORCID: 0000-0003-0198-3734)

ICCTA '23: Proceedings of the 2023 9th International Conference on Computer Technology Applications, May 2023, Pages 97–101. https://doi.org/10.1145/3605423.3605431

Published: 20 August 2023

ABSTRACT

Character recognition (CR) has been the subject of substantial research over the past half century and has matured enough to support practical, technology-driven applications. Rapidly growing computing power now makes it feasible to run existing Optical Character Recognition (OCR) approaches, and it is driving demand in a wide variety of emerging application domains that call for more advanced methodologies. Scanning a paper document and entering the digitized image into a computer system is one of the quickest and easiest ways to save text information. The image can then be stored on the computer and altered if necessary. However, recognizing the text in a recorded image is a very difficult challenge. The Tesseract method is therefore employed to extract text from images, simplifying the process.

Index Terms

Software and its engineering

Published in

ICCTA '23: Proceedings of the 2023 9th International Conference on Computer Technology Applications
May 2023
270 pages
ISBN: 9781450399579
DOI: 10.1145/3605423

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Optical Character Recognition
Recognition
Segmentation
Tesseract
Text Recognition
Qualifiers
- research-article
- Research
- Refereed limited