research-article

Deep learning Arabic printed document knowledge extraction

Authors:
Taghreed Alghamdi, University of Jeddah, Saudi Arabia
Samia Snoussi, University of Jeddah, Saudi Arabia
Lobna Hsairi, University of Jeddah, Saudi Arabia

ICFNDS '21: Proceedings of the 5th International Conference on Future Networks and Distributed Systems, December 2021, Pages 201–207, https://doi.org/10.1145/3508072.3508103

Published: 13 April 2022

ABSTRACT

This paper presents a deep learning approach to extracting knowledge from images of printed Arabic documents. Deep learning learns significant features directly from images, which removes the need for a classic hand-crafted feature extraction step. The proposed system is built around keywords, which are used to classify the Arabic document images. We used a questionnaire to select valuable keywords for historical, scientific, or religious documents. We evaluate the system on keyword extraction from printed Arabic document images. The accuracy of the extraction is strongly affected by image preprocessing and image quality. The proposed system achieves a character error rate of 3.78% and a word error rate of 15.46%.
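
The abstract reports a character error rate (CER) and a word error rate (WER) but does not state how they were computed. The following is a minimal sketch, assuming the common definition: the Levenshtein edit distance between the ground-truth text and the recognized text, divided by the length of the ground truth, counted over characters for CER and over whitespace-separated words for WER. The function names and the sample strings are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR output differing in one letter.
    truth = "العلم نور"
    ocr_output = "العلم نار"
    print(f"CER = {cer(truth, ocr_output):.2%}")
    print(f"WER = {wer(truth, ocr_output):.2%}")
```

In this example a single substituted letter changes one of the two words, so the word-level rate is much higher than the character-level rate. The same effect is consistent with the gap between the reported 3.78% CER and 15.46% WER, since one wrong character is enough to count a whole word as an error.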

Supplemental Material

icfnds2021-31.pdf (937.8 KB): presentation slides

Published in

ICFNDS '21: Proceedings of the 5th International Conference on Future Networks and Distributed Systems
December 2021
847 pages
ISBN: 9781450387347
DOI: 10.1145/3508072

Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Arabic document image
Deep Learning
Image Preprocessing
knowledge extraction
text extraction
Qualifiers
- research-article
- Research
- Refereed limited