DOI: 10.1145/3508072.3508103
research-article

Deep learning Arabic printed document knowledge extraction

Published: 13 April 2022

ABSTRACT

This paper presents a deep learning approach to extracting knowledge from images of printed Arabic documents. Deep learning extracts significant features from images automatically, eliminating the need for a classic feature extraction method. We describe how to extract high-quality, coherent knowledge from such images. The system is built around keywords, which are used to classify the Arabic document images. We used a questionnaire to select valuable keywords for historical, scientific, or religious documents. We evaluate the proposed system on keyword extraction from printed Arabic document images. The accuracy of the proposed extraction approach depends heavily on image preprocessing and image quality. The proposed system extracts keywords with a high level of accuracy, achieving a character error rate of 3.78% and a word error rate of 15.46%.
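
The two error rates reported above are conventionally computed from the edit (Levenshtein) distance between the recognized text and a ground-truth transcription. The following is a minimal sketch of that conventional definition, not the authors' evaluation code: the function names `edit_distance`, `cer`, and `wer` are hypothetical, the sample strings are invented for illustration, and the choices to count spaces as characters and to split words on whitespace are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items yields the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: the OCR output gets the last letter of the second word wrong.
reference = "العلم نور"
hypothesis = "العلم نون"
print(f"CER = {cer(reference, hypothesis):.4f}")  # 1 substitution over 9 characters
print(f"WER = {wer(reference, hypothesis):.4f}")  # 1 wrong word over 2 words
```

Because a single wrong character makes the whole word count as an error, the word error rate is typically higher than the character error rate on the same output, which is consistent with the 3.78% and 15.46% figures reported.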

  • Published in

    ICFNDS '21: Proceedings of the 5th International Conference on Future Networks and Distributed Systems
    December 2021
    847 pages
    ISBN: 9781450387347
    DOI: 10.1145/3508072

    Copyright © 2021 ACM

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited