ABSTRACT
This paper presents a deep learning approach to extracting knowledge from printed Arabic document images. Deep learning extracts significant features from images automatically, eliminating the need for a classical feature extraction method. The proposed system is built around keywords, which are used to classify Arabic document images. We used a questionnaire to select valuable keywords for historical, scientific, or religious documents. We evaluate the system by extracting keywords from printed Arabic document images. The accuracy of the extraction depends strongly on image preprocessing and image quality. The proposed system extracts keywords accurately, achieving a character error rate of 3.78% and a word error rate of 15.46%.
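The abstract reports a character error rate (CER) and a word error rate (WER) without stating how they are computed. As a minimal sketch, assuming the common definition (Levenshtein edit distance between the recognized text and the ground truth, divided by the length of the ground truth), the two metrics could be computed as follows. The function names `edit_distance`, `cer`, and `wer` are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (a string or a list of tokens)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the cat sat"
    ocr_output = "the cat sit"
    print(f"CER = {cer(truth, ocr_output):.4f}")
    print(f"WER = {wer(truth, ocr_output):.4f}")
```

Python strings are sequences of Unicode code points, so the same functions accept Arabic text directly. Under this assumption, diacritics encoded as separate combining marks count as individual characters, and any normalization applied before scoring would change the reported rates.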