Making scanned Arabic documents machine accessible using an ensemble of SVM classifiers

Elanwar, Randa; Qin, Wenda; Betke, Margrit

doi:10.1007/s10032-018-0298-x

Original Paper
Published: 02 April 2018

Volume 21, pages 59–75, (2018)

International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Raster-image PDF files that originate from scanning or photographing paper documents are inaccessible both to text search engines and to the screen readers used by people with visual impairments. Here we focus on the comparatively under-researched problem of converting raster-image files with Arabic script into machine-accessible documents. Our method, called ECDP for “Ensemble-based classification of document patches,” segments the physical layout of the document, classifies image patches as containing text or graphics, assembles homogeneous document regions, and passes the text regions to an optical character recognition engine, which converts them into natural language. Classification is based on majority voting among an ensemble of support vector machines. When tested on the dataset BCE-Arabic [Saad et al. in: ACM 9th annual international conference on pervasive technologies related to assistive environments (PETRA’16), Corfu, 2016], ECDP yielded an average patch classification accuracy of 97.3% and an average \(F_1\) score of 95.26% for text patches, and it efficiently extracted text zones in both paragraphs and text-embedded graphics, even when the text was rotated by \(90^{\circ }\) or was in English. ECDP outperforms a classical layout analysis method (RLSA) and a state-of-the-art commercial product (RDI-CleverPage) on this dataset and maintains a relatively high level of performance on document images drawn from two other datasets (Hesham et al. in Pattern Anal Appl 20:1275–1287, 2017; Proprietary Dataset of 109 Arabic Documents. http://www.rdi-eg.com). The results suggest that the proposed method has the potential to generalize well to the analysis of documents with a broad range of content.
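
The two quantitative ideas in the abstract, majority voting over an ensemble and the \(F_1\) score for text patches, can be illustrated with a minimal sketch. This is not the authors' implementation: the ensemble members below are hypothetical threshold rules standing in for trained support vector machines, and the feature vectors, labels, and function names (`majority_vote`, `f1_score`, `make_threshold_classifier`) are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

Patch = Sequence[float]               # hypothetical feature vector of one image patch
Classifier = Callable[[Patch], str]   # returns "text" or "graphics"


def make_threshold_classifier(index: int, threshold: float) -> Classifier:
    """Stand-in for one trained SVM: thresholds a single feature (assumption)."""
    def classify(patch: Patch) -> str:
        return "text" if patch[index] > threshold else "graphics"
    return classify


def majority_vote(classifiers: List[Classifier], patch: Patch) -> str:
    """Return the label predicted by most ensemble members for one patch."""
    votes = Counter(clf(patch) for clf in classifiers)
    return votes.most_common(1)[0][0]


def f1_score(y_true: List[str], y_pred: List[str], positive: str = "text") -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# An odd number of members avoids tied votes in the two-class case.
ensemble = [make_threshold_classifier(i, 0.5) for i in range(3)]

# Made-up patches and ground-truth labels, for illustration only.
patches = [(0.9, 0.8, 0.1), (0.2, 0.7, 0.3), (0.6, 0.4, 0.9), (0.1, 0.2, 0.8)]
truth = ["text", "graphics", "text", "text"]

predicted = [majority_vote(ensemble, p) for p in patches]
print(predicted, round(f1_score(truth, predicted), 4))
```

In this toy run the last patch is misclassified as graphics, so precision for the text class is 1 while recall is 2/3. In a real system each ensemble member would be a classifier trained on patch features rather than a fixed threshold.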

References

Cattoni, R., Coianiz, T., Messelodi, S., Modena, C.M.: Geometric layout analysis techniques for document image understanding: a review. Technical Report TR9703-09, ITC-IRST, Trento, January 1998, 68 pages
Chen, K., Seuret, M., Wei, H., Liwicki, M., Hennebert, J., Ingold, R.: Ground truth model, tool, and dataset for layout analysis of historical documents. In: Proceedings of SPIE 9402, Document Recognition and Retrieval XXII, Feb. 2015, 10 pages
Alshameri, A., Abdou, S., Mostafa, K.: A combined algorithm for layout analysis of Arabic document images and text lines extraction. Int. J. Comput. Appl. 49(23), 30–37 (2012)
Bukhari, S.S., Breuel, T.M., Asi, A., El-Sana, J.: Layout analysis for Arabic historical document images using machine learning. In: International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 639–644 (2012)
Hadjar, K., Ingold, R.: Physical layout analysis of complex structured Arabic documents using artificial neural nets. In: International Workshop on Document Analysis Systems, pp. 170–178 (2004)
Saad, R.S.M., Elanwar, R.I., Abdel Kader, N.S., Mashali, S., Betke, M.: BCE-Arabic-v1 dataset: a step towards interpreting Arabic document images for people with visual impairments. In: ACM 9th Annual International Conference on Pervasive Technologies Related to Assistive Environments (PETRA’16), pp. 25–32, Corfu, June (2016)
BCE-Arabic Dataset: Publicly Available. http://www.cs.bu.edu/faculty/betke/BCE June (2016)
Hesham, A.M., Rashwan, M.A., Al-Barhamtoshy, H.M., Abdou, S.M., Badr, A.A., Farag, I.: Arabic document layout analysis. Pattern Anal. Appl. 20, 1275–1287 (2017)
Proprietary Dataset of 109 Arabic Documents: Provided by RDI, The Engineering Company for Digital Systems Development. http://www.rdi-eg.com
Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992)
Baird, H.: Background structure in document images. Int. J. Pattern Recognit. Artif. Intell. 8(5), 1013–1030 (1994)
O’Gorman, L.: The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 15(11), 1162–1173 (1993)
Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area Voronoi diagram. Comput. Vis. Image Underst. 70(3), 370–382 (1998)
Breuel, T.M.: Two geometric algorithms for layout analysis. In: Workshop on Document Analysis Systems, pp. 188–199, Princeton (2002)
Wong, K., Casey, R., Wahl, F.: Document analysis system. IBM J. Res. Dev. 26(6), 647–656 (1982)
Jain, A.K., Yu, B.: Document representation and its application to page decomposition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 20(3), 294–308 (1998)
Fletcher, L.A., Kasturi, R.: A robust algorithm for text string separation from mixed text/graphics images. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 10, 910–918 (1988)
Shafait, F., Hasan, A., Keysers, D., Breuel, T.M.: Layout analysis of Urdu document images. In: 10th International Multitopic Conference, pp. 293–298, Islamabad (2006)
Won, C.S.: Image extraction in digital documents. J. Electron. Imaging 17(3), 033016 (2008)
Bukhari, S.S., Shafait, F., Breuel, T.M.: The IUPR dataset of camera-captured document images. In: International Workshop on Camera-Based Document Analysis and Recognition, pp. 164–171 (2012)
Lin, M.W., Tapamo, J.R., Ndovie, B.: A texture-based method for document segmentation and classification. S. Afr. Comput. J. 36(1), 49–56 (2006)
Bukhari, S.S., Azawi, A., Ali, M.I., Shafait, F., Breuel, T.M.: Document image segmentation using discriminative learning over connected components. In: DAS’10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 183–190, Boston, June (2010)
Ye, X., Cheriet, M., Suen, C.Y.: A generic system to extract and clean handwritten data from business forms. In: 7th International Workshop on Frontiers in Handwriting Recognition, pp. 63–72 (2000)
Fan, W., Sun, J., Naoi, S.: Separation of text and background regions for high performance document image compression. In: Proceedings of SPIE 9402, Document Recognition and Retrieval XXII, pp. 94020K1–94020K12, February (2015)
Le, V.P., Nayef, N., Visani, M., Ogier, J.M., De Tran, C.: Text and non-text segmentation based on connected component features. In: IEEE 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1096–1100, August (2015)
Rahman, A.F.R., Fairhurst, M.C.: Multiple classifier decision combination strategies for character recognition: a review. Int. J. Doc. Anal. Recognit. (IJDAR) 5, 166–194 (2003)
Guyon, I., Haralick, R.M., Hull, J.J., Phillips, I.T.: Data sets for OCR and document image understanding research. In: Wang, H.B. (ed.) Handbook of Character Recognition and Document Image Analysis, pp. 779–799. World Scientific, Singapore (1997)
Taghva, K., Nartker, T., Borsack, J., Condit, A.: UNLV-ISRI document collection for research in OCR and information retrieval. In: IS&T/SPIE Conference on Document Recognition and Retrieval VII, San Jose, pp. 157–164, January (2000)
Shafait, F., Breuel, T.M.: Document image dewarping contest. In: 2nd International Workshop on Camera-Based Document Analysis and Recognition, pp. 181–188, Curitiba (2007)
Zelenika, D., Povh, J., Ženko, B.: Text detection in document images by machine learning algorithms. In: 9th International Conference on Computer Recognition Systems CORES 2015, pp. 169–179. Springer (2016)
Belaïd, A., Ouwayed, N.: Segmentation of ancient Arabic documents. In: Märgner, V., El Abed, H. (eds.) Guide to OCR for Arabic Scripts, pp. 103–122. Springer, London (2012)
Boussellaa, W., Zahour, A., Taconet, B., Alimi, A., Benabdelhafid, A.: PRAAD: preprocessing and analysis tool for Arabic ancient documents. In: 9th International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 1058–1062 (2007)
Bukhari, S.S., Shafait, F., Breuel, T.M.: Layout analysis of Arabic script documents. In: Märgner, V., El Abed, H. (eds.) Guide to OCR for Arabic Scripts, pp. 35–53. Springer, London (2012)
Bloomberg, D.: Multiresolution morphological approach to document image analysis. In: 1st International Conference on Document Analysis and Recognition, pp. 963–971 (1991)
Breuel, T.M.: An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. In: 7th International Conference on Document Analysis and Recognition, pp. 66–70 (2003)
Hadjar, K., Ingold, R.: Arabic newspaper page segmentation. In: 7th International Conference on Document Analysis and Recognition, pp. 895–899 (2003)
Capobianco, S., Marinai, S.: Text line extraction in handwritten historical documents. In: Italian Research Conference on Digital Libraries, pp. 68–79 (2017)
Pastor-Pellicer, J., Afzal, M.Z., Liwicki, M., Castro-Bleda, M.J.: Complete system for text line extraction using convolutional neural networks and watershed transform. In: 12th IAPR Workshop on Document Analysis Systems (DAS), Santorini, pp. 30–35 (2016)
Wick, C., Puppe, F.: Fully convolutional neural networks for page segmentation of historical document images. https://arxiv.org/pdf/1711.07695.pdf (2017), 6 pages
Hao, L., Gao, L., Yi, X., Tang, Z.: A table detection method for PDF documents based on convolutional neural networks. In: 12th IAPR Workshop on Document Analysis Systems (DAS), Santorini (2016)
Meier, B., Stadelmann, T., Stampfli, J., Arnold, M., Cieliebak, M.: Fully convolutional neural networks for newspaper article segmentation. In: 14th International Conference on Document Analysis and Recognition (ICDAR), pp. 1–6 (2017)
Oliveira, D.A.B., Viana, P.M.: Fast CNN-based document layout analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1180, Waikiki (2017)
Capobianco, S., Marinai, S.: Deep neural networks for record counting in historical handwritten documents. In: Pattern Recognition Letters (2017). https://doi.org/10.1016/j.patrec.2017.10.023
Rahman, M.M., Finin, T.: Understanding the logical and semantic structure of large documents. https://arxiv.org/pdf/1709.00770.pdf (2017), 10 pages
Afzal, M.Z., Capobianco, S., Malik, M.I., Marinai, S., Breuel, T.M., Dengel, A., Liwicki, M.: DeepDocClassifier: document classification with deep convolutional neural network. In: 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1111–1115 (2015)
Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 991–995, August (2015)
Kölsch, A., Afzal, M.Z., Ebbecke, M., Liwicki, M.: Real-time document image classification using deep CNN and extreme learning machines. https://arxiv.org/pdf/1711.05862.pdf (2017), 6 pages
Seuret, M., Fischer, A., Garz, A., Liwicki, M., Ingold, R.: Clustering historical documents based on the reconstruction error of autoencoders. In: ACM 3rd International Workshop on Historical Document Imaging and Processing, pp. 85–91 (2015)
Chen, K., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation of historical document images with convolutional autoencoders. In: 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1011–1015 (2015)
Chen, K., Liu, C.L., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation for historical document images based on superpixel classification with unsupervised feature learning. In: 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 299–304 (2016)
Wei, H., Seuret, M., Chen, K., Fischer, A., Liwicki, M., Ingold, R.: Selecting autoencoder features for layout analysis of historical documents. In: ACM 3rd International Workshop on Historical Document Imaging and Processing, pp. 55–62 (2015)
Zhu, W., Chen, Q., Wei, C., Li, Z.: A segmentation algorithm based on image projection for complex text layout. In: American Institute of Physics (AIP) Conference Proceedings, vol. 1890, p. 1 (2017)
Ahn, B., Ryu, J., Koo, H.I., Cho, N.I.: Textline detection in degraded historical document images. EURASIP J. Image Video Process. 82, 1–13 (2017)
Zhang, X., Duan, L., Ma, L., Wu, J.: Text extraction for historical Tibetan document images based on connected component analysis and corner point detection. In: Yang, J., et al. (eds.) Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol. 772, pp. 545–555. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-7302-1_45
Clausner, C., Pletschacher, S., Antonacopoulos, A.: Aletheia—an advanced document layout and text ground-truthing system for production environments. In: IEEE International Conference on Document Analysis and Recognition (ICDAR), pp. 48–52, September (2011)
Pletschacher, S., Antonacopoulos, A.: The page (page analysis and ground-truth elements) format framework. In: 20th International Conference on Pattern Recognition (ICPR), pp. 257–260 (2010)
Kavitha, A.S., Shivakumara, P., Kumar, G.H., Lu, T.: Text segmentation in degraded historical document images. Egypt. Inf. J. 17, 189–197 (2016)
Wang, Y., Phillips, I.T., Haralick, R.M.: A study on the document zone content classification problem. In: International Workshop on Document Analysis Systems, pp. 212–223 (2002)
Baechler, M., Bloechle, J.-L., Ingold, R.L.: Semi-automatic annotation tool for medieval manuscripts. In: 2010 IEEE International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 182–187 (2010)
Deivalakshmi, S., Palanisamy, P., Vishwanathan, G.: A novel method for text and non-text segmentation in document images. In: 2013 IEEE International Conference on Communications and Signal Processing (ICCSP), pp. 255–259 (2013)
Tahmasbi, A., Saki, F., Shokouhi, S.B.: Classification of benign and malignant masses based on Zernike moments. Comput. Biol. Med. 41(8), 726–735 (2011)
RDI CleverPage: Document layout analysis software by RDI, The Engineering Company for Digital Systems Development. http://www.rdi-eg.com
Shafait, F., Keysers, D., Breuel, T.M.: Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 30(6), 941–954 (2008)
Mao, S., Kanungo, T.: Empirical performance evaluation methodology and its application to page segmentation algorithms. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 23(3), 242–256 (2001)
Oyedotun, O.K., Khashman, A.: Document segmentation using textural features summarization and feedforward neural network. Appl. Intell. 45, 1–15 (2016)

Acknowledgements

The authors would like to thank the RDI team for sharing their datasets and especially Shaimaa Samir for testing our data on the RDI Clever Page software. The authors acknowledge partial funding from the National Science Foundation (1337866, 1421943) (to M.B.) and the Cairo Initiative Scholarship Program (to R.E.).

Author information

Authors and Affiliations

Electronics Research Institute, Cairo, Egypt
Randa Elanwar
Boston University, Boston, USA
Wenda Qin & Margrit Betke

Corresponding author

Correspondence to Randa Elanwar.

About this article

Cite this article

Elanwar, R., Qin, W. & Betke, M. Making scanned Arabic documents machine accessible using an ensemble of SVM classifiers. IJDAR 21, 59–75 (2018). https://doi.org/10.1007/s10032-018-0298-x

Received: 04 January 2017
Revised: 23 January 2018
Accepted: 23 March 2018
Published: 02 April 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10032-018-0298-x
