A Two-Stage Approach for Text and Non-text Separation from Handwritten Scientific Document Images

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 699)

Abstract

Non-text components in a document image degrade the output of an optical character recognition (OCR)-based document analysis system, so separating text from non-text has become an essential task in document image processing. To address this issue, the present work develops a simple two-stage method for separating text and non-text components in images of handwritten scientific documents. Before the actual process starts, connected components are extracted from the document pages. In the first stage, some commonly occurring components are identified and separated out as graphics. In the second stage, the remaining components are passed through feature extraction and subsequent classification. When the system is evaluated on handwritten scientific document images, 87.16% of the components are correctly classified as text or non-text.
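The flow described in the abstract (connected component extraction, a first-stage filter for graphics, then feature extraction and classification) can be illustrated with a minimal sketch. The abstract does not state which components are filtered in the first stage, which features are extracted, or which classifier is used, so everything specific here is an assumption made for illustration: the elongation rule in `stage_one_is_graphic`, the `aspect` and `density` features, the threshold stand-in for a classifier, and all function names and threshold values.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground (1) components as lists of (row, col)."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Hypothetical features: bounding-box aspect ratio and fill density."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {"aspect": width / height,
            "density": len(pixels) / (width * height)}


def stage_one_is_graphic(feat, max_aspect=8.0):
    """Assumed stage-one rule: very elongated components (long strokes,
    ruled lines) are treated as graphics. Not taken from the paper."""
    return feat["aspect"] >= max_aspect or 1 / feat["aspect"] >= max_aspect


def stage_two_classify(feat, density_threshold=0.9):
    """Stand-in for a trained classifier: a single assumed density threshold."""
    return "non-text" if feat["density"] >= density_threshold else "text"


def separate(img):
    """Label each connected component as 'text' or 'non-text'."""
    labels = []
    for pixels in connected_components(img):
        feat = describe(pixels)
        if stage_one_is_graphic(feat):
            labels.append("non-text")  # removed in stage one
        else:
            labels.append(stage_two_classify(feat))  # decided in stage two
    return labels


# Toy binary page: a long horizontal line, an L-shaped stroke, a solid block.
page = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(separate(page))
```

Splitting the work this way means the cheap first-stage rule handles the easy cases, and only the remaining components reach the more expensive feature and classification step. In a real system the threshold in `stage_two_classify` would be replaced by a classifier trained on labelled components.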

Author information

Corresponding author

Correspondence to Soumyadeep Kundu.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Bhowmik, S., Kundu, S., De, B.K., Sarkar, R., Nasipuri, M. (2019). A Two-Stage Approach for Text and Non-text Separation from Handwritten Scientific Document Images. In: Chandra, P., Giri, D., Li, F., Kar, S., Jana, D. (eds) Information Technology and Applied Mathematics. Advances in Intelligent Systems and Computing, vol 699. Springer, Singapore. https://doi.org/10.1007/978-981-10-7590-2_3
