Deep-TDRS: An Integrated System for Handwritten Text Detection-Recognition and Conversion to Speech Using Deep Learning

Computer Vision and Image Processing (CVIP 2021)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1567))

Abstract

Developing a complete OCR system for handwritten documents (HOCR) is challenging because of wide variation in writing styles, cursiveness, and contrast in captured text images. We introduce a three-stage pipeline consisting of (a) text detection, (b) text recognition, and (c) text-to-speech conversion, which recognizes multi-line handwritten documents and converts them to speech. For detection, we use two state-of-the-art object detection networks, EfficientDet and Faster R-CNN (Region-based Convolutional Neural Network), and combine their outputs with Weighted Boxes Fusion to obtain bounding boxes for all sentence-wise text instances in the document. The detected text instances (images) are passed to a hybrid CNN-RNN (Convolutional Neural Network-Recurrent Neural Network), trained for this task, which outputs the recognized text. The recognized text is then fed to DeepVoice3, a state-of-the-art TTS (text-to-speech) model, and the resulting speech is compiled into an audiobook. The performance of the developed handwritten text detection and recognition model is comparable with the state of the art.
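To make the detection stage more concrete, the following is a minimal Python sketch of the idea behind fusing boxes from two detectors. It is a simplified illustration and not the authors' implementation: the function names (`iou`, `fuse_boxes`), the IoU threshold of 0.55, and the confidence rescaling rule are assumptions chosen for the example. Boxes from all models are sorted by confidence, grouped with overlapping boxes, and each group is replaced by a confidence-weighted average box.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, float]             # (box, confidence > 0)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def _merge(cluster: List[Detection]) -> Detection:
    """Confidence-weighted average of box coordinates; mean confidence."""
    total = sum(score for _, score in cluster)
    coords = tuple(
        sum(box[k] * score for box, score in cluster) / total for k in range(4)
    )
    return coords, total / len(cluster)


def fuse_boxes(per_model: List[List[Detection]],
               iou_thr: float = 0.55) -> List[Detection]:
    """Fuse detections from several models (hypothetical helper).

    iou_thr is an assumed value, not one taken from the paper.
    """
    n_models = len(per_model)
    ordered = sorted((d for dets in per_model for d in dets),
                     key=lambda d: d[1], reverse=True)
    clusters: List[List[Detection]] = []
    fused: List[Detection] = []
    for box, score in ordered:
        # Find the fused box that overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for i, (fbox, _) in enumerate(fused):
            overlap = iou(box, fbox)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best < 0:
            clusters.append([(box, score)])
            fused.append((box, score))
        else:
            clusters[best].append((box, score))
            fused[best] = _merge(clusters[best])
    # Lower the confidence of boxes that only some of the models found.
    return [(b, s * min(len(c), n_models) / n_models)
            for (b, s), c in zip(fused, clusters)]


# Example: two detectors agree on one text line; one finds an extra line.
detector_a = [((0.0, 0.0, 10.0, 10.0), 0.9), ((50.0, 50.0, 60.0, 60.0), 0.8)]
detector_b = [((1.0, 1.0, 11.0, 11.0), 0.6)]
fused_lines = fuse_boxes([detector_a, detector_b])
```

Unlike non-maximum suppression, which discards all but the highest-scoring box in an overlapping group, this style of fusion uses every box in the group to estimate the final coordinates, which is why it suits ensembles of detectors.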

B. Mondal and S. G. Dastidar contributed equally to this work.

Author information

Corresponding author

Correspondence to Shuvayan Ghosh Dastidar.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mondal, B., Dastidar, S.G., Das, N. (2022). Deep-TDRS: An Integrated System for Handwritten Text Detection-Recognition and Conversion to Speech Using Deep Learning. In: Raman, B., Murala, S., Chowdhury, A., Dhall, A., Goyal, P. (eds) Computer Vision and Image Processing. CVIP 2021. Communications in Computer and Information Science, vol 1567. Springer, Cham. https://doi.org/10.1007/978-3-031-11346-8_2

  • DOI: https://doi.org/10.1007/978-3-031-11346-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11345-1

  • Online ISBN: 978-3-031-11346-8

  • eBook Packages: Computer Science, Computer Science (R0)
