Deep-TDRS: An Integrated System for Handwritten Text Detection-Recognition and Conversion to Speech Using Deep Learning

Mondal, Bisakh; Dastidar, Shuvayan Ghosh; Das, Nibaran

doi:10.1007/978-3-031-11346-8_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1567)

Included in the following conference series:

International Conference on Computer Vision and Image Processing

Abstract

Developing a complete OCR system for handwritten documents (HOCR) is challenging because of wide variation in writing styles, cursiveness, and contrast in captured text images. We introduce a three-stage pipeline consisting of (a) text detection, (b) text recognition, and (c) text-to-speech conversion, which performs HOCR on multi-line documents and converts the result to speech. For detection, we consider two state-of-the-art object detection deep neural networks, EfficientDet and Faster R-CNN (Region-based Convolutional Neural Network), followed by Weighted Boxes Fusion, to obtain bounding boxes for all sentence-wise text instances in the document. The detected text instances (images) are passed to a trained hybrid CNN-RNN (CNN-Recurrent Neural Network) to obtain the recognized text. The recognized text is then fed to DeepVoice3, a state-of-the-art TTS (text-to-speech) model, which converts it to speech that is compiled into an audiobook. The resulting handwritten text detection and recognition model is comparable with the state of the art.
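To make the detection stage more concrete, the following is a minimal sketch of how boxes from two detectors could be merged in the spirit of Weighted Boxes Fusion. It is a simplified illustration, not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the IoU threshold of 0.55, and the confidence rescaling rule are all assumptions made for this example, and class labels are ignored because every detection here is a text instance.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Detection = Tuple[Box, float]            # (box, confidence), confidence > 0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster: List[Detection]) -> Box:
    """Confidence-weighted average of the box coordinates in a cluster."""
    total = sum(score for _, score in cluster)
    return tuple(
        sum(box[i] * score for box, score in cluster) / total for i in range(4)
    )


def weighted_boxes_fusion(
    per_model: List[List[Detection]], iou_thr: float = 0.55
) -> List[Detection]:
    """Fuse detections from several models (simplified, single class)."""
    n_models = len(per_model)
    # Pool every model's detections and visit them from most to least confident.
    pooled = sorted(
        (det for dets in per_model for det in dets),
        key=lambda det: det[1],
        reverse=True,
    )
    clusters: List[List[Detection]] = []  # raw detections per fused box
    fused: List[Box] = []                 # current fused box per cluster
    for det in pooled:
        # Find the fused box that overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for idx, fused_box in enumerate(fused):
            overlap = iou(det[0], fused_box)
            if overlap > best_iou:
                best, best_iou = idx, overlap
        if best < 0:
            # No sufficient overlap: start a new cluster.
            clusters.append([det])
            fused.append(det[0])
        else:
            # Join the cluster and recompute its weighted-average box.
            clusters[best].append(det)
            fused[best] = fuse_cluster(clusters[best])
    results = []
    for box, cluster in zip(fused, clusters):
        mean_score = sum(score for _, score in cluster) / len(cluster)
        # Down-weight boxes that only some of the models agreed on.
        scale = min(len(cluster), n_models) / n_models
        results.append((box, mean_score * scale))
    return sorted(results, key=lambda det: det[1], reverse=True)
```

Unlike non-maximum suppression, which keeps one box and discards its overlapping neighbours, this approach averages the overlapping boxes, so a text line found by both detectors yields a single box influenced by both predictions. A line found by only one of two detectors is kept, but with its confidence halved under the rescaling rule assumed here.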

B. Mondal and S. G. Dastidar contributed equally to this work.

Author information

Authors and Affiliations

Jadavpur University, Kolkata, India
Bisakh Mondal, Shuvayan Ghosh Dastidar & Nibaran Das

Corresponding author

Correspondence to Shuvayan Ghosh Dastidar.

Editor information

Editors and Affiliations

Indian Institute of Technology Roorkee, Roorkee, India
Balasubramanian Raman
Indian Institute of Technology Ropar, Ropar, India
Subrahmanyam Murala
Jadavpur University, Kolkata, India
Ananda Chowdhury
Indian Institute of Technology Ropar, Ropar, India
Abhinav Dhall
Indian Institute of Technology Ropar, Ropar, India
Puneet Goyal

About this paper

Cite this paper

Mondal, B., Dastidar, S.G., Das, N. (2022). Deep-TDRS: An Integrated System for Handwritten Text Detection-Recognition and Conversion to Speech Using Deep Learning. In: Raman, B., Murala, S., Chowdhury, A., Dhall, A., Goyal, P. (eds) Computer Vision and Image Processing. CVIP 2021. Communications in Computer and Information Science, vol 1567. Springer, Cham. https://doi.org/10.1007/978-3-031-11346-8_2

DOI: https://doi.org/10.1007/978-3-031-11346-8_2
Published: 24 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11345-1
Online ISBN: 978-3-031-11346-8
eBook Packages: Computer Science, Computer Science (R0)
