Abstract
Developing a complete OCR system for handwritten documents (HOCR) is challenging because of wide variation in writing styles, cursiveness, and contrast in captured text images. We introduce a three-stage pipeline consisting of (a) text detection, (b) text recognition, and (c) text-to-speech conversion, which recognizes multi-line handwritten documents and converts them to speech. For detection, we consider two state-of-the-art object detection deep neural networks, EfficientDet and Faster R-CNN (Region-based Convolutional Neural Network), followed by Weighted Boxes Fusion to obtain bounding boxes for all sentence-wise text instances in the document. The detected text instances (images) are passed to a trained hybrid CNN-RNN (Convolutional Neural Network-Recurrent Neural Network) to obtain the recognized text. The recognized text is then given as input to DeepVoice3, a state-of-the-art TTS (text-to-speech) model, which converts it to speech that is compiled into an audiobook. The resulting handwritten text detection and recognition model is comparable with the state of the art.
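The box-fusion step of the detection stage can be illustrated with a minimal sketch. It is a simplified confidence-weighted fusion in the spirit of Weighted Boxes Fusion, not the full algorithm of Solovyev et al. and not the authors' implementation. The function names (`iou`, `fuse_boxes`), the IoU threshold of 0.5, and the assumption of two detectors are illustrative choices, not values taken from the chapter.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def _weighted_mean(members: List[Detection]) -> Box:
    """Average box coordinates, weighting each box by its confidence."""
    total = sum(score for _, score in members)
    return tuple(
        sum(box[k] * score for box, score in members) / total for k in range(4)
    )


def fuse_boxes(
    detections: List[Detection], iou_thr: float = 0.5, n_models: int = 2
) -> List[Detection]:
    """Fuse boxes pooled from several detectors (hypothetical helper).

    Boxes are visited in order of decreasing confidence. Each box joins the
    first cluster whose fused box overlaps it by more than iou_thr, otherwise
    it starts a new cluster.
    """
    clusters: List[List[Detection]] = []
    fused: List[Box] = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        for i, rep in enumerate(fused):
            if iou(rep, box) > iou_thr:
                clusters[i].append((box, score))
                fused[i] = _weighted_mean(clusters[i])
                break
        else:
            clusters.append([(box, score)])
            fused.append(box)

    results: List[Detection] = []
    for rep, members in zip(fused, clusters):
        mean_score = sum(score for _, score in members) / len(members)
        # Down-weight clusters that fewer than n_models detectors agreed on.
        confidence = mean_score * min(len(members), n_models) / n_models
        results.append((rep, confidence))
    return results


# Two detectors agree on the first text line; only one finds the second.
pooled = [
    ((10.0, 10.0, 110.0, 30.0), 0.9),
    ((12.0, 12.0, 112.0, 32.0), 0.6),
    ((200.0, 50.0, 300.0, 70.0), 0.8),
]
merged = fuse_boxes(pooled)
```

Unlike non-maximum suppression, which discards all but the highest-scoring box in an overlapping group, this style of fusion uses every overlapping box to compute the final coordinates, so both detectors contribute to each fused text instance.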
B. Mondal and S. G. Dastidar—These authors contributed equally to this work.
References
Bluche, T., Louradour, J., Messina, R.: Scan, attend and read: end-to-end handwritten paragraph recognition with mdlstm attention (2016)
Chung, J., Delteil, T.: A computationally efficient pipeline approach to full page offline handwritten text recognition (2020)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: detecting scene text via instance segmentation (2018)
Dutta, K., et al.: Multi scale mirror connection based encoder decoder network for text localization. Pattern Recogn. Lett. 135, 64–71 (2020). https://doi.org/10.1016/j.patrec.2020.04.002, http://www.sciencedirect.com/science/article/pii/S0167865520301227
Dutta, K., Das, N., Kundu, M., Nasipuri, M.: Text localization in natural scene images using extreme learning machine. In: 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), pp. 1–6 (2019)
Fast, B.B., Allen, D.R.: OCR image preprocessing method for image enhancement of scanned documents. US Patent 5,594,815 (1997)
Gllavata, J., Ewerth, R., Freisleben, B.: Text detection in images based on unsupervised classification of high-frequency wavelet coefficients. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004, ICPR 2004, vol. 1, pp. 425–428 (2004). https://doi.org/10.1109/ICPR.2004.1334146
Ito, K., Johnson, L.: The lj speech dataset (2017). https://keithito.com/LJ-Speech-Dataset/
Jain, A.K., Yu, B.: Automatic text location in images and video frames. In: Proceedings Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170), vol. 2, pp. 1497–1499 (1998). https://doi.org/10.1109/ICPR.1998.711990
Kim, K.I., Jung, K., Kim, J.H.: Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1631–1639 (2003). https://doi.org/10.1109/TPAMI.2003.1251157
Lee, C.Y., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild (2016)
Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: a fast text detector with a single deep neural network (2016)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., Pietikäinen, M.: Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128(2), 261–318 (2020)
Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text line detection in handwritten documents. Pattern Recogn. 41(12), 3758–3772 (2008). https://doi.org/10.1016/j.patcog.2008.05.011, http://www.sciencedirect.com/science/article/pii/S0031320308001775
Marti, U.V., Bunke, H.: The IAM-database: an English sentence database for offline handwriting recognition. Int. J. Doc. Anal. Recogn. 5, 39–46 (2002). https://doi.org/10.1007/s100320200071
Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character recognition (OCR): a comprehensive systematic literature review (SLR). IEEE Access 8, 142642–142668 (2020). https://doi.org/10.1109/ACCESS.2020.3012542
Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010. LNCS, vol. 6494, pp. 770–783. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19318-7_60
Ping, W., et al.: Deep voice 3: scaling text-to-speech with convolutional sequence learning (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks (2016)
Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: ensembling boxes for object detection models (2020)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Tan, M., Le, Q.V.: Efficientnet: rethinking model scaling for convolutional neural networks (2020)
Tan, M., Pang, R., Le, Q.V.: Efficientdet: scalable and efficient object detection (2020)
Wojna, Z., et al.: Attention-based extraction of structured information from street view imagery. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 844–850. IEEE (2017)
Ye, Q., Doermann, D.: Text detection and recognition in imagery: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 37(7), 1480–1500 (2015). https://doi.org/10.1109/TPAMI.2014.2366765
Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems, pp. 3320–3328 (2014)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mondal, B., Dastidar, S.G., Das, N. (2022). Deep-TDRS: An Integrated System for Handwritten Text Detection-Recognition and Conversion to Speech Using Deep Learning. In: Raman, B., Murala, S., Chowdhury, A., Dhall, A., Goyal, P. (eds) Computer Vision and Image Processing. CVIP 2021. Communications in Computer and Information Science, vol 1567. Springer, Cham. https://doi.org/10.1007/978-3-031-11346-8_2
DOI: https://doi.org/10.1007/978-3-031-11346-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11345-1
Online ISBN: 978-3-031-11346-8
eBook Packages: Computer Science, Computer Science (R0)