Coarse-to-fine document localization in natural scene image with regional attention and recursive corner refinement

Zhu, Anna; Zhang, Chen; Li, Zhi; Xiong, Shengwu

doi:10.1007/s10032-019-00341-0

Special Issue Paper
Published: 08 August 2019

Volume 22, pages 351–360, (2019)

International Journal on Document Analysis and Recognition (IJDAR)

Anna Zhu¹, Chen Zhang¹, Zhi Li¹ & Shengwu Xiong¹ (ORCID: orcid.org/0000-0003-3931-9687)


Abstract

Document localization is a promising step for document-based optical character recognition. The task becomes more difficult when documents appear in complex natural scene images. In this paper, we propose a coarse-to-fine document localization approach that detects the four corner points of a document in natural scene images. In the first stage, the four corners are roughly predicted by a deep neural network-based Joint Corner Detector (JCD) with an attention mechanism, which roughly localizes the document region via an attentional map. Key to accurate corner inference, the JCD module substantially suppresses background interference in the convolutional features. In the second stage, a corner-specific refiner module refines the previously predicted corners. Because the four document corners have different characteristics, the patches cropped around the predicted corners are fed into four different corner-specific CNN models, which search for the accurate corner locations recursively. Three datasets (the ICDAR 2015 SmartDoc competition 1 dataset, the SEECS-NUSF dataset and a self-collected dataset) are used to evaluate the performance of our method. The experimental results demonstrate the superiority of the proposed method in localizing documents in natural images, especially those with complex backgrounds. Our method outperforms most state-of-the-art works.
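
The second-stage idea described in the abstract (crop a patch around a coarse corner estimate, predict a better location inside the patch, then repeat on a smaller patch) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the patch sizes, the shrink factor and the stopping rule are all assumptions made for illustration, and `refiner` is a hypothetical stand-in for one of the four corner-specific CNN models. A toy refiner that picks the brightest pixel is included only so the loop can be run end to end.

```python
import numpy as np


def crop_patch(image, center, half):
    """Crop a square patch around center=(x, y), clamped to the image bounds.

    Returns the patch and the (x0, y0) image coordinates of its top-left pixel.
    """
    h, w = image.shape[:2]
    x, y = int(round(center[0])), int(round(center[1]))
    x0, y0 = max(0, x - half), max(0, y - half)
    x1, y1 = min(w, x + half), min(h, y + half)
    return image[y0:y1, x0:x1], (x0, y0)


def refine_corner(image, corner, refiner, half=32, shrink=0.5, min_half=4, max_iters=10):
    """Refine one corner estimate by repeated crop-and-predict steps.

    `refiner` is a hypothetical callable: it receives a patch and returns the
    predicted corner as (x, y) in patch pixel coordinates. The defaults for
    `half`, `shrink`, `min_half` and `max_iters` are illustrative assumptions.
    """
    h, w = image.shape[:2]
    # Keep the starting point inside the image so the first patch is non-empty.
    corner = np.clip(np.asarray(corner, dtype=float), [0, 0], [w - 1, h - 1])
    for _ in range(max_iters):
        patch, (x0, y0) = crop_patch(image, corner, half)
        px, py = refiner(patch)
        # Map the patch-local prediction back to image coordinates.
        corner = np.array([x0 + px, y0 + py], dtype=float)
        half = int(half * shrink)
        if half < min_half:
            break
    return corner


def refine_document(image, coarse_corners, refiners, **kwargs):
    """Apply one refiner per corner, e.g. top-left, top-right, bottom-right, bottom-left."""
    return [refine_corner(image, c, r, **kwargs) for c, r in zip(coarse_corners, refiners)]


def brightest_pixel(patch):
    """Toy refiner (assumption, not a CNN): return the brightest pixel as (x, y)."""
    row, col = np.unravel_index(np.argmax(patch), patch.shape)
    return float(col), float(row)
```

In this sketch each iteration halves the search window, so a large first patch tolerates a rough coarse estimate while later patches narrow the search. In practice, `refiner` would be replaced by a trained model for the corresponding corner type.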

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China under Grant 61703316 and the National Key R&D Program of China (Grant No. 2017YFB1402200).

Author information

Authors and Affiliations

Wuhan University of Technology, Wuhan, China
Anna Zhu, Chen Zhang, Zhi Li & Shengwu Xiong

Corresponding author

Correspondence to Shengwu Xiong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, A., Zhang, C., Li, Z. et al. Coarse-to-fine document localization in natural scene image with regional attention and recursive corner refinement. IJDAR 22, 351–360 (2019). https://doi.org/10.1007/s10032-019-00341-0

Received: 15 November 2018
Revised: 12 March 2019
Accepted: 26 July 2019
Published: 08 August 2019
Issue Date: September 2019
DOI: https://doi.org/10.1007/s10032-019-00341-0
