Abstract
Scene text analysis involves detecting and processing text in natural scene images for a variety of purposes. The problem is challenging because of noise, blur, heterogeneous intensity variation and similar degradations. The ultimate goal is to make a detected scene word recognizable by any standard Optical Character Recognition (OCR) system, which necessitates effective scene word binarization. Several methods address scene text detection, but comparatively few address scene word binarization. The existing binarization methods, moreover, are not robust against image-quality-related complexities, which results in low precision. Here, a novel approach to scene word binarization, called Bi-level Overlapped Binning (BOB), is proposed, in which the intensities of the color channels R, G and B are grouped, or binned, to generate several candidate solutions in the form of binary images. The stable binary images are identified, and the image solutions obtained from them are classified as text or non-text using a standard classifier trained with some popular features. Finally, the resultant text solutions are combined probabilistically to obtain the binarized output. The proposed method is evaluated on standard datasets, namely SVT, ICDAR-2003, ICDAR-2011 (Scene), ICDAR-2011 (BDI), KAIST and Total-Text, achieving precisions of 0.76, 0.87, 0.89, 0.85, 0.84 and 0.87 respectively, which are mostly better than those of the state-of-the-art methods.








Acknowledgements
This work is partially supported by the CMATER research laboratory of the Computer Science and Engineering Department, Jadavpur University, India, and by the PURSE-II and UPE-II projects. SB is partially funded by DBT grant (BT/PR16356/BID/7/596/2016). RS, SB and AFM are partially funded by DST grant (EMR/2016/007213).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Lemma 1: Let D be the difference between the highest and the lowest PixelValue in a foreground region R, and let S be any bin size such that S ≥ D. Then, for all (i, j) ∈ R, \( {\left({\varDelta}_k^s\right)}_{ij}=1 \) for some bin number \( k\in \left[1,2m-1\right] \), where \( m=\left\lfloor \frac{PV_{max}}{S}\right\rfloor +1 \).
Given Definitions:
1. Definition of a bin image \( {B}_k^s \), where \( {\left({B}_k^s\right)}_{ij} \) is the value at coordinate (i, j) of the image:
2. Definition of a Delta image \( {\varDelta}_i^s \):
3. Definition of a foreground region R: A foreground region R is a set of two or more points in an image I such that for every point p in R, there exists a point q in R such that q is in N8(p), where N8(p) is the set of 8-connected neighbors of p.
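The proof that follows states that bin k spans (k − 1) ∗ S to k ∗ S and that bin k + m spans (k − 1) ∗ S + S/2 to k ∗ S + S/2. A minimal sketch of bin images built from those ranges is given here for orientation only. It is not the authors' implementation: the half-open intervals, the function names `bin_range` and `bin_images`, the default `pv_max=255` and the use of a single-channel NumPy array are all assumptions made for illustration.

```python
import numpy as np


def bin_range(k, S, m):
    """Half-open intensity range [lo, hi) of bin k (1-based).

    Assumption: bins 1..m are Level 1 and bins m+1..2m-1 are Level 2,
    shifted by S/2, matching the ranges quoted in the proof.
    """
    if 1 <= k <= m:
        lo = (k - 1) * S                  # Level 1: (k-1)*S .. k*S
    elif m < k <= 2 * m - 1:
        lo = (k - m - 1) * S + S / 2      # Level 2: shifted by half a bin
    else:
        raise ValueError("bin number must lie in [1, 2m-1]")
    return lo, lo + S


def bin_images(channel, S, pv_max=255):
    """Return {k: binary image} for all 2m-1 bins of one color channel."""
    m = pv_max // S + 1                   # m = floor(PV_max / S) + 1
    images = {}
    for k in range(1, 2 * m):
        lo, hi = bin_range(k, S, m)
        # A pixel is 1 in bin image k when its value falls inside bin k.
        images[k] = ((channel >= lo) & (channel < hi)).astype(np.uint8)
    return images


if __name__ == "__main__":
    channel = np.array([[0, 40, 70], [100, 130, 255]])
    for k, img in bin_images(channel, S=64).items():
        print(k, img.tolist())
```

With S = 64 and `pv_max = 255`, m = 4, so the sketch produces four Level 1 bins and three Level 2 bins. A pixel with value 40 is set in bin 1 (0 to 64) and in bin 5 (32 to 96).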
Proof: Let the minimum PixelValue for the region R be Pmin for pixel coordinate (xa, ya) and maximum be Pmax for pixel coordinate (xb, yb) such that Pmax − Pmin = D.
Let,
From the definition of (\( {B}_k^s \))ij in Eq. (12), \( {\left({B}_k^s\right)}_{XaYa}=1 \). All pixels with PixelValue equal to Pmin in R will occur as positives (value = 1) in binary image \( {B}_k^s \).
(Now)
(Thus,)
(Since,)
Also,
Thus, from Eq. (15) and Eq. (16) we get,
Since k is an integer and \( \left\lfloor \frac{P_{max}}{S}\right\rfloor +1 \) is an integer.
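The inequalities behind this step can be written out explicitly. The block below is a reconstruction from the surrounding statements, not the original typeset equations, so the original equation numbers are not reproduced. It assumes that k denotes the Level 1 bin containing Pmin, which is how the rest of the proof uses it.

```latex
% Assumption: k is the Level 1 bin that contains P_min.
\[
k=\left\lfloor \frac{P_{min}}{S}\right\rfloor +1
\quad\Longleftrightarrow\quad
(k-1)\ast S\le P_{min}<k\ast S
\]
% Upper bound, using P_max - P_min = D <= S.
\[
P_{max}=P_{min}+D\le P_{min}+S<(k+1)\ast S
\quad\Longrightarrow\quad
\left\lfloor \frac{P_{max}}{S}\right\rfloor +1\le k+1
\]
% Lower bound, using P_max >= P_min.
\[
P_{max}\ge P_{min}\ge (k-1)\ast S
\quad\Longrightarrow\quad
\left\lfloor \frac{P_{max}}{S}\right\rfloor +1\ge k
\]
% Combining the two bounds.
\[
k\le \left\lfloor \frac{P_{max}}{S}\right\rfloor +1\le k+1
\]
```

Because both sides are integers, the bin number of Pmax can only be k or k + 1, which gives the two cases treated next.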
The pixels in foreground region R with value Pmax will fall either in the same bin as the pixels with value Pmin (bin number k) or in the immediately following bin (bin number k + 1). Since the difference D does not exceed the bin size S, the region R spreads over either a single bin or two adjacent bins:
In that case, both \( \left\lfloor \frac{P_{min}}{S}\right\rfloor +1=k \) (from Eq. (3)) and \( \left\lfloor \frac{P_{max}}{S}\right\rfloor +1=k \) hold.
For any (i, j) ∈ R,
Since k is an integer and \( \left\lfloor \frac{PixelValue\left(i,j\right)}{S}\right\rfloor +1 \) is an integer, this is the only possible solution.
(Now,)
Since
We have to consider two different sub cases depending on the value of Pmax
Let us consider bin number k + m, which is the Level 2 bin that overlaps with the right half of bin number k. From the lemma statement, \( m=\left\lfloor \frac{PV_{max}}{S}\right\rfloor +1 \).
(Now,)
From Eq. (19),
and from Eq. (18): \( \left\lfloor \frac{P_{max}}{S}\right\rfloor +1=k+1 \)
From definition of (\( {\mathrm{B}}_k^s \))ij in Eq. (12), we get:
Bin k has values ranging from (k − 1) ∗ S to k ∗ S, and bin k + m has values ranging from \( \left(k-1\right)\ast S+\frac{S}{2} \) to \( k\ast S+\frac{S}{2} \).
From definition of \( {\Delta}_k^s \):
Thus, from Eq. (24)
Rearranging Eq. (19) we get
And because Pmin lies in Bin k from Eq. (14):
Now, any PixelValue in region R will lie between Pmax and Pmin.
From Eqs (25) and (28), we thus prove that \( {\left({\boldsymbol{\Delta}}_{\boldsymbol{k}}^{\boldsymbol{s}}\right)}_{\boldsymbol{ij}}=\mathbf{1} \) for all (i, j) ∈ R
Since D ≤ S and \( {P}_{min}\ge {P}_{max}-S\ge \left(k\ast S\right)+\frac{S}{2}-S=\left(k-1\right)\ast S+\frac{S}{2} \), hence,
And from Eq. (14):
Hence, from (28) and (29) we get:
From the definition of (\( {\mathrm{B}}_k^s \))ij: \( {\left({\mathrm{B}}_{k+m}^s\right)}_{\mathrm{XbYb}}=1 \)
Bin k + 1 has values ranging from k ∗ S to (k + 1) ∗ S, and bin k + m has values ranging from \( \left(k-1\right)\ast S+\frac{S}{2} \) to \( k\ast S+\frac{S}{2} \).
And from the definition of \( {\Delta}_{k+m}^s \):
Thus,
From Eq. (18) we have
And for Pmin, from Eq. (30), we get,
Now, any PixelValue in region R will lie between Pmax and Pmin
From Eqs. (37) and (39), we thus prove that \( {\left({\boldsymbol{\Delta}}_{\boldsymbol{k}}^{\boldsymbol{s}}\right)}_{\boldsymbol{ij}}=\mathbf{1} \) for all (i, j) ∈ R
Thus, the lemma holds in all cases.
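The case analysis above pairs bin k with bin k + m in one sub-case and bin k + m with bin k + 1 in the other. The coverage this implies can be checked by brute force: every intensity range with Pmax − Pmin ≤ S should fit inside a single bin or inside one such pair of half-overlapping bins. The sketch below tests only that coverage and does not model the Delta image. Half-open bins, integer pixel values, and the names `bin_range`, `covering_bins` and `check_all` are assumptions for illustration.

```python
def bin_range(k, S, m):
    """Half-open range [lo, hi) of bin k; Level 2 bins (k > m) start S/2 later."""
    lo = (k - 1) * S if k <= m else (k - m - 1) * S + S / 2
    return lo, lo + S


def covering_bins(p_min, p_max, S, m):
    """Bins assigned by the case analysis, assuming p_max - p_min <= S."""
    k = p_min // S + 1                    # Level 1 bin of P_min
    if p_max // S + 1 == k:               # case 1: P_max is also in bin k
        return (k,)
    if p_max < k * S + S / 2:             # case 2, P_max in left half of bin k+1
        return (k, k + m)
    return (k + m, k + 1)                 # case 2, P_max in right half of bin k+1


def check_all(S, pv_max=255):
    """Exhaustively verify coverage for all integer ranges of width <= S."""
    m = pv_max // S + 1
    for p_min in range(pv_max + 1):
        for p_max in range(p_min, min(p_min + S, pv_max) + 1):
            bins = covering_bins(p_min, p_max, S, m)
            if not all(1 <= b <= 2 * m - 1 for b in bins):
                return False
            ranges = [bin_range(b, S, m) for b in bins]
            for v in range(p_min, p_max + 1):
                if not any(lo <= v < hi for lo, hi in ranges):
                    return False
    return True


if __name__ == "__main__":
    print(covering_bins(3, 12, 10, 26))   # S = 10, m = 255 // 10 + 1 = 26
    print(check_all(16), check_all(50))
```

For S = 10 and m = 26, `covering_bins(3, 12, 10, 26)` returns `(1, 27)`: values 3 to 9 lie in bin 1 (0 to 10), and values 10 to 12 lie in bin 27 (5 to 15).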
Cite this article
Dutta, I.N., Chakraborty, N., Mollah, A.F. et al. BOB: a bi-level overlapped binning procedure for scene word binarization. Multimed Tools Appl 80, 7609–7635 (2021). https://doi.org/10.1007/s11042-020-09785-7
DOI: https://doi.org/10.1007/s11042-020-09785-7