1 Introduction

Thanks to advances in smartphone hardware, computer vision applications that once ran only in desktop environments can now be executed on smartphones. In particular, document-based applications such as automatic business card scanning and optical character recognition (OCR) have been released and are widely used. For example, CidT Co., Ltd. has released a business card management application [2], which segments the card region from an input image and then registers it in the application by performing OCR. Another example is the ABBYY Mobile OCR engine [1] released by ABBYY Co., which provides functions such as business card recognition, word search, and sentence translation based on OCR results.

Although many document-based applications are available, they share a limitation: the user must supply an image that already contains a document object, i.e., there is no detection step. This is mainly because documents have widely varying textures depending on their type, which makes building a document detector difficult. If such a detector could be built, it could serve as a service invocator for document-based applications on the smartphone.

In this paper, to build a service invocator for two widely used applications, business card scanning and OCR, we propose a combined detector of documents and business cards, in which detection is formulated as a three-class classification problem. Our detector consists of two steps:

  • Quadrilateral Region Extraction: This step extracts a candidate region of a document or business card from the input image via block-based line fitting and a largest-quadrilateral search.

  • Region Classification: The extracted region is normalized and classified into one of the document object classes (document or business card) or the negative class using a linear SVM with the Fisher vector [18].

The remainder of this paper is organized as follows. Section 2 describes the candidate region extraction method in detail, and Sect. 3 presents the region classification method. Experimental results are analyzed in Sect. 4, and Sect. 5 concludes this work.

2 Quadrilateral Region Extraction

Our region extraction method is illustrated in Fig. 1. First, the input image is partitioned into four blocks, and the probabilistic Hough transform (PHT) [14] is applied to the edge image of each block to obtain line segment candidates of the document object boundary. Finally, our method detects the boundary of the document object by searching for the largest quadrilateral under three constraints.

Fig. 1.

Overview of our region extraction method.

2.1 Image Partitioning

Our key observation for extracting a region candidate from an input image is as follows. People often take a picture of a text-bearing object instead of taking notes. An object captured for this purpose usually occupies a large portion of the image and exhibits little perspective distortion, because the user considers readability important. Following this observation, the input image is normalized to \(320 \times 240\) and partitioned into four overlapping blocks (Top/Bottom/Left/Right; see Fig. 2). Each region of interest (ROI) used to find line segments of the document object boundary is thus restricted to a size smaller than the whole image, and a line segment fit in a particular block can only be a candidate for the corresponding side of the boundary. We set the size of the top and bottom blocks to \(320 \times 80\) and that of the left and right blocks to \(100 \times 240\). This implies a basic assumption that the document object region is larger than \(120 \times 80\) (the gray region in Fig. 2).
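As a concrete illustration, the following sketch crops the four overlapping blocks with the sizes stated above; the exact overlap handling of the original implementation is an assumption.

```python
import numpy as np

def partition_blocks(img):
    """Split a normalized 320x240 image into four overlapping blocks."""
    h, w = img.shape[:2]                      # expected (240, 320)
    return {
        "top":    img[0:80, 0:w],             # 320 x 80
        "bottom": img[h - 80:h, 0:w],         # 320 x 80
        "left":   img[0:h, 0:100],            # 100 x 240
        "right":  img[0:h, w - 100:w],        # 100 x 240
    }

blocks = partition_blocks(np.zeros((240, 320), dtype=np.uint8))
```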

Fig. 2.

The partitioning method of input image: top/bottom blocks and left/right blocks have the same size, respectively.

2.2 Block-based Line Fitting

To find line segment candidates of the document object boundary, we apply the PHT [14] to the edge image of each partitioned block. Compared with the standard HT [8], the PHT reduces computing time by considering only a subset of selected pixels, instead of all pixels, in the voting procedure.

Partitioning the input image yields two advantages in the line fitting step. First, we can suppress unnecessary line segments that would otherwise be fit inside the document object, which reduces the search time for the largest quadrilateral in the later step. Second, when finding line segment candidates in each block with the PHT, we can use different parameter values, such as the minimum line segment length; for example, the minimum length in the top and bottom blocks can be set longer than that in the left and right blocks.

Figure 3 compares the results of the PHT and the block-based PHT on several test images. Different colors distinguish the line segments of different blocks in Fig. 3(b). The figure indicates that block-based processing is effective at finding the true line segments of the document object boundary.

Fig. 3.

Comparison of the block-based probabilistic Hough transform with the standard probabilistic Hough transform.

2.3 Searching Largest Quadrilateral

In this step, a RANSAC-like method [9] is used to search for the largest quadrilateral: one line segment is randomly selected from the detected candidates of each block, a quadrangle is formed from the four selected segments, and its area is calculated.

We denote by \(N_i\) the number of line segments extracted from block i, and by \(L_i=\{l^i_j \mid 1 \le j \le N_i \}\) the set of line segments of block i, where \(i \in \{\mathrm{Top}, \mathrm{Bottom}, \mathrm{Left}, \mathrm{Right}\}\). We also denote by \(\mathcal {Q} = \{Q_1,Q_2,\ldots \}\) the set of all possible quadrangles, where \(Q_k = \{l^{T}_a, l^{B}_b, l^{L}_c, l^{R}_d\}\) with \(1 \le a \le N_T\), \(1 \le b \le N_B\), \(1 \le c \le N_L\), and \(1 \le d \le N_R\). The total number of quadrangles is therefore \(|\mathcal {Q}| = N_T \times N_B \times N_L \times N_R\). To calculate the area of \(Q_k\), its line segments are extended to straight lines and their intersection points are computed; the area is then given by

$$\begin{aligned} Area(Q_k) = \frac{1}{2}pq \sin \theta , \end{aligned}$$
(1)

where p and q are the lengths of the diagonals, and \(\theta \) is the angle between them.
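In code, Eq. (1) reduces to half the magnitude of the cross product of the two diagonal vectors. A small sketch, with the corner order assumed to follow the quadrilateral's perimeter:

```python
def quad_area(p1, p2, p3, p4):
    """Area of a convex quadrilateral from its diagonals (Eq. 1):
    0.5 * |d1| * |d2| * sin(theta) = 0.5 * |d1 x d2|."""
    d1 = (p3[0] - p1[0], p3[1] - p1[1])    # diagonal p1 -> p3
    d2 = (p4[0] - p2[0], p4[1] - p2[1])    # diagonal p2 -> p4
    cross = d1[0] * d2[1] - d1[1] * d2[0]  # |d1||d2|sin(theta)
    return 0.5 * abs(cross)
```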

Using the above notations and Eq. (1), the largest quadrangle search problem can be formulated as

$$\begin{aligned} Q^* = \arg \!\max _{Q\in \mathcal {Q}} Area(Q). \end{aligned}$$
(2)

When finding the maximum quadrangle \(Q^*\), our method imposes three constraints on the quadrangle, following the observation in Sect. 2.1. First, the aspect ratio of a business card must lie between a lower and an upper threshold; in our experiments, we chose 0.55 and 0.75, respectively. Second, all interior angles of the quadrilateral must be greater than \(75^{\circ }\) and smaller than \(105^{\circ }\); in other words, each pair of opposite sides must be as parallel as possible. Third, the image boundary condition rejects quadrangles that touch the image boundary.

To find the largest quadrilateral in \(\mathcal {Q}\) under these constraints, our method iterates the largest-quadrangle search procedure until it converges or a maximum number of iterations is reached.
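An exhaustive version of the search in Eq. (2) can be sketched as follows; the `valid` predicate stands in for the three constraints above, and the random sampling and iteration cap of the actual method are omitted for brevity.

```python
import itertools

def line_intersection(s1, s2):
    """Intersection of the infinite lines through segments (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = s1
    x3, y3, x4, y4 = s2
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(den) < 1e-9:
        return None                                  # parallel lines
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / den,
            (a * (y3 - y4) - (y1 - y2) * b) / den)

def largest_quadrangle(top, bottom, left, right, valid=lambda q: True):
    """Evaluate all N_T x N_B x N_L x N_R combinations and keep the largest
    valid quadrangle; area computed from the diagonals as in Eq. (1)."""
    best, best_area = None, 0.0
    for t, b, l, r in itertools.product(top, bottom, left, right):
        c = [line_intersection(t, l), line_intersection(t, r),
             line_intersection(b, r), line_intersection(b, l)]
        if any(p is None for p in c):
            continue
        d1 = (c[2][0] - c[0][0], c[2][1] - c[0][1])
        d2 = (c[3][0] - c[1][0], c[3][1] - c[1][1])
        area = 0.5 * abs(d1[0] * d2[1] - d1[1] * d2[0])
        if area > best_area and valid(c):
            best, best_area = c, area
    return best, best_area

quad, area = largest_quadrangle([(0, 10, 320, 10)], [(0, 200, 320, 200)],
                                [(20, 0, 20, 240)], [(300, 0, 300, 240)])
```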

3 Region Classification

3.1 Previous Work on Document Image Classification

Before presenting our region classification method, we briefly review related work on document classification. Kang et al. [12] proposed a convolutional neural network in which ReLU (Rectified Linear Units) [15] was used as the activation function and dropout [10] was employed to prevent overfitting. Kumar et al. [13] proposed a method for measuring structural similarity for document image classification and retrieval: a SURF-based codebook is constructed, horizontal-vertical pooling is applied by recursively partitioning the image in the horizontal and vertical directions to compute features, and the resulting feature is classified by a random forest [6].

3.2 Proposed Method

Object detection is fundamentally a binary classification problem. In our case, however, we formulate it as a three-class classification problem to build the combined detector of documents and business cards: two classes correspond to document and business card, and the third is the negative class.

In this step, the region extracted in the previous step is normalized by removing perspective distortion and reducing the image size; the Fisher vector (FV) is then extracted from the normalized image and classified with a linear SVM. The proposed method is depicted in Fig. 4.

Image Normalization. Although the input image is processed at the reduced size of \(320 \times 240\) in the region extraction step, we can recover the original positions of the corner points of the extracted largest quadrangle using the scale ratios,

$$\begin{aligned} S_w = \frac{W}{320}, \qquad S_h = \frac{H}{240}, \end{aligned}$$
(3)

where \(S_w\) and \(S_h\) are the scale ratios of width and height, and W and H are the original width and height, respectively. After recovering the original positions of the four corner points, the document region is segmented and its perspective distortion is removed. The distortion-free image is then resized to \(640 \times 480\).

Fig. 4.

Overview of our region classification method.

Fisher Vector Extraction. We use the FV [18] to represent a document object. The underlying theory was proposed by Jaakkola and Haussler [11] to combine the benefits of generative and discriminative methods; the Fisher kernel (FK) is derived from a generative model and can be written as a dot product between normalized vectors \(\mathscr {G}_{\lambda }\):

$$\begin{aligned} K_{FK}(X,Y) = \mathscr {G}_{\lambda }^{X\prime } \mathscr {G}_{\lambda }^{Y}, \end{aligned}$$
(4)

where \(\mathscr {G}_{\lambda }^{X} = L_\lambda G_\lambda ^X = L_\lambda \nabla _\lambda \log u_\lambda (X)\). Here, \(\mathscr {G}_{\lambda }^{X}\) is referred to as the FV of X.

To encode the normalized document image with the FV framework, local descriptors are extracted with HOG [7] from patches centered at keypoint locations, and a GMM is fit to the local descriptors of the training samples to represent the generative model \(u_\lambda \), as in [16].
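The encoding can be sketched with scikit-learn's `GaussianMixture` as follows; this simplified version keeps only the gradients with respect to the GMM means and uses random stand-in descriptors, so it illustrates the structure of the FV rather than the exact formulation of [18].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """First-order FV: per-component mean gradients, power- and L2-normalized."""
    T = descriptors.shape[0]
    q = gmm.predict_proba(descriptors)                 # soft assignments (T, K)
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        parts.append((q[:, k:k + 1] * diff).sum(axis=0)
                     / (T * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))                      # stand-in HOG descriptors
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(300, 8)), gmm)     # one FV per image
```

The resulting vector has dimension K × D (here 3 × 8 = 24); the full FV of [18] also includes second-order (variance) gradients.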

Classification. The FV representation can be seen as an extension of the BOV (Bag of Visual Words): it not only counts the occurrences of each visual word but also encodes additional information about the distribution of the descriptors. The FV can thus be regarded as a mapping of the input into the FK space, which is why a linear SVM suffices as the region classifier.
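Training the three-class classifier is then a standard multi-class linear SVM. A sketch with scikit-learn and synthetic stand-in Fisher vectors:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Stand-in 24-dim Fisher vectors: 0 = document, 1 = business card, 2 = negative
X = np.vstack([rng.normal(loc=c, size=(50, 24)) for c in range(3)])
y = np.repeat([0, 1, 2], 50)
clf = LinearSVC(C=1.0).fit(X, y)     # one-vs-rest linear SVM
```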

4 Experiments

This section evaluates the performance of our method. We built the combined detector using the OpenCV [3] and VLFeat [5] libraries and ported it to an Android-based smartphone. Throughout all experiments, we use a Samsung Galaxy S4 smartphone [4] as the experimental device.

4.1 Data Collection

To evaluate the proposed detector, we collected a total of 2,839 images of documents, business cards, and the negative class. The dataset is summarized in Table 1.

Table 1. Information of the collected dataset.

4.2 Evaluation of Extracting Quadrilateral Region

To evaluate our region extraction method, we sampled 300 business card images and 300 document images from the collected dataset and constructed six test datasets of 100 images each; this was necessary because most images in our dataset have a simple background. The background conditions considered when constructing the six datasets are shown in Table 2.

Table 2. Six test datasets for evaluating the quadrangle region extraction.

In testing the datasets in Table 2, we use an evaluation criterion,

$$\begin{aligned} R_{overlap} = \frac{A(\text {ground-truth region} \cap \text {detected region})}{A(\text {ground-truth region} \cup \text {detected region})} \ge th, \end{aligned}$$
(5)

where \(A(\mathord {\cdot })\) is the area function and th is a threshold value; a test sample counts as a success if its overlap ratio exceeds the threshold. In our experiments, we used threshold values of 0.8 and 0.9. Tables 3 and 4 show the evaluation results on the six datasets, and Fig. 5 shows example segmentation results. The average speed on the experimental device is about 1.2 fps.
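For axis-aligned bounding boxes, the criterion of Eq. (5) is the familiar intersection-over-union. A sketch (the paper computes it over arbitrary quadrilateral regions, which would additionally require polygon clipping):

```python
def overlap_ratio(gt, det):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(gt[0], det[0]), max(gt[1], det[1])
    ix2, iy2 = min(gt[2], det[2]), min(gt[3], det[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(gt) + area(det) - inter
    return inter / union if union else 0.0
```

A detection then succeeds at threshold 0.8 when `overlap_ratio(...) >= 0.8`.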

Table 3. Segmentation rates with threshold value 0.8 on six datasets in Table 2.
Table 4. Segmentation rates with threshold value 0.9 on six datasets in Table 2.
Fig. 5.

Examples of region extraction results.

4.3 Evaluation of Region Classification

To compute local descriptors from a normalized image, we extract 300 keypoints with the FAST corner detector [17], segment a \(16 \times 16\) local patch around each keypoint, and describe the patches using HOG. When fitting the GMM to the HOG-based local descriptors of the training samples, we experimentally set the number of Gaussian components to three, the minimum value that guarantees good performance.

Using this experimental setting, we carried out 4-fold cross-validation twenty times, randomly partitioning the full dataset described in Table 1 each time. The results are shown in Fig. 6 and Table 5. The average speed on the experimental device is about 11.4 fps.

Fig. 6.

Experimental results of the twenty times of 4-fold cross validations on our dataset.

Table 5. Average precision, recall, and accuracy on our dataset.

5 Conclusion

In this paper, we have presented a combined document/business card detector consisting of two steps: quadrilateral region extraction and region classification. To extract a document object region, our method exploits block-based line fitting and a largest-quadrangle search. The extracted region is then normalized, its FV is computed from the normalized image, and the FV is classified into one of three categories (document, business card, or negative) with a linear SVM.

In this work, we have evaluated our method only on the collected dataset; in future work, we will evaluate several alternatives. We also plan to improve the quadrilateral region segmentation when no line segments are fit in some blocks.