Abstract
In this paper, we present a novel combined detector of documents and business cards. To detect a document or business card, our method first extracts a document object region from a given image and then classifies it as positive or negative. In the extraction step, block-based processing is exploited to efficiently find line segment candidates of the object boundary, and a RANSAC-like method under three constraints is used to search for the real boundary. In the classification step, after normalizing the extracted region, a Fisher vector is extracted to represent the document object, which is then classified by a linear SVM. To evaluate the proposed method, we carry out experiments on a collected image set and show that our method achieves about 94% accuracy.
1 Introduction
Thanks to advances in smartphone hardware, computer vision applications that used to run only in desktop environments have become executable on smartphones. In particular, document-based applications such as automatic business card scanning and optical character recognition (OCR) have been released and are widely used by smartphone users. For example, CidT Co., Ltd. has released a business card management application [2], which segments the card region from an input image and then registers it into the application by performing OCR. Another example is the ABBYY Mobile OCR Engine [1] released by ABBYY Co., which provides functions such as business card recognition, word search, and sentence translation based on OCR results.
Even though many document-based applications are available, they share a limitation: the user has to provide a positive image that already contains a document object, i.e., there is no detection procedure. This is mainly because documents have widely varying textures depending on the document type, which makes building a document detector hard. If such a detector can be built, it can serve as a service invocator for document-based applications on the smartphone.
In this paper, to build a service invocator for two widely used applications, business card scanning and OCR, we propose a combined detector of documents and business cards, in which the detection problem is formulated as a three-class classification problem. Our detector consists of two steps:
- Quadrilateral Region Extraction: This procedure extracts a region candidate of a document or business card from an input image via block-based line fitting and largest-quadrilateral search.
- Region Classification: The region extracted in the previous step is normalized and classified into one of the document object classes (document or business card) or the negative class by a linear SVM with the Fisher vector [18].
The remainder of this paper is organized as follows. Sect. 2 describes the candidate region extraction method in detail, Sect. 3 presents the region classification method, Sect. 4 analyzes the experimental results, and Sect. 5 concludes this work.
2 Quadrilateral Region Extraction
The region extraction method of this paper is illustrated in Fig. 1. First, an input image is partitioned into four blocks, and the probabilistic Hough transform (PHT) [14] is applied to the edge image of each block to obtain line segment candidates of the document object boundary. Finally, our method detects the boundary of the document object by searching for the largest quadrilateral under three constraints.
2.1 Image Partitioning
When extracting a region candidate from an input image, our key observation is as follows. One often takes a picture of a text-containing object instead of taking notes, and an object captured for this purpose usually occupies a large area of the image and shows little perspective distortion, because readability matters in this case. Following this observation, the input image is normalized to \(320 \times 240\) and partitioned into four blocks (Top/Bottom/Left/Right) with some overlapping regions (see Fig. 2). This means that each region of interest (ROI) used to find the line segments of the document object boundary is restricted to a specific size smaller than the whole image. In other words, a line segment fit in a particular block can only be a candidate for the corresponding side of the document object boundary. We set the size of the top and bottom blocks to \(320 \times 80\) and that of the left and right blocks to \(100 \times 240\). This implies the basic assumption that the document object region is larger than \(120 \times 80\) (the gray-colored region in Fig. 2).
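The block layout above can be sketched as follows. This is a minimal illustration of the partitioning (function name and ROI representation are our own; the sizes are those stated in the text). Note that the four ROIs overlap in the corners and leave an uncovered \(120 \times 80\) centre, matching the minimum assumed object size.

```python
def partition_blocks(width=320, height=240):
    """Partition a normalized image into four overlapping boundary blocks.

    Sizes follow the paper: top/bottom blocks are 320x80, left/right blocks
    are 100x240. Each ROI is returned as (x, y, w, h). The region not covered
    by any block is the 120x80 centre.
    """
    return {
        "top":    (0, 0, width, 80),
        "bottom": (0, height - 80, width, 80),
        "left":   (0, 0, 100, height),
        "right":  (width - 100, 0, 100, height),
    }
```

With the default \(320 \times 240\) size, the gap between the left block's right edge (x = 100) and the right block's left edge (x = 220) is 120 pixels, and the gap between the top and bottom blocks is 80 pixels, i.e., exactly the gray-colored region of Fig. 2.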
2.2 Block-based Line Fitting
To find line segment candidates of the document object boundary, we apply the PHT [14] to the edge image of each partitioned block. Compared with the standard Hough transform (HT) [8], the PHT reduces computing time by considering only a subset of selected pixels, instead of all pixels, in the voting procedure.
In the line fitting step, partitioning the input image brings two advantages. First, we can avoid unnecessary line segments that would otherwise be fit inside the document object region, which reduces the search time for the largest quadrilateral in a later step. Second, when finding line segment candidates in each block with the PHT, we can use different parameter values, such as the minimum length of a line segment. For example, we can set the minimum segment length in the top and bottom blocks longer than that in the left and right blocks.
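The per-block parameterization can be sketched as a post-filter on the segments a PHT returns for each block. The concrete threshold values below are illustrative assumptions (the paper only says horizontal blocks use longer minimum lengths than vertical ones), and the function names are ours:

```python
import math

# Hypothetical per-block minimum segment lengths (pixels). The paper states
# only that top/bottom thresholds exceed left/right ones; these values are
# our own illustrative choice.
MIN_LEN = {"top": 120, "bottom": 120, "left": 60, "right": 60}

def seg_length(seg):
    """Euclidean length of a segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)

def filter_segments(block, segments):
    """Keep only segments long enough to be a boundary candidate for `block`."""
    return [s for s in segments if seg_length(s) >= MIN_LEN[block]]
```

In practice the same effect can be obtained by passing a per-block `minLineLength` directly to the PHT call instead of filtering afterwards.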
Figure 3 compares several test images processed with the plain PHT and with the block-based PHT; different colors distinguish the line segments of different blocks in Fig. 3(b). The figure indicates that our block-based processing finds the true line segments of the document object boundary well.
2.3 Searching Largest Quadrilateral
In this step, a RANSAC-like method [9] is used to search for the largest quadrilateral: one line segment is randomly selected from the detected candidates of each block, a quadrangle is formed by the four selected segments, and its area is calculated.
We denote by \(N_i\) the number of line segments extracted from block i, and by \(L_i=\{l^i_j \mid 1 \le j \le N_i \}\) the set of line segments of block i, where \(i \in \{\mathrm{Top}, \mathrm{Bottom}, \mathrm{Left}, \mathrm{Right}\}\). We also denote by \(\mathcal {Q} = \{Q_1,Q_2,\ldots,Q_k \}\) the set of all possible quadrangles, where \(Q_k = \{l^{T}_a, l^{B}_b, l^{L}_c, l^{R}_d\}\) with \(1 \le a \le N_T\), \(1 \le b \le N_B\), \(1 \le c \le N_L\), \(1 \le d \le N_R\). The total number of quadrangles is then \(|\mathcal {Q}| = N_T \times N_B \times N_L \times N_R\). To calculate the area of \(Q_k\), its line segments are extended to straight lines and their intersection points are computed. The area is then given by
$$\mathrm{Area}(Q_k) = \frac{1}{2}\, p\, q \sin\theta, \quad (1)$$
where p and q are the lengths of the diagonals and \(\theta \) is their included angle.
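The diagonal-based area formula can be evaluated directly with the 2-D cross product of the two diagonals, whose magnitude equals \(p\,q\sin\theta\). A minimal sketch (the function name and the assumption that corners are given in order TL, TR, BR, BL are ours):

```python
def quad_area(corners):
    """Area of a simple quadrilateral via its diagonals: A = (1/2) p q sin(theta).

    `corners` are the four intersection points of the extended boundary lines,
    assumed in order (top-left, top-right, bottom-right, bottom-left). The
    cross product of the diagonal vectors gives |p||q|sin(theta) directly.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    px, py = x2 - x0, y2 - y0          # diagonal p: TL -> BR
    qx, qy = x3 - x1, y3 - y1          # diagonal q: TR -> BL
    cross = px * qy - py * qx          # = |p| |q| sin(theta)
    return 0.5 * abs(cross)
```

For the unit square the diagonals have length \(\sqrt{2}\) and meet at \(90^{\circ}\), so the formula yields \(\frac{1}{2}\sqrt{2}\sqrt{2} = 1\), as expected.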
Using the above notations and Eq. (1), the largest quadrangle search problem can be formulated as
$$Q^* = \mathop{\mathrm{arg\,max}}_{Q_k \in \mathcal{Q}} \mathrm{Area}(Q_k). \quad (2)$$
When finding the maximum quadrangle \(Q^*\), our method imposes three constraints on the quadrangle, following the observation in Sect. 2.1. First, the aspect ratio of a business card must lie between a lower and an upper threshold value; in our experiments we chose 0.55 and 0.75, respectively. Second, all interior angles of the quadrilateral must be greater than \(75^{\circ }\) and smaller than \(105^{\circ }\); in other words, each pair of opposite sides must be as close to parallel as possible. Third, an image boundary condition rejects quadrangles that touch the image boundary.
To find the largest quadrilateral in \(\mathcal {Q}\) under these constraints, our method iterates the largest-quadrangle search procedure until it converges or a specified number of iterations is reached.
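The three constraints can be sketched as a predicate applied to each candidate quadrangle's corner points. The thresholds (0.55/0.75 aspect, \(75^{\circ}\)/\(105^{\circ}\) angles) come from the text; the helper names, the bounding-box-based aspect estimate, and the exact boundary test are our own simplifying assumptions:

```python
import math

def angle_at(p, a, b):
    """Interior angle (degrees) at vertex p formed with neighbours a and b."""
    v1 = (a[0] - p[0], a[1] - p[1])
    v2 = (b[0] - p[0], b[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / n))))

def satisfies_constraints(corners, img_w, img_h, lo=0.55, hi=0.75):
    """Check the three constraints on a candidate quadrangle (sketch).

    `corners` are four points in cyclic order. Aspect ratio is approximated
    as bounding-box height/width, an assumption for this sketch.
    """
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    # 1) aspect ratio within [lo, hi] (business-card case from the text)
    if not lo <= h / w <= hi:
        return False
    # 2) every interior angle strictly between 75 and 105 degrees
    for i in range(4):
        ang = angle_at(corners[i], corners[i - 1], corners[(i + 1) % 4])
        if not 75.0 < ang < 105.0:
            return False
    # 3) reject quadrangles touching the image boundary
    return all(0 < x < img_w - 1 and 0 < y < img_h - 1 for x, y in corners)
```

The RANSAC-like loop then repeatedly samples one segment per block, forms the quadrangle, and keeps the largest candidate for which this predicate holds.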
3 Region Classification
3.1 Previous Work on Document Image Classification
Before presenting our region classification method, we briefly review related work on document classification. Various approaches have been proposed. Kang et al. [12] used a convolutional neural network in which the ReLU (Rectified Linear Unit) [15] was the activation function and dropout [10] was employed to prevent overfitting. Kumar et al. [13] proposed a method measuring structural similarity for document image classification and retrieval: a SURF-based codebook is constructed, horizontal-vertical pooling is applied by recursively partitioning the image in the horizontal and vertical directions to compute features, and the resulting feature is classified by a random forest [6].
3.2 Proposed Method
Object detection is basically a binary classification problem. In our case, however, we formulate it as a three-class classification problem to build the combined detector: two classes for document and business card, and one negative class.
In this step, the region extracted in the previous step is normalized by removing perspective distortion and reducing the image size; the Fisher vector (FV) is then extracted from the normalized image and classified by a linear SVM. The proposed method is depicted in Fig. 4.
Image Normalization. In the region extraction step, although the input image is processed at the reduced size of \(320 \times 240\), we can recover the original positions of the corner points of the extracted largest quadrangle via the scale ratios
$$S_w = \frac{W}{320}, \qquad S_h = \frac{H}{240},$$
where \(S_w\) and \(S_h\) are the scale ratios of width and height, and W and H are the original width and height, respectively. After finding the real positions of the four corner points, the document region is segmented and its perspective distortion is removed. Then, the size of the distortion-removed image is normalized to \(640 \times 480\).
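The corner recovery step is a simple per-axis rescaling. A minimal sketch (function name is ours; the \(320 \times 240\) processing size is from the text):

```python
def recover_corners(corners, orig_w, orig_h, proc_w=320, proc_h=240):
    """Map corner points found at processing resolution back to the original.

    Uses the scale ratios S_w = W / 320 and S_h = H / 240 described in the
    text; `corners` is a list of (x, y) points at processing resolution.
    """
    sw = orig_w / proc_w
    sh = orig_h / proc_h
    return [(x * sw, y * sh) for (x, y) in corners]
```

The recovered corners then feed a standard perspective warp (e.g., a homography to the \(640 \times 480\) target) to remove the distortion.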
Fisher Vector Extraction. We propose to use the FV [18] to represent a document object. Its background theory was originally proposed by Jaakkola and Haussler [11] to combine the benefits of generative and discriminative methods, deriving the Fisher kernel (FK) from a generative model. The FK can be written as a dot-product between normalized gradient vectors \(\mathscr {G}_{\lambda }\):
$$K(X, Y) = {\mathscr {G}_{\lambda }^{X}}^{\top}\, \mathscr {G}_{\lambda }^{Y},$$
where \(\mathscr {G}_{\lambda }^{X} = L_\lambda G_\lambda ^X = L_\lambda \nabla _\lambda \log u_\lambda (X)\). Here, \(\mathscr {G}_{\lambda }^{X}\) is referred to as the FV of X.
To encode the normalized document image in the FV framework, local descriptors are extracted with HOG [7] from patches segmented at keypoint locations on the image, and a GMM is fit to the local descriptors of the training samples to represent the generative model \(u_\lambda \), as in [16].
Classification. The FV representation can be seen as an extension of the BOV (bag of visual words): it not only counts the occurrences of each visual word but also encodes additional information about the distribution of the descriptors. The FV can thus be regarded as a mapping of the input into the FK space, which is why we use a linear SVM as the region classifier.
4 Experiments
In this section, we evaluate the performance of our method. We built the combined detector using the OpenCV [3] and VLFeat [5] libraries and ported it to an Android-based smartphone. Throughout all experiments, we use a Samsung Galaxy S4 smartphone [4] as the experimental device.
4.1 Data Collection
To evaluate the proposed detector, we collected a total of 2,839 images of documents, business cards, and negative-class scenes. The dataset statistics are shown in Table 1.
4.2 Evaluation of Extracting Quadrilateral Region
To evaluate our region extraction method, we sampled 300 business card images and 300 document images from the collected dataset and created six test datasets of 100 images each, since most images in our dataset have a simple background. In constructing the six test datasets, we consider the background condition of the images as shown in Table 2.
In testing the datasets in Table 2, we use an overlap-based evaluation criterion,
$$\frac{A(R_{gt} \cap R_{ext})}{A(R_{gt} \cup R_{ext})} \ge th,$$
where \(A(\mathord {\cdot })\) and th are the area function and the threshold value, respectively, and \(R_{gt}\) and \(R_{ext}\) denote the ground-truth and extracted regions. Our method decides success or failure of a test sample depending on whether the overlap ratio exceeds the threshold. In our experiments, we used 0.8 and 0.9 as threshold values. Tables 3 and 4 show the evaluation results on the six datasets, and Fig. 5 shows some segmentation examples. The average speed on the experimental device is about 1.2 fps.
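The overlap criterion can be sketched for the axis-aligned case as an intersection-over-union computation (the paper's regions are quadrilaterals; restricting to axis-aligned boxes here is a simplifying assumption, and the function names are ours):

```python
def overlap_ratio(gt, det):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(gt[0], det[0]), max(gt[1], det[1])
    ix2, iy2 = min(gt[2], det[2]), min(gt[3], det[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(gt) + area(det) - inter
    return inter / union

def is_success(gt, det, th=0.8):
    """Decide success/failure of one test sample at threshold th."""
    return overlap_ratio(gt, det) >= th
```

For general quadrilaterals the same ratio is computed with polygon intersection/union areas instead of box arithmetic.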
4.3 Evaluation of Region Classification
To calculate local descriptors from a normalized image, we extract 300 keypoints with the FAST corner detector [17], segment a \(16 \times 16\) local patch around each keypoint, and describe the patches using HOG. When fitting the GMM to the HOG-based local descriptors of the training samples, we experimentally set the number of Gaussian components to three, the minimum value that guarantees good performance.
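The patch segmentation step can be sketched as follows. How the paper handles keypoints near the image border is not stated; skipping them, as done below, is our own assumption, and the function name is hypothetical:

```python
def extract_patches(image, keypoints, size=16):
    """Cut a size x size patch centred on each (x, y) keypoint.

    `image` is a 2-D list of pixel rows. Keypoints whose patch would extend
    past the border are skipped (an assumption; the paper does not specify
    the border policy). Each returned patch is a size x size list of lists.
    """
    h, w = len(image), len(image[0])
    half = size // 2
    patches = []
    for (x, y) in keypoints:
        if half <= x <= w - half and half <= y <= h - half:
            patches.append([row[x - half:x + half]
                            for row in image[y - half:y + half]])
    return patches
```

Each patch is then fed to the HOG descriptor, and the resulting descriptors are encoded as a single FV per image.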
Using these experimental settings, we carried out 4-fold cross validation twenty times, randomly dividing the total dataset described in Table 1. The results are shown in Fig. 6 and Table 5. The average speed on the experimental device is about 11.4 fps.
5 Conclusion
In this paper, we have presented a combined document/business card detector consisting of two steps: quadrilateral region extraction and region classification. To extract a document object region, our method exploits block-based line fitting and largest-quadrangle search. The extracted region is then normalized, the FV is extracted from the normalized image, and the region is classified by a linear SVM into one of three categories: document, business card, or negative.
So far we have evaluated our method only on the collected dataset; in future work we will evaluate several alternatives. We also plan to improve the quadrilateral region segmentation method for the case where no line segments are fit in some blocks.
References
ABBYY Mobile OCR Engine. http://www.abbyy.com/mobile-ocr
CamCard. http://www.camcard.com
OpenCV. http://www.opencv.org
Samsung Galaxy S4. http://www.samsung.com/global/microsite/galaxys4/
VLFeat. http://www.vlfeat.org
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893 (2005)
Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580 (2012)
Jaakkola, T.S., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 487–493 (1999)
Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: Proceeding of 22nd International Conference on Pattern Recognition, pp. 3168–3172 (2014)
Kumar, J., Ye, P., Doermann, D.: Structural similarity for document image classification and retrieval. Pattern Recogn. Lett. 43(1), 119–126 (2014)
Matas, J., Galambos, C., Kittler, J.: Robust detection of lines using the progressive probabilistic Hough transform. Comput. Vis. Image Underst. 78(1), 119–137 (2000)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
Rosten, E., Drummond, T.W.: Machine learning for high-speed corner detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 430–443. Springer, Heidelberg (2006)
Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. 105(3), 222–245 (2013)
Acknowledgments
The research was supported by the Implementation of Technologies for Identification, Behavior, and Location of Human based on Sensor Network Fusion Program through the Ministry of Trade, Industry and Energy (Grant Number: 10041629), and also by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A2A01004282).
© 2015 Springer International Publishing Switzerland
Kim, YJ., Kim, Y., Kang, BN., Kim, D. (2015). Combined Document/Business Card Detector for Proactive Document-Based Services on the Smartphone. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science(), vol 9492. Springer, Cham. https://doi.org/10.1007/978-3-319-26561-2_47
Print ISBN: 978-3-319-26560-5
Online ISBN: 978-3-319-26561-2