1 Introduction

Thanks to advances in smartphone hardware, computer vision applications that once ran only in desktop environments can now be executed on smartphones. In particular, document-based applications such as automatic business card scanning and optical character recognition (OCR) have been released and are widely used. For example, CidT Co., Ltd. has released a business card management application [2], which segments the card region from an input image and then registers it in the application by performing OCR. Another example is the ABBYY Mobile OCR engine [1] released by ABBYY Co., which provides functions such as business card recognition, word search, and sentence translation based on OCR results.

Although many document-based applications are available, they share a limitation: the user must supply an image that already contains a document object, i.e., there is no detection step. This is mainly because documents have widely varying textures depending on their type, which makes building a document detector difficult. If such a detector could be built, it could serve as a service invocator for document-based applications on the smartphone.

In this paper, to build a service invocator for two widely used applications, business card scanning and OCR, we propose a combined detector of documents and business cards, in which detection is formulated as a three-class classification problem. Our detector consists of two steps:

  • Quadrilateral Region Extraction: This step extracts a candidate region of a document or business card from the input image via block-based line fitting and a largest-quadrilateral search.

  • Region Classification: The extracted region is normalized and classified into one of the document object classes (document or business card) or the negative class using a linear SVM with the Fisher vector [18].

The remainder of this paper is organized as follows. Section 2 describes the candidate region extraction method in detail, and Sect. 3 presents the region classification method. Experimental results are analyzed in Sect. 4, and Sect. 5 concludes this work.

2 Quadrilateral Region Extraction

Our region extraction method is illustrated in Fig. 1. First, the input image is partitioned into four blocks, and the probabilistic Hough transform (PHT) [14] is applied to the edge image of each block to obtain line segment candidates of the document object boundary. Finally, our method detects the boundary of the document object by searching for the largest quadrilateral under three constraints.

Fig. 1.

Overview of our region extraction method.

2.1 Image Partitioning

Our key observation for extracting a region candidate from an input image is as follows. People often take a picture of a text-bearing object instead of taking notes. An object captured for this purpose usually occupies a large portion of the image and exhibits little perspective distortion, because the user considers readability important. Following this observation, the input image is normalized to \(320 \times 240\) and partitioned into four overlapping blocks (Top/Bottom/Left/Right; see Fig. 2). Each region of interest (ROI) used to find line segments of the document object boundary is thus restricted to a size smaller than the whole image, and a line segment fit in a particular block can only be a candidate for the corresponding side of the boundary. We set the size of the top and bottom blocks to \(320 \times 80\) and that of the left and right blocks to \(100 \times 240\). This implies a basic assumption that the document object region is larger than \(120 \times 80\) (the gray region in Fig. 2).
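As a concrete illustration, the following sketch crops the four overlapping blocks with the sizes stated above; the exact overlap handling of the original implementation is an assumption.

```python
import numpy as np

def partition_blocks(img):
    """Split a normalized 320x240 image into four overlapping blocks."""
    h, w = img.shape[:2]                      # expected (240, 320)
    return {
        "top":    img[0:80, 0:w],             # 320 x 80
        "bottom": img[h - 80:h, 0:w],         # 320 x 80
        "left":   img[0:h, 0:100],            # 100 x 240
        "right":  img[0:h, w - 100:w],        # 100 x 240
    }

blocks = partition_blocks(np.zeros((240, 320), dtype=np.uint8))
```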

Fig. 2.

The partitioning method of input image: top/bottom blocks and left/right blocks have the same size, respectively.

2.2 Block-based Line Fitting

To find line segment candidates of the document object boundary, we apply the PHT [14] to the edge image of each partitioned block. Compared with the standard HT [8], the PHT reduces computing time by considering only a subset of selected pixels, instead of all pixels, in the voting procedure.

Partitioning the input image yields two advantages in the line fitting step. First, we can suppress unnecessary line segments that would otherwise be fit inside the document object, which reduces the search time for the largest quadrilateral in the later step. Second, when finding line segment candidates in each block with the PHT, we can use different parameter values, such as the minimum line segment length; for example, the minimum length in the top and bottom blocks can be set longer than that in the left and right blocks.

Figure 3 compares the results of the PHT and the block-based PHT on several test images. Different colors distinguish the line segments of different blocks in Fig. 3(b). The figure indicates that block-based processing is effective at finding the true line segments of the document object boundary.

Fig. 3.

Comparison of the block-based probabilistic Hough transform with the standard probabilistic Hough transform.

2.3 Searching Largest Quadrilateral

In this step, a RANSAC-like method [9] is used to search for the largest quadrilateral: one line segment is randomly selected from the detected candidates of each block, a quadrangle is formed from the four selected segments, and its area is calculated.

We denote by \(N_i\) the number of line segments extracted from block i, and by \(L_i=\{l^i_j \mid 1 \le j \le N_i \}\) the set of line segments of block i, where \(i \in \{\mathrm{Top}, \mathrm{Bottom}, \mathrm{Left}, \mathrm{Right}\}\). We also denote by \(\mathcal {Q} = \{Q_1,Q_2,\ldots \}\) the set of all possible quadrangles, where \(Q_k = \{l^{T}_a, l^{B}_b, l^{L}_c, l^{R}_d\}\) with \(1 \le a \le N_T\), \(1 \le b \le N_B\), \(1 \le c \le N_L\), and \(1 \le d \le N_R\). The total number of quadrangles is therefore \(|\mathcal {Q}| = N_T \times N_B \times N_L \times N_R\). To calculate the area of \(Q_k\), its line segments are extended to straight lines and their intersection points are computed; the area is then given by

$$\begin{aligned} Area(Q_k) = \frac{1}{2}pq \sin \theta , \end{aligned}$$
(1)

where p and q are the lengths of the diagonals, and \(\theta \) is the angle between them.
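In code, Eq. (1) reduces to half the magnitude of the cross product of the two diagonal vectors. A small sketch, with the corner order assumed to follow the quadrilateral's perimeter:

```python
def quad_area(p1, p2, p3, p4):
    """Area of a convex quadrilateral from its diagonals (Eq. 1):
    0.5 * |d1| * |d2| * sin(theta) = 0.5 * |d1 x d2|."""
    d1 = (p3[0] - p1[0], p3[1] - p1[1])    # diagonal p1 -> p3
    d2 = (p4[0] - p2[0], p4[1] - p2[1])    # diagonal p2 -> p4
    cross = d1[0] * d2[1] - d1[1] * d2[0]  # |d1||d2|sin(theta)
    return 0.5 * abs(cross)
```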

Using the above notations and Eq. (1), the largest quadrangle search problem can be formulated as

$$\begin{aligned} Q^* = \arg \!\max _{Q\in \mathcal {Q}} Area(Q). \end{aligned}$$
(2)

When finding the maximum quadrangle \(Q^*\), our method imposes three constraints on the quadrangle, following the observation in Sect. 2.1. First, the aspect ratio of a business card must lie between a lower and an upper threshold; in our experiments, we chose 0.55 and 0.75, respectively. Second, all interior angles of the quadrilateral must be greater than \(75^{\circ }\) and smaller than \(105^{\circ }\); in other words, each pair of opposite sides must be as parallel as possible. Third, the image boundary condition rejects quadrangles that touch the image boundary.

To find the largest quadrilateral in \(\mathcal {Q}\) under these constraints, our method iterates the largest-quadrangle search procedure until it converges or a maximum number of iterations is reached.
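An exhaustive version of the search in Eq. (2) can be sketched as follows; the `valid` predicate stands in for the three constraints above, and the random sampling and iteration cap of the actual method are omitted for brevity.

```python
import itertools

def line_intersection(s1, s2):
    """Intersection of the infinite lines through segments (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = s1
    x3, y3, x4, y4 = s2
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(den) < 1e-9:
        return None                                  # parallel lines
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    return ((a * (x3 - x4) - (x1 - x2) * b) / den,
            (a * (y3 - y4) - (y1 - y2) * b) / den)

def largest_quadrangle(top, bottom, left, right, valid=lambda q: True):
    """Evaluate all N_T x N_B x N_L x N_R combinations and keep the largest
    valid quadrangle; area computed from the diagonals as in Eq. (1)."""
    best, best_area = None, 0.0
    for t, b, l, r in itertools.product(top, bottom, left, right):
        c = [line_intersection(t, l), line_intersection(t, r),
             line_intersection(b, r), line_intersection(b, l)]
        if any(p is None for p in c):
            continue
        d1 = (c[2][0] - c[0][0], c[2][1] - c[0][1])
        d2 = (c[3][0] - c[1][0], c[3][1] - c[1][1])
        area = 0.5 * abs(d1[0] * d2[1] - d1[1] * d2[0])
        if area > best_area and valid(c):
            best, best_area = c, area
    return best, best_area

quad, area = largest_quadrangle([(0, 10, 320, 10)], [(0, 200, 320, 200)],
                                [(20, 0, 20, 240)], [(300, 0, 300, 240)])
```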

3 Region Classification

3.1 Previous Work on Document Image Classification

Before presenting our region classification method, we briefly review related work on document classification. Kang et al. [12] proposed a convolutional neural network in which ReLU (Rectified Linear Units) [15] was used as the activation function and dropout [10] was employed to prevent overfitting. Kumar et al. [13] proposed a method for measuring structural similarity for document image classification and retrieval: a SURF-based codebook is constructed, horizontal-vertical pooling is applied by recursively partitioning the image in the horizontal and vertical directions to compute features, and the resulting feature is classified by a random forest [6].

3.2 Proposed Method

Object detection is fundamentally a binary classification problem. In our case, however, we formulate it as a three-class classification problem to build the combined detector of documents and business cards: two classes correspond to document and business card, and the third is the negative class.

In this step, the region extracted in the previous step is normalized by removing perspective distortion and reducing the image size; the Fisher vector (FV) is then extracted from the normalized image and classified with a linear SVM. The proposed method is depicted in Fig. 4.

Image Normalization. Although the input image is processed at the reduced size of \(320 \times 240\) in the region extraction step, we can recover the original positions of the corner points of the extracted largest quadrangle using the scale ratios,

$$\begin{aligned} S_w = \frac{W}{320}, \qquad S_h = \frac{H}{240}, \end{aligned}$$
(3)

where \(S_w\) and \(S_h\) are the scale ratios of width and height, and W and H are the original width and height, respectively. After recovering the original positions of the four corner points, the document region is segmented and its perspective distortion is removed. The distortion-free image is then resized to \(640 \times 480\).

Fig. 4.

Overview of our region classification method.

Fisher Vector Extraction. We use the FV [18] to represent a document object. The underlying theory was proposed by Jaakkola and Haussler [11] to combine the benefits of generative and discriminative methods; the Fisher kernel (FK) is derived from a generative model and can be written as a dot product between normalized vectors \(\mathscr {G}_{\lambda }\):

$$\begin{aligned} K_{FK}(X,Y) = \mathscr {G}_{\lambda }^{X\prime } \mathscr {G}_{\lambda }^{Y}, \end{aligned}$$
(4)

where \(\mathscr {G}_{\lambda }^{X} = L_\lambda G_\lambda ^X = L_\lambda \nabla _\lambda \log u_\lambda (X)\). Here, \(\mathscr {G}_{\lambda }^{X}\) is referred to as the FV of X.

To encode the normalized document image with the FV framework, local descriptors are extracted with HOG [7] from patches centered at keypoint locations, and a GMM is fit to the local descriptors of the training samples to represent the generative model \(u_\lambda \), as in [16].
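The encoding can be sketched with scikit-learn's `GaussianMixture` as follows; this simplified version keeps only the gradients with respect to the GMM means and uses random stand-in descriptors, so it illustrates the structure of the FV rather than the exact formulation of [18].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """First-order FV: per-component mean gradients, power- and L2-normalized."""
    T = descriptors.shape[0]
    q = gmm.predict_proba(descriptors)                 # soft assignments (T, K)
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        parts.append((q[:, k:k + 1] * diff).sum(axis=0)
                     / (T * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))                      # stand-in HOG descriptors
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(300, 8)), gmm)     # one FV per image
```

The resulting vector has dimension K × D (here 3 × 8 = 24); the full FV of [18] also includes second-order (variance) gradients.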

Classification. The FV representation can be seen as an extension of the BOV (Bag of Visual Words): it not only counts the occurrences of each visual word but also encodes additional information about the distribution of the descriptors. The FV can thus be regarded as a mapping of the input into the FK space, which is why a linear SVM suffices as the region classifier.
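Training the three-class classifier is then a standard multi-class linear SVM. A sketch with scikit-learn and synthetic stand-in Fisher vectors:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Stand-in 24-dim Fisher vectors: 0 = document, 1 = business card, 2 = negative
X = np.vstack([rng.normal(loc=c, size=(50, 24)) for c in range(3)])
y = np.repeat([0, 1, 2], 50)
clf = LinearSVC(C=1.0).fit(X, y)     # one-vs-rest linear SVM
```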

4 Experiments

This section evaluates the performance of our method. We built the combined detector using the OpenCV [3] and VLFeat [5] libraries and ported it to an Android-based smartphone. Throughout all experiments, we use a Samsung Galaxy S4 smartphone [4] as the experimental device.

4.1 Data Collection

To evaluate the proposed detector, we collected a total of 2,839 images of documents, business cards, and the negative class. The dataset is summarized in Table 1.

Table 1. Information of the collected dataset.

4.2 Evaluation of Extracting Quadrilateral Region

To evaluate our region extraction method, we sampled 300 business card images and 300 document images from the collected dataset and constructed six test datasets of 100 images each; this was necessary because most images in our dataset have a simple background. The background conditions considered when constructing the six datasets are shown in Table 2.

Table 2. Six test datasets for evaluating the quadrangle region extraction.

In testing the datasets in Table 2, we use an evaluation criterion,

$$\begin{aligned} R_{overlap} = \frac{A(\text {ground-truth region} \cap \text {detected region})}{A(\text {ground-truth region} \cup \text {detected region})} \ge th, \end{aligned}$$
(5)

where \(A(\mathord {\cdot })\) is the area function and th is a threshold value; a test sample counts as a success if its overlap ratio exceeds the threshold. In our experiments, we used threshold values of 0.8 and 0.9. Tables 3 and 4 show the evaluation results on the six datasets, and Fig. 5 shows example segmentation results. The average speed on the experimental device is about 1.2 fps.
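For axis-aligned bounding boxes, the criterion of Eq. (5) is the familiar intersection-over-union. A sketch (the paper computes it over arbitrary quadrilateral regions, which would additionally require polygon clipping):

```python
def overlap_ratio(gt, det):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(gt[0], det[0]), max(gt[1], det[1])
    ix2, iy2 = min(gt[2], det[2]), min(gt[3], det[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(gt) + area(det) - inter
    return inter / union if union else 0.0
```

A detection then succeeds at threshold 0.8 when `overlap_ratio(...) >= 0.8`.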

Table 3. Segmentation rates with threshold value 0.8 on six datasets in Table 2.
Table 4. Segmentation rates with threshold value 0.9 on six datasets in Table 2.
Fig. 5.

Examples of region extraction results.

4.3 Evaluation of Region Classification

To compute local descriptors from a normalized image, we extract 300 keypoints with the FAST corner detector [17], segment a \(16 \times 16\) local patch around each keypoint, and describe the patches using HOG. When fitting the GMM to the HOG-based local descriptors of the training samples, we experimentally set the number of Gaussian components to three, the minimum value that guarantees good performance.

Using this experimental setting, we carried out 4-fold cross-validation twenty times, randomly partitioning the full dataset described in Table 1 each time. The results are shown in Fig. 6 and Table 5. The average speed on the experimental device is about 11.4 fps.

Fig. 6.

Experimental results of the twenty times of 4-fold cross validations on our dataset.

Table 5. Average precision, recall, and accuracy on our dataset.

5 Conclusion

In this paper, we have presented a combined document/business card detector consisting of two steps: quadrilateral region extraction and region classification. To extract a document object region, our method exploits block-based line fitting and a largest-quadrangle search. The extracted region is then normalized, its FV is computed from the normalized image, and the FV is classified into one of three categories (document, business card, or negative) with a linear SVM.

In this work, we have evaluated our method only on the collected dataset; in future work, we will evaluate several alternatives. We also plan to improve the quadrilateral region segmentation when no line segments are fit in some blocks.