
1 Introduction

Nowadays, more and more printed books are accompanied by electronic resources, including videos, audio, games, augmented reality and other mobile apps. However, accessing most of these electronic resources is not very convenient, as the association between printed books and electronic resources is not automatically available [1]. Take accessing an accompanying video as an example: one must first find the video file corresponding to the book, open it in a video player, and then repeatedly fast-forward or rewind to locate the exact position relevant to a certain book page. Besides the fact that this task may take an adult several minutes to complete, it is often a challenge for very young children and elderly people. There is a pressing need to associate printed books with their accompanying electronic resources so that these resources can be accessed quickly and conveniently.

The major issue in associating printed books with accompanying electronic resources is to automatically identify book pages. Once a page is identified, it can be mapped to its corresponding video/audio position or to a certain scenario of an app/game via a table/database [2]. Existing page identification methods can be generally divided into three categories: Optical IDentification (OID), Page Identifier (PI), and Computer Vision (CV) methods.

The OID-based method usually relies on a device called a “talking pen” [3], which reads and identifies invisible codes printed with infrared-reflective ink. As OID-based methods can discriminate about 600,000 different codes, a vast number of pages can be identified reliably. However, this technology is hard to popularize because it requires expensive ink and specially made hardware.

The PI-based method identifies book pages by recognizing an additional page identifier printed on each page. Jeong et al. [4] print a specially designed page identifier on each page of a book and identify book pages by comparing the characteristics of the captured page identifier with a database. Baik [5] regards the two-dimensional code as an ambient media gate to the digital world, and has developed a “scan-to-watch” application that accesses TV programs by scanning two-dimensional codes on printed materials. Although PI-based technology can be easily integrated into mobile apps and page identifiers are quite robust to recognize, it has two disadvantages: (1) Page identifiers more or less detract from the aesthetics of books; (2) Books without printed page identifiers cannot be handled by PI-based methods.

The CV-based method treats page identification as an image retrieval problem, i.e. taking an image of a printed book page and then finding the most similar reference image in a registration dataset in which each reference image has already been mapped to a book page. Iwata et al. [6] use four-directional feature fields to identify book covers for a small-scale library system. Tsai et al. [7] employ Speeded Up Robust Features (SURF) to recognize CD covers. Chae et al. [2] use a mobile phone to take sequences of images of printed materials, and then retrieve the reference image from a database by a keypoint-based matching and tracking method. CV-based methods do not require additional identifiers to be printed on books, so they can be used to identify pages of any book. However, the identification accuracy of most existing CV-based methods cannot provide a satisfactory user experience. Recently, convolutional neural networks have made impressive progress in many fields of computer vision, including image retrieval [8]. This progress makes it possible to improve the performance of CV-based book page identification.

This paper presents a book page identification method based on convolutional neural networks (CNNs). As collecting and labelling millions of book page images to train a CNN is time-consuming, a pipeline is proposed that makes a CNN trained on a task-unrelated dataset usable for book page identification. Experimental results on a challenging testing dataset show that the proposed book page identification method achieves a top-5 hit rate of 98.93%.

2 The Proposed Method

As shown in Fig. 1, the pipeline of the proposed book page identification method has five building blocks: (1) An image segmentation module to separate the book page from the background; (2) An image correction module to correct geometry and color distortions; (3) A feature extraction module to extract discriminative image features with a pre-trained CNN; (4) A feature compression module to reduce feature dimensions for speed; and (5) A feature matching module to calculate the similarity between the query image and each reference image and then find the most similar reference image. In the offline phase, each reference image only needs to be processed by the feature extraction module and the feature compression module to obtain a compressed feature code. The feature codes of all reference images are stored in a matrix. In the online phase, a query image is processed by all five modules.

Fig. 1. The pipeline of the proposed book page identification method.

2.1 Book Page Segmentation

The background seriously affects the performance of book page identification, as the abundant visual information it contains may be encoded into the feature code by the CNN. Therefore, the book page needs to be separated from the background.

Many interactive image segmentation algorithms [9, 10] have been proposed in the last decade. Given a bounding box drawn around an object of interest, these algorithms can separate the object from the background. However, the bounding-box-drawing interaction degrades the user experience. Although some image segmentation algorithms [11, 12] can initialize the bounding box of an object of interest automatically, none of them provides real-time processing speed on mainstream smart phones and other consumer electronics.

In this subsection, a coarse-to-fine strategy is proposed to segment the book page from the background fully automatically at real-time processing speed. As illustrated in Fig. 2, the proposed image segmentation algorithm consists of three steps: (1) Coarse segmentation, which segments the book page at the pixel level using a fixed bounding box initialization; (2) Bounding box re-initialization, which provides a more accurate bounding box for fine segmentation; (3) Fine segmentation, which produces the final result.

Fig. 2. Coarse-to-fine image segmentation. (a) The procedure of the proposed fully automatic image segmentation algorithm. (b) An original query image with a fixed bounding box initialization. (c) Image segmentation result using the initial bounding box in (b). (d) A new bounding box re-initialized after coarse segmentation. (e) Fine image segmentation result using the bounding box in (d).

A color-histogram-based Bayes classifier is employed for coarse segmentation. Let \( H_{O}(b) \) and \( H_{B}(b) \) denote the b-th bin of the non-normalized histograms computed over the initial bounding box (O) and its surrounding background region (B), respectively. Additionally, let \( b_{x} \) denote the bin assigned to the pixel \( I(x) \) at location x. Bayes' rule [13] is applied to obtain the object likelihood:

$$ p(x \in O \mid O, B, b_{x}) \approx \frac{p(b_{x} \mid x \in O)\, p(x \in O)}{\sum\nolimits_{\Omega \in \{ O, B\}} p(b_{x} \mid x \in \Omega)\, p(x \in \Omega)} $$
(1)

In particular, the likelihood terms in (1) are estimated directly from the color histograms, i.e. \( p(b_{x} \mid x \in O) \approx H_{O}(b_{x})/|O| \) and \( p(b_{x} \mid x \in B) \approx H_{B}(b_{x})/|B| \). Furthermore, the prior probability can be approximated as \( p(x \in O) \approx |O|/(|O| + |B|) \). Thus, the Bayes classifier simplifies to:

$$ p(x \in O \mid O, B, b_{x}) \approx \frac{H_{O}(b_{x})}{H_{O}(b_{x}) + H_{B}(b_{x})} $$
(2)

The pixels of the book page can be coarsely separated from the background by (2) at very low computational cost. Then, a new bounding box is fitted to the coarse segmentation result using least-squares approximation. Finally, the book page is segmented by the DenseCut algorithm [10], a high-quality image segmentation technique that processes about 15 images per second on general consumer electronics.
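The coarse segmentation step can be sketched as follows, assuming an RGB uint8 query image and a fixed initial bounding box. The 16-bin joint RGB histogram, the 0.5 likelihood threshold, and the simple axis-aligned box refit are illustrative assumptions (the paper fits the new bounding box by least-squares approximation), and the final DenseCut refinement is omitted.

```python
import numpy as np

def coarse_segment(img, box, bins=16):
    """Per-pixel object likelihood via Eq. (2): H_O(b_x) / (H_O(b_x) + H_B(b_x)).

    img : H x W x 3 uint8 RGB query image
    box : (x0, y0, x1, y1) fixed initial bounding box assumed to contain the page
    """
    # Quantize each pixel into a joint RGB histogram bin index b_x.
    q = (img.astype(np.int64) * bins) // 256                   # per-channel bin in [0, bins)
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]    # H x W joint bin indices

    x0, y0, x1, y1 = box
    inside = np.zeros(img.shape[:2], dtype=bool)
    inside[y0:y1, x0:x1] = True

    # Non-normalized histograms over the bounding box (O) and its surrounding region (B).
    n_bins = bins ** 3
    h_o = np.bincount(idx[inside], minlength=n_bins).astype(np.float64)
    h_b = np.bincount(idx[~inside], minlength=n_bins).astype(np.float64)

    # Eq. (2); the small epsilon avoids division by zero for empty bins.
    likelihood = h_o[idx] / (h_o[idx] + h_b[idx] + 1e-9)
    return likelihood > 0.5                                    # coarse foreground mask

def refit_box(mask):
    """Re-initialize the bounding box from the coarse mask (assumes a non-empty mask)."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```

The refined bounding box would then be passed to DenseCut [10] for the fine segmentation.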

2.2 Image Correction

There are mainly two kinds of distortions in the original query images: geometry distortion and color distortion. If these distortions are not corrected, the performance of book page identification suffers significantly. In this subsection, geometry distortion and color distortion are corrected in a single pass. As illustrated in Fig. 3, the geometrically distorted book page in the original query image is converted to a square one by perspective transformation, and meanwhile the distorted colors of the book page are corrected to those that would appear under a canonical light source by chromatic adaptation.

Fig. 3. An example of image distortion correction. (a) An original query image, in which the book page is distorted in both geometry and color. (b) A quadrilateral is fitted around the contour of the segmented book page and is used to correct the geometry distortion by perspective transformation. (c) The ambient illumination is estimated from the original image and is used to correct the colors of all pixels of the book page. (d) The corrected image of the book page.

Since it is difficult to hold a handheld camera exactly facing the plane of a book page, the rectangular book page in a query image is usually distorted into a quasi-quadrilateral, mainly due to perspective projection. Thus, a perspective transformation is used to convert the quasi-quadrilateral book page into a square one. Let \( (x_{s}, y_{s}) \) denote a point in the corrected image. The perspective transformation maps \( (x_{s}, y_{s}) \) back to its corresponding point \( (x_{q}, y_{q}) \) in the original image:

$$ \left\{ \begin{array}{l} x_{q} = \dfrac{a_{11} x_{s} + a_{21} y_{s} + a_{31}}{a_{13} x_{s} + a_{23} y_{s} + a_{33}} \\[2.2ex] y_{q} = \dfrac{a_{12} x_{s} + a_{22} y_{s} + a_{32}}{a_{13} x_{s} + a_{23} y_{s} + a_{33}} \end{array} \right. $$
(3)

where \( \{ a_{11}, a_{12}, a_{13}; a_{21}, a_{22}, a_{23}; a_{31}, a_{32}, a_{33} = 1 \} \) are the elements of the 3 × 3 transformation matrix. This transformation matrix needs to be estimated from image cues.

To compute the transformation matrix, at least four point correspondences between the original image and the corrected image need to be established. To this end, a quadrilateral enclosing the contour of the segmented book page is fitted using least-squares approximation (see Fig. 3(b)). This yields four point correspondences, i.e. \( \{ (Q_{0}, S_{0}), (Q_{1}, S_{1}), (Q_{2}, S_{2}), (Q_{3}, S_{3}) \} \) in Fig. 3, which are substituted into (3) to determine the transformation matrix.
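Assuming the four corners of the fitted quadrilateral are available in a consistent order, the geometry correction can be sketched with OpenCV as below; the 224-pixel output size is chosen here to match the CNN input and is an assumption rather than something fixed by the text.

```python
import cv2
import numpy as np

def correct_perspective(img, quad, size=224):
    """Warp the quasi-quadrilateral page region to a size x size square.

    quad : 4 x 2 array with corners Q0..Q3 of the fitted quadrilateral, ordered to
           match the square corners S0..S3 (top-left, top-right, bottom-right, bottom-left).
    """
    src = np.asarray(quad, dtype=np.float32)
    dst = np.array([[0, 0], [size - 1, 0],
                    [size - 1, size - 1], [0, size - 1]], dtype=np.float32)
    # Solves for the 3 x 3 matrix of Eq. (3) from the four point correspondences.
    m = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, m, (size, size))
```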

Color distortion in the original query image is mainly caused by the ambient illumination. Once the ambient illumination is estimated, the query image can be corrected to an image that appears to have been recorded under a canonical illumination using chromatic adaptation [14]:

$$ \left[ \begin{array}{c} R_{s} \\ G_{s} \\ B_{s} \end{array} \right] = \left[ \begin{array}{ccc} \frac{1}{\sqrt{3}\, R_{e}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{3}\, G_{e}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{3}\, B_{e}} \end{array} \right] \left[ \begin{array}{c} R_{q} \\ G_{q} \\ B_{q} \end{array} \right] $$
(4)

where \( [R_{q}, G_{q}, B_{q}]^{T} \) and \( [R_{s}, G_{s}, B_{s}]^{T} \) are the pixel colors in the original query image and the corrected image, respectively, and \( [R_{e}, G_{e}, B_{e}]^{T} \) is the ambient illumination, which needs to be estimated.

Computational color constancy [14, 15] is a powerful tool for estimating the ambient illumination from a single image. Considering the trade-off between illumination estimation accuracy and computational efficiency, the gray-edge computational color constancy algorithm [14] is adopted. This algorithm assumes that the average edge difference in a scene is achromatic. Based on this hypothesis, the ambient illumination is estimated as:

$$ \left[ \begin{array}{c} R_{e} \\ G_{e} \\ B_{e} \end{array} \right] = \frac{1}{C} \left[ \begin{array}{c} \left( \sum_{x \in [0,w),\, y \in [0,h)} \left( \nabla R_{q}(x,y) \right)^{p} \right)^{1/p} \\ \left( \sum_{x \in [0,w),\, y \in [0,h)} \left( \nabla G_{q}(x,y) \right)^{p} \right)^{1/p} \\ \left( \sum_{x \in [0,w),\, y \in [0,h)} \left( \nabla B_{q}(x,y) \right)^{p} \right)^{1/p} \end{array} \right] $$
(5)

where w and h are the width and height of the original query image, \( \nabla(\cdot) \) denotes the gradient map of the original query image, p is a parameter, and C is a normalization coefficient. In the implementation, p is set to 5.

Once the perspective transformation matrix is computed and the ambient illumination is estimated, the geometry distortion is corrected using (3) and the color distortion is corrected using (4) in a single pass.
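A minimal sketch of the gray-edge estimate of (5) and the diagonal correction of (4) is given below, assuming an RGB image. Using the Sobel gradient magnitude, omitting the Gaussian pre-smoothing of [14], and choosing C so that the illuminant estimate has unit norm are simplifying assumptions.

```python
import cv2
import numpy as np

def gray_edge_illuminant(img, p=5):
    """Estimate the ambient illumination [R_e, G_e, B_e] via Eq. (5)."""
    img = img.astype(np.float64)
    e = np.zeros(3)
    for c in range(3):
        gx = cv2.Sobel(img[..., c], cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(img[..., c], cv2.CV_64F, 0, 1, ksize=3)
        grad = np.sqrt(gx ** 2 + gy ** 2)          # gradient magnitude per channel
        e[c] = np.sum(grad ** p) ** (1.0 / p)      # Minkowski p-norm of the edges
    return e / np.linalg.norm(e)                   # normalization coefficient C

def correct_color(img, e):
    """Apply the diagonal (von Kries) correction of Eq. (4)."""
    scale = 1.0 / (np.sqrt(3.0) * e)
    corrected = img.astype(np.float64) * scale     # broadcasts over the color channels
    return np.clip(corrected, 0, 255).astype(np.uint8)
```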

2.3 Feature Extraction

Recently, CNNs have achieved impressive progress in many fields of computer vision, including image retrieval. The most direct approach to book page identification would be to collect a dataset of book page images and use it to train a CNN, so that book pages could be identified by the trained CNN in an end-to-end manner. However, training such a CNN requires a dataset containing millions of labelled book page images, and collecting and labelling such a large-scale dataset is time-consuming.

Many studies [8, 16,17,18] have shown qualitative evidence that the features emerging in the upper layers of CNNs trained for object classification can serve as good descriptors for other unrelated tasks such as image retrieval. Inspired by these works, pre-trained object classification CNNs are investigated for book page identification in this paper. After exploring the accuracy and speed trade-off of several pre-trained CNNs with different architectures [8, 16,17,18,19,20], the VGG Fast version (VGG-F) convolutional neural network [17] is adopted to identify book pages.

The architecture of the VGG-F CNN is illustrated in Fig. 4. It consists of 5 convolutional layers (conv1-5) and 3 fully-connected layers (full6-8). The conv1 layer employs 64 kernels of size 11 × 11 × 3 to filter the 224 × 224 × 3 color input images with a stride of 4 pixels. The conv2 layer takes as input the output of the conv1 layer and filters it with 256 kernels of size 5 × 5 × 64. The conv3, conv4 and conv5 layers all have 256 convolution kernels of size 3 × 3 × 256. A max-pooling unit follows the convolution unit in layers conv1, conv2 and conv5, but not in layers conv3 and conv4. Each of the 5 convolutional layers includes a Rectified Linear Unit (ReLU). The fully-connected layers full6 and full7 are regularized using dropout and have 4096 neurons each. The last layer, full8, is the output layer and acts as a multi-way soft-max object classifier. The ILSVRC dataset [21], which contains 1.2 million training images of 1000 object categories, is used to train the VGG-F CNN.

Fig. 4. The architecture of the CNN used in this paper.

The 4096-dimensional vector output by the full7 layer is extracted as the feature code for book page identification. To save computation, the full8 layer of the trained VGG-F CNN is removed when extracting feature codes from book page images. In the offline phase, all reference images in the book page database are resized to 224 × 224 pixels and fed into the trained CNN one by one to extract feature codes, which are stored in a matrix. In the online phase, a feature code is also extracted from the image output by the image correction module.
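The VGG-F weights are distributed in MatConvNet format rather than through a Python framework, so the sketch below substitutes torchvision's pretrained AlexNet, which has a comparable architecture and the same 4096-dimensional penultimate layer; this stand-in, the torchvision API (version 0.13 or later), and the preprocessing constants are assumptions, not the paper's exact setup.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Stand-in for VGG-F: remove the final classification layer (the full8 analogue)
# so the network outputs the 4096-D activation of its penultimate layer (full7 analogue).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_code(path):
    """Return the 4096-D feature code of one (corrected) page image."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0).numpy()
```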

2.4 Feature Compression

To identify a book page, the similarities between the feature code of the query image and the feature codes of all reference images need to be calculated. As each feature code extracted by the CNN is a 4096-dimensional vector, computing these similarities is inefficient. The most direct way to improve efficiency is to reduce the dimensionality of the feature codes. Babenko et al. [8] use Principal Component Analysis (PCA) to compress feature codes extracted by CNNs, and obtain good content-based image retrieval performance while greatly reducing the computational cost. Encouraged by this work, PCA is employed to compress the 4096-dimensional feature codes for speed.

Denote the feature code extracted from an image by a vector \( {\mathbf{X}}_{i} \). Suppose there are m reference images in the book page database; all of their feature codes form a 4096 × m matrix \( {\mathbf{M}} = [{\mathbf{X}}_{1} \, {\mathbf{X}}_{2} \cdots {\mathbf{X}}_{m}] \). The covariance matrix \( {\varvec{\Sigma}} \) of \( {\mathbf{M}} \) is then calculated, and the eigenvector matrix \( {\mathbf{U}} \) is obtained by the Singular Value Decomposition (SVD) of \( {\varvec{\Sigma}} \), i.e. \( {\mathbf{U}} = {\text{SVD}}({\varvec{\Sigma}}) \). After that, the compression matrix \( {\mathbf{U}}_{d} \) is formed by selecting the first d eigenvectors of \( {\mathbf{U}} \). Finally, a 4096-dimensional feature code \( {\mathbf{X}} \) can be compressed to d dimensions by:

$$ {\tilde{\mathbf{X}}} = {\mathbf{U}}_{d}^{T} {\mathbf{X}} $$
(6)

where \( {\tilde{\mathbf{X}}} \) is the compressed feature code.

In the offline phase, the feature codes of all the reference images are compressed using (6) and stored in a matrix. In the online phase, the feature code of the query image is also compressed to d dimensions.
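A sketch of the compression step, fitting the matrix U_d on the reference feature codes and applying Eq. (6), is given below. Mean-centering the codes before computing the covariance is standard PCA practice and is assumed here, although the text does not state it explicitly.

```python
import numpy as np

def fit_pca(codes, d=128):
    """codes: m x 4096 matrix of reference feature codes (one row per reference image)."""
    mean = codes.mean(axis=0)
    cov = np.cov(codes, rowvar=False)     # 4096 x 4096 covariance (np.cov centers the data)
    u, _, _ = np.linalg.svd(cov)          # columns of u are eigenvectors, sorted by eigenvalue
    return u[:, :d], mean                 # compression matrix U_d and the mean

def compress(code, u_d, mean):
    """Eq. (6): project a 4096-D feature code down to d dimensions."""
    return u_d.T @ (code - mean)
```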

2.5 Feature Matching

Two kinds of search methods, i.e. exhaustive search [8] and hashing-based search [22, 23], are often employed in the field of image retrieval. In most existing hashing-based methods, the feature codes of images are first encoded into binary hash codes by projection and quantization steps, and then the Hamming distance is used to calculate the distances between the query image and each reference image. However, hashing-based methods are not appropriate for applications with a relatively small number of reference images, as the computational cost of generating hash codes may outweigh the savings in distance computation. Another risk of hashing-based methods is that they sometimes produce sub-optimal binary hash codes, which degrade the retrieval performance [24].

Taking account of the trade-off between computational cost and retrieval accuracy, exhaustive search is adopted in this paper. The procedure of exhaustive search is straightforward: (1) Compute the similarity between the query image and each reference image. (2) Rank all the reference images according to their similarities to the query image. (3) Select the k top-ranking reference images as the retrieval results.

Cosine similarity is adopted to measure the similarity, as it experimentally achieves the best performance. Assume that \( {\tilde{\mathbf{X}}}_{i} \) is the compressed feature code extracted from the query image and \( {\tilde{\mathbf{X}}}_{j} \) is the compressed feature code extracted from a reference image; the similarity between these two images is then measured by:

$$ S_{i,j} = \frac{ {\tilde{\mathbf{X}}}_{i} {\tilde{\mathbf{X}}}_{j}^{T} }{ \sqrt{ {\tilde{\mathbf{X}}}_{i} {\tilde{\mathbf{X}}}_{i}^{T} } \sqrt{ {\tilde{\mathbf{X}}}_{j} {\tilde{\mathbf{X}}}_{j}^{T} } } $$
(7)

In (7), the term \( \sqrt{ {\tilde{\mathbf{X}}}_{i} {\tilde{\mathbf{X}}}_{i}^{T} } \) can be ignored, as dropping it does not change the ranking of the reference images, and the term \( \sqrt{ {\tilde{\mathbf{X}}}_{j} {\tilde{\mathbf{X}}}_{j}^{T} } \) can be computed offline. Thus, the similarity can be redefined to reduce the computational load while preserving the ranking:

$$ \tilde{S}_{i,j} = p_{j} \left( {\tilde{\mathbf{X}}}_{i} {\tilde{\mathbf{X}}}_{j}^{T} \right) $$
(8)

where \( p_{j} = 1/\sqrt{ {\tilde{\mathbf{X}}}_{j} {\tilde{\mathbf{X}}}_{j}^{T} } \) is computed offline. In this way, only (d + 1) multiplications and d additions are required to match each reference image in the online phase.
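A sketch of the exhaustive search with the precomputed normalization terms of Eq. (8); the function and variable names are illustrative.

```python
import numpy as np

def build_index(compressed_codes):
    """Offline: stack the compressed reference codes and precompute p_j = 1 / ||X_j||."""
    ref = np.asarray(compressed_codes, dtype=np.float64)   # m x d matrix of reference codes
    p = 1.0 / np.linalg.norm(ref, axis=1)                  # m precomputed normalization terms
    return ref, p

def top_k(query_code, ref, p, k=5):
    """Online: score every reference page with Eq. (8) and return the k best indices."""
    scores = p * (ref @ query_code)                        # (d + 1) multiplications per reference
    return np.argsort(-scores)[:k]
```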

3 Experiments

In this section, the proposed book page identification method is extensively evaluated. The experiments were conducted on a smart phone with an eight-core processor (4 × 2.3 GHz + 4 × 1.8 GHz) and 4 GB RAM, to validate that the proposed book page identification method and book-eResource association system can run on general consumer hardware. The core algorithms of the proposed book page identification method are implemented in optimized multithreaded C++ code.

To evaluate the proposed book page identification method and book-eResource association, a testing dataset covering 4568 book pages was collected. For each book page, a reference image was captured by a flatbed scanner, and 4 to 8 query images were taken arbitrarily with the cameras of different smart phones. As a result, the testing dataset contains 4568 reference images and 25112 query images. When taking the query images, factors including geometry distortion, color distortion, highlights, image blur, and cluttered backgrounds were taken into account to simulate severe usage situations.

The top-k hit rate is adopted as the metric for quantitative evaluation:

$$ \gamma_{k} = \frac{{N_{k} }}{N} $$
(9)

where N is the total number of tests, and \( N_{k} \) is the number of times that the correct reference image is among the first k reference images considered most probable by the book page identification method.
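The metric can be computed as in the sketch below, assuming a list of ground-truth reference indices and the corresponding ranked retrieval results; the argument names are illustrative.

```python
def hit_rate(ranked_results, ground_truth, k=5):
    """Eq. (9): fraction of queries whose correct page appears in the top k results.

    ranked_results : list of sequences of reference indices, best match first
    ground_truth   : list with the correct reference index for each query
    """
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_results, ground_truth))
    return hits / len(ground_truth)
```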

3.1 Overall Performance

Some exemplar results of the proposed book page identification method are illustrated in Fig. 5. The results in Fig. 5(a) and (b) show that the proposed method can discriminate similar book pages. In Fig. 5(c), the proposed method does not suffer from the image blur and the large highlighted area in the query image. The results in Fig. 5(d) show that the proposed method is robust to a “bad” image segmentation result caused by a cluttered background. The results in Fig. 5(e) demonstrate that the proposed method can tolerate an imperfectly corrected image. The results in Fig. 5(f) and (g) show that the proposed method is not sensitive to the orientation of the corrected images. In short, the proposed book page identification method achieves satisfactory performance under severe usage situations including cluttered backgrounds, image blur, highlights, geometry distortion and color distortion.

Fig. 5. Exemplar results of the proposed book page identification method. The correct answers are marked with red rectangles. (Color figure online)

The proposed book page identification method is compared with the state-of-the-art end-to-end CNN-based image retrieval method [8]. The CNNs used in the two methods are pre-trained on the same ILSVRC dataset [21], and both methods compress the feature codes to 128 dimensions in this experiment. The quantitative comparison results are shown in Table 1. The proposed book page identification method achieves a top-5 hit rate of 98.93%, while the end-to-end CNN [8] only achieves a top-5 hit rate of 55.49%.

Table 1. The hit rates of the proposed method and the end-to-end method

3.2 Effectiveness of the Proposed Pipeline

This experiment is designed to validate the effectiveness of the proposed pipeline. During this experiment, the image correction module is first removed, and then the image segmentation module is also removed from the pipeline. To avoid interference, the feature codes are not compressed in this experiment. The hit rates after removing these two modules are shown in Table 2. The experimental results show that the performance of book page identification degrades noticeably when the image correction and image segmentation modules are removed from the pipeline.

Table 2. The hit rates after removing modules from the pipeline

3.3 Performance of Different Feature Code Compression

This experiment evaluates the performance of the feature codes after PCA compression to different dimensions. The top-1 to top-5 hit rates for different PCA compression rates are illustrated in Fig. 6. The results demonstrate that the feature codes extracted by the CNN can be compressed to 128 dimensions with only a slight loss of performance.

Fig. 6. The hit rates of the proposed book page identification method using different PCA compression rates.

3.4 Computation Time

The computation time depends on the size of the query image and the dimensionality of the feature codes. When measuring the computation time in this experiment, the input query image is resized to 400 × 400 pixels for segmentation, the corrected image size is set to 224 × 224 pixels, and the feature codes are compressed to 128 dimensions. The average computation time of the entire pipeline for a query image is 430 milliseconds (ms). Within this time, image segmentation takes 46 ms, image correction takes 23 ms, feature code extraction takes 342 ms, feature code compression takes 2 ms, and feature matching takes 17 ms (when searching among 4568 reference images).

4 Conclusions

This paper has presented a CNN-based book page identification method for associating printed books with electronic resources. A pipeline has been proposed that makes a CNN trained for another, unrelated task usable for book page identification. The pipeline has five building blocks: an image segmentation module, an image correction module, a CNN-based feature extraction module, a feature compression module, and a feature matching module. Under this pipeline, a CNN trained on a task-unrelated dataset can extract effective and robust features for book page identification. The proposed book page identification method has achieved a top-5 hit rate of 98.93% on a challenging testing dataset.