Abstract
In this work, we propose a deep learning approach that combines visual appearance and textual information in a Convolutional Neural Network (CNN), with the aim of detecting regions of different textual categories. We define a novel visual representation of the semantic meaning of text that allows seamless integration into a standard CNN architecture. This representation, referred to as a text-map, is combined with the actual image to provide a much richer input to the network. Text-maps are colored with different intensities depending on the relevance of the words recognized over the image. More specifically, these words are first extracted using Optical Character Recognition (OCR) and then colored according to their probability of belonging to a textual category of interest. In this sense, the presented solution is especially relevant in the context of item coding for supermarket products, where different types of textual categories must be identified (e.g., ingredients or nutritional facts). We evaluated our approach on the proprietary item coding dataset of Nielsen Brandbank, which is composed of more than 10,000 training images and 2,000 test images. The reported results demonstrate that our method, based on both visual and textual data, outperforms state-of-the-art algorithms based only on appearance, such as standard Faster R-CNN, improving precision and recall by 42 and 33 points respectively.
1 Introduction
The rise of artificial intelligence has revolutionized the automation of complex industrial processes. Nowadays, manual work has been complemented or even replaced by solutions based on artificial intelligence in varied environments. The automated processing of images is a field that has grown exponentially in this sense, due to improvements in camera features and computer vision. Besides, the popularization of deep learning has helped to enhance the effectiveness of these systems, which can obtain very detailed information from images [31].
In this regard, the detection of regions of interest over images has greatly improved since the success of deep learning methods based on Convolutional Neural Networks (CNNs) [12], such as Faster R-CNN [25]. However, these techniques rely on visual cues to discriminate regions and they are not expected to achieve high performance when the appearance of the regions is not discriminative. A typical example of this is the detection of regions with different categories of text, where the most distinguishing information resides in the semantic meaning of the text.
Within this context, automated item coding is a specific case in which appearance is not enough. Item coding refers to the process of transcribing the characteristics of an item into a database. Automating this process is essential for increasing its efficiency, because manual item coding is a tedious and expensive task. For instance, a common application where automated item coding saves considerable manual work is the extraction of data from supermarket products. As shown in Fig. 1, different types of textual information are commonly detected over product packaging images, such as ingredients or nutritional facts. In this example, we can see how the textual information of the regions plays a differential role. Hence, the fusion of textual representations with appearance-based techniques is a promising alternative to traditional computer vision approaches.
In this paper, we present a novel method for transforming textual information into an image-based representation named text-map. This concept allows seamless integration into standard CNNs. In this sense, our approach does not modify the internal architecture, because it only extends the inputs of the network, which is an advantage for adaptability to different problems, including our use case of item coding in market studies. The novelty of our solution with respect to standard state-of-the-art methods based only on visual appearance is discussed in Sect. 2. Besides, we describe in detail our CNN proposal jointly with the associated concept of text-maps in Sect. 3. After that, we present several experiments and results carried out to validate our proposal in Sect. 4. Finally, we highlight the main conclusions derived from this work and some future research lines in Sect. 5.
2 Related Work
The state of the art in computer vision during the last decade cannot be understood without considering the great influence of deep learning. In this regard, CNNs have helped to disseminate the application of deep learning in image processing, where seminal works considerably improved the performance of previous image classification approaches [12]. Moreover, CNNs have exhibited their successful performance in other traditional problems related to image processing, such as image retrieval [1], semantic segmentation [26] or object detection [27].
In the case of market studies, computer vision has recently proven to be a useful tool for varied tasks focused on recognizing merchandise displayed on the shelves of a supermarket [22, 23]. Besides, the use case of item coding for nutrition information has also been studied in works such as [19, 32], where the authors define approaches for classifying images of different types of food and calculating their calories, instead of processing the text regions related to nutrition data over the product packages. In addition, there are some other works where packaging images are directly used [13, 28], but they only consider standard nutritional facts tables and apply traditional computer vision techniques, which are not as accurate as a CNN-based solution.
The extraction of textual data from product packaging images usually requires effective text processing techniques [14, 17, 29]. Text recognition based on Optical Character Recognition (OCR) is one of the most common methods for extracting words from images in a high variety of computer vision frameworks, such as the Google Cloud Vision API. Apart from this, the detection of regions with text is another problem typically studied in the state of the art. In this sense, recent approaches have built upon CNN-based techniques for differentiating text regions [18, 21]. However, these solutions only discern between regions with and without text; they do not distinguish among different categories of text regions, as required in our use case for ingredients and nutritional facts regions in item coding.
Object detection based on visual appearance is also a related field where CNNs have provided a notable improvement with respect to traditional techniques, as demonstrated by methods such as Faster R-CNN [25], Yolo [24] or SSD [16]. Unfortunately, these CNN architectures are based only on appearance information, so they are not accurate for detecting regions with different types of text, which is an essential requirement in problems such as item coding. In these cases, a CNN architecture that combines visual appearance and textual information for region detection can be an interesting solution.
In the literature of image recognition, it is common to see works that combine data from different sources, where each one provides complementary information to create richer input data. For instance, some examples are focused on the combination of RGB with depth maps [2, 5]. Besides, other approaches build upon the fusion of RGB with raw text data in varied use cases [3, 21]. Our proposal for item coding is based on the fusion of RGB image information with visual representations of the semantic meaning of the processed text.
3 Our Method for Combining Appearance and Textual Information in a CNN
The key to our approach for region detection among different textual categories resides in the generation of text-maps. Hence, it is essential to describe their main characteristics and how they are fed into a CNN jointly with the RGB image information.
3.1 Generation of Text-Maps for Injecting Textual Information
We define a text-map as a visual representation of the original image in which the zones that contain words are colored with different intensities depending on the relevance of each word. More specifically, the algorithm colors each text zone retrieved from a standard OCR engine according to the probability that its text belongs to a certain category of interest.
Within our item coding system, the categories of interest are ingredients and nutritional facts, as depicted in the examples presented in Fig. 2. For instance, a zone that contains the word milk will have a high probability of belonging to ingredients and will be brightly colored in its respective text-map. Similarly, the word protein will be brightly colored in the text-map corresponding to nutritional facts.
In the specific use case of our item coding approach, we use text-maps composed of 3 channels. Each channel encodes the relevance of each word based on different metrics, which are computed as follows:
- Red channel: A word detected by the OCR is brighter in this channel according to a rate defined as the number of occurrences of the word in the ground-truth regions divided by its total number of occurrences over the whole image. The computation of this channel is analogous for ingredients and nutritional facts.
- Green channel: It highlights punctuation signs such as commas or parentheses in the case of ingredients, because ingredients are usually separated by these symbols. For nutritional facts, this channel instead focuses on numerical values of nutrients and related symbols (e.g., %).
- Blue channel: The rates for this channel are computed using predefined dictionaries of ingredients and nutritional facts. These dictionaries are generated beforehand from the ground-truth data, which contains sets of words that typically appear on products as ingredients or nutritional facts. To obtain the rate for a specific word detected by the OCR, the Levenshtein distance [15] is computed between the analyzed word and the words in the respective dictionary.
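The channel computation above can be sketched as follows. This is a minimal illustration of the dictionary-based (blue) channel only; the function names, box format and the exact rate formula are assumptions, since the OCR engine and dictionaries used in our system are proprietary:

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between strings a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dictionary_rate(word: str, dictionary) -> float:
    # Blue-channel rate: similarity to the closest dictionary entry, in [0, 1].
    best = min(levenshtein(word.lower(), w) for w in dictionary)
    return max(0.0, 1.0 - best / max(len(word), 1))

def build_text_map(shape, ocr_words, dictionary):
    # shape: (height, width) of the original image.
    # ocr_words: list of (word, (x0, y0, x1, y1)) boxes returned by an OCR engine.
    tmap = np.zeros((shape[0], shape[1], 3), dtype=np.float32)
    for word, (x0, y0, x1, y1) in ocr_words:
        # Only the blue (dictionary-based) channel is sketched here; the red
        # and green channels would fill indices 0 and 1 with their own rates.
        tmap[y0:y1, x0:x1, 2] = dictionary_rate(word, dictionary)
    return tmap
```

A word such as "milk" matched exactly against an ingredients dictionary receives rate 1.0 and is brightly colored over its OCR box, while unrelated words fade towards zero.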
3.2 Design of the Proposed CNN Approach
After generating a text-map, it is fed into the applied CNN architecture jointly with the original RGB image to detect its specific text regions of interest. Typically, standard CNN architectures receive 3 RGB channels as input. In this regard, the architecture proposed in our item coding system receives the 3 RGB channels plus the 3 channels of the text-map, i.e., 6 channels in total, as represented in Fig. 3 for the nutritional facts case (the ingredients case is analogous).
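Stacking the two inputs can be sketched as below. Since only the first convolutional layer is affected by the change from 3 to 6 input channels, its pre-trained filters can be extended; how the extra filter weights are initialized is not detailed in the paper, so the random initialization here is an assumption:

```python
import numpy as np

def make_network_input(rgb, text_map):
    # rgb and text_map: (H, W, 3) float arrays; stacking them along the
    # channel axis yields the 6-channel input described above.
    assert rgb.shape == text_map.shape
    return np.concatenate([rgb, text_map], axis=-1)

def extend_first_conv(w_rgb, rng=None):
    # w_rgb: pre-trained first-layer filters of shape (k, k, 3, out_channels),
    # e.g. from ImageNet weights. The text-map half of the filters is
    # initialized randomly here (an illustrative assumption).
    rng = np.random.default_rng(0) if rng is None else rng
    w_text = rng.normal(0.0, float(w_rgb.std()) or 0.01,
                        size=w_rgb.shape).astype(w_rgb.dtype)
    return np.concatenate([w_rgb, w_text], axis=2)  # (k, k, 6, out_channels)
```

The rest of the backbone (ResNet in our case) is unchanged, which is what makes the text-map representation easy to plug into standard architectures.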
The core of our CNN model can use any standard backbone network. In the case of our item coding approach, ResNet [10] exhibited a satisfactory performance as CNN backbone for the applied architecture. Obviously, the CNN model must be trained on previously labeled data to perform properly. In this sense, the training of our models starts from pre-trained weights obtained from ImageNet [4], which is a robust dataset commonly applied in some of the most representative proposals in the state of the art [12]. Besides, several works have demonstrated the great transferability of CNN features across different datasets and problems, as studied in [30]. This transferability is also expected for learning from text-maps, because mid-level image representations learned on large-scale datasets can be efficiently transferred to other visual recognition tasks, as evidenced in [20].
Additionally, it must be noted that after predicting the text regions of interest using the described CNN proposal, our system easily obtains the resultant ingredients and nutritional facts by post-processing the OCR output previously computed for the text-maps within these predicted regions. However, this step is out of the scope of this paper, and we prefer to focus on region detection among different text categories, which is our main contribution to the topic of language and vision fusion.
4 Experiments and Results
In this section, we present the main experiments carried out to evaluate our region detection method based on combined visual and textual information. The goal of these tests is to validate the improvements in performance provided by our CNN model with respect to state-of-the-art approaches for region detection based only on visual appearance. In this case, the reported experiments are focused on the item coding system for supermarket products described along this paper.
4.1 Dataset for Item Coding
The acquisition of images and manually annotated data for evaluating our automated item coding system is a costly process. Because of this, there are no large datasets publicly available for these tasks. We found recent public datasets developed for the identification of merchandise in supermarkets, such as the D2S dataset [6]. Unfortunately, this dataset is not suitable for our item coding tests because of the long distance between the camera and the products, so ingredients and nutritional facts cannot be visually distinguished and no labeling is available for them. Therefore, we have used our own labeled data from Nielsen Brandbank to train and evaluate our automated item coding solution. This dataset is composed of more than 10,000 annotated images for training and 2,000 images for validation and test.
4.2 Training and Hyperparameters
To train our CNN model, we set up an architecture based on Faster R-CNN with ResNet-101 as backbone, using as input data the combination of image and text-map channels. The following main hyperparameters are applied during 10 epochs: learning rate = \(1\cdot 10^{-5}\), weight decay = \(1\cdot 10^{-6}\), dropout keep prob. = 0.8, batch size = 1, Adam optimizer [11] and Xavier initialization [8]. We compare our solution against a model trained with a standard Faster R-CNN based only on appearance. Analogous hyperparameters are used in this case to perform a fair comparison with our approach.
4.3 Quantitative Results
Our experiments are mainly focused on precision and recall results for ingredients and nutritional facts detection. In Table 1, it can be seen that our CNN model based on text-maps clearly outperforms standard Faster R-CNN. Concretely, our approach increases precision and recall by 42 and 33 points respectively. Besides, we improve total accuracy by 38 points, where total accuracy is calculated as the number of true positives divided by the sum of true positives, false positives and false negatives. According to these results, the improvements given by our solution for region detection among different textual categories are demonstrated.
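The metrics above can be written out explicitly; the helper name is illustrative, but the formulas follow the definitions used in the paper:

```python
def detection_metrics(tp, fp, fn):
    # Precision, recall, and the paper's "total accuracy",
    # TP / (TP + FP + FN), i.e. the Jaccard index over detection counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    total_accuracy = tp / (tp + fp + fn)
    return precision, recall, total_accuracy
```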
It is important to point out that the results reported in Table 1 are calculated using a confidence threshold of 0.7 to discern between valid and invalid predictions. In our final item coding system, we also use this value to obtain an adequate precision with a minimum impact on recall; the goal is a low number of false positive detections without increasing the amount of false negatives. Within this context, we used the precision and recall curves depicted in Fig. 4 to choose 0.7 as our preferred value for the confidence threshold. In these curves, precision and recall are computed at different confidence thresholds between 0 and 1, with the aim of obtaining a proper perspective for fitting the confidence threshold in the final system. As can be seen, the curves are plotted for standard Faster R-CNN alongside our CNN model based on text-maps, and the precision and recall curves of our approach reach higher values across the different confidence thresholds for both the ingredients and nutritional facts cases.
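A threshold sweep of this kind can be sketched as follows; it assumes detections have already been matched against ground-truth regions (names and the matching convention are illustrative):

```python
def pr_curve(scores, matches, num_gt, thresholds):
    # scores: confidence of each detection; matches: True if the detection
    # matches a ground-truth region; num_gt: total ground-truth regions.
    curve = []
    for t in thresholds:
        kept = [m for s, m in zip(scores, matches) if s >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / num_gt
        curve.append((t, precision, recall))
    return curve
```

Plotting precision and recall against the threshold, as in Fig. 4, then makes the trade-off at each operating point visible.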
4.4 Qualitative Results
Apart from quantitative results, we also depict some qualitative visual results in Fig. 5. In this example, many false positives are wrongly detected by the standard Faster R-CNN model due to the similarity of the visual appearance of the different types of text. However, our CNN model based on visual and textual information correctly predicts the ingredients region without detecting false positives. These results again evidence the suitability of our CNN solution for item coding problems.
5 Conclusions and Future Works
The application of textual information as a mechanism to structure and reason about visual perception has been demonstrated throughout this paper, where we have described how to take advantage of visual representations derived from the semantic meaning of text.
From this point of view, our innovative CNN model enriched with text-maps has evidenced its effectiveness in detecting different categories of text, especially with respect to state-of-the-art solutions based only on visual appearance (e.g., Faster R-CNN). We presented results associated with our specific use case of item coding in market studies, but the concept of text-maps is applicable to other problems focused on the detection of regions with different textual categories.
In future work, we plan to extend our system to detect other text regions of interest for item coding, such as storage information or cooking instructions, among others. Moreover, the proposed technique for generating text-maps and the number of channels could be adapted to other text region detection challenges in future research. Within this context, single shot scene text retrieval [9] or visual question answering [7] are examples of active research topics that could benefit from a model based on text-maps.
References
Arroyo, R., Alcantarilla, P.F., Bergasa, L.M., Romera, E.: Fusion and binarization of CNN features for robust topological localization across seasons. In: International Conference on Intelligent Robots and Systems (IROS), pp. 4656–4663 (2016)
Arroyo, R., Alcantarilla, P.F., Bergasa, L.M., Yebes, J.J., Bronte, S.: Fast and effective visual place recognition using binary codes and disparity information. In: International Conference on Intelligent Robots and Systems (IROS), pp. 3089–3094 (2014)
Bai, X., Yang, M., Lyu, P., Xu, Y., Luo, J.: Integrating scene text and visual appearance for fine-grained image classification. IEEE Access 6, 66322–66335 (2018)
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: a large-scale hierarchical image database. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009)
Eitel, A., Springenberg, J.T., Spinello, L., Riedmiller, M.A., Burgard, W.: Multimodal deep learning for robust RGB-D object recognition. In: International Conference on Intelligent Robots and Systems (IROS), pp. 681–687 (2015)
Follmann, P., Bottger, T., Hartinger, P., Konig, R., Ulrich, M.: MVTec D2S: densely segmented supermarket dataset. In: European Conference on Computer Vision (ECCV), pp. 581–597 (2018)
Gao, P., et al.: Question-guided hybrid convolution for visual question answering. In: European Conference on Computer Vision (ECCV), pp. 485–501 (2018)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256 (2010)
Gomez, L., Mafla, A., Rusinol, M., Karatzas, D.: Single shot scene text retrieval. In: European Conference on Computer Vision (ECCV), pp. 728–744 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference for Learning Representations (ICLR), pp. 1–15 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems (NIPS), pp. 1106–1114 (2012)
Kulyukin, V., Kutiyanawala, A., Zaman, T., Clyde, S.: Vision-based localization and text chunking of nutrition fact tables on android smartphones. In: International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 314–320 (2013)
Lee, C., Osindero, S.: Recursive recurrent nets with attention modeling for OCR in the wild. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2231–2239 (2016)
Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. J. Sov. Phys. Dokl. 10(8), 707–710 (1966)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Liu, Y., Wang, Z., Jin, H., Wassell, I.: Synthetically supervised feature learning for scene text recognition. In: European Conference on Computer Vision (ECCV), pp. 449–465 (2018)
Lyu, P., Liao, M., Yao, C., Wu, W., Bai, X.: Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In: European Conference on Computer Vision (ECCV), pp. 71–97 (2018)
Meyers, A., et al.: Im2Calories: towards an automated mobile vision food diary. In: International Conference on Computer Vision (ICCV), pp. 1233–1241 (2015)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1717–1724 (2014)
Prasad, S., Kong, A.: Using object information for spotting text. In: European Conference on Computer Vision (ECCV), pp. 559–576 (2018)
Qiao, S., Shen, W., Qiu, W., Liu, C., Yuille, A.: ScaleNet: guiding object proposal generation in supermarkets and beyond. In: International Conference on Computer Vision (ICCV), pp. 1791–1800 (2017)
Ray, A., Kumar, N., Shaw, A., Mukherjee, D.P.: U-PC: unsupervised planogram compliance. In: European Conference on Computer Vision (ECCV), pp. 598–613 (2018)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems (NIPS), pp. 91–99 (2015)
Saleh, F.S., Aliakbarian, M.S., Salzmann, M., Petersson, L., Alvarez, J.M.: Effective use of synthetic data for urban scene semantic segmentation. In: European Conference on Computer Vision (ECCV), pp. 86–103 (2018)
Sundermeyer, M., Marton, Z., Durner, M., Brucker, M., Triebel, R.: Implicit 3D orientation learning for 6D object detection from RGB images. In: European Conference on Computer Vision (ECCV), pp. 712–729 (2018)
Gundimeda, V., Murali, R.S., Joseph, R., Naresh Babu, N.T.: An automated computer vision system for extraction of retail food product metadata. In: Bapi, R.S., Rao, K.S., Prasad, M.V.N.K. (eds.) First International Conference on Artificial Intelligence and Cognitive Computing. AISC, vol. 815, pp. 199–216. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-1580-0_20
Wigington, C., Tensmeyer, C., Davis, B., Barrett, W., Price, B., Cohen, S.: Start, follow, read: end-to-end full-page handwriting recognition. In: European Conference on Computer Vision (ECCV), pp. 372–388 (2018)
Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: International Conference on Neural Information Processing Systems (NIPS), pp. 3320–3328 (2014)
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
Zhang, W., Yu, Q., Siddiquie, B., Divakaran, A., Sawhney, H.S.: Snap-n-Eat: food recognition and nutrition estimation on a smartphone. J. Diab. Sci. Technol. 9(3), 525–533 (2015)
© 2019 Springer Nature Switzerland AG
Arroyo, R., Tovar, J., Delgado, F.J., Almazán, E.J., Serrador, D.G., Hurtado, A. (2019). Deep Learning of Visual and Textual Data for Region Detection Applied to Item Coding. In: Morales, A., Fierrez, J., Sánchez, J., Ribeiro, B. (eds) Pattern Recognition and Image Analysis. IbPRIA 2019. Lecture Notes in Computer Science(), vol 11867. Springer, Cham. https://doi.org/10.1007/978-3-030-31332-6_29