
1 Introduction

The rise of artificial intelligence has revolutionized the automation of complex industrial processes. Nowadays, manual work is complemented or even replaced by solutions based on artificial intelligence in a wide range of environments. Automated image processing has grown exponentially in this sense, driven by improvements in camera features and computer vision. Besides, the popularization of deep learning has helped to enhance the effectiveness of these systems, which can obtain very detailed information from images [31].

In this regard, the detection of regions of interest over images has greatly improved since the success of deep learning methods based on Convolutional Neural Networks (CNNs) [12], such as Faster R-CNN [25]. However, these techniques rely on visual cues to discriminate regions and they are not expected to achieve high performance when the appearance of the regions is not discriminative. A typical example of this is the detection of regions with different categories of text, where the most distinguishing information resides in the semantic meaning of the text.

Fig. 1. A visual overview of item coding based on ingredients and nutritional facts. The diagram depicts a processed product packaging image jointly with the nutrition information to be analysed.

Within this context, automated item coding is a specific case in which appearance is not enough. Item coding refers to the process of transcribing the characteristics of an item into a database. Automating this process is essential for increasing its efficiency, because manual item coding is a tedious and expensive task. For instance, a common application where automated item coding saves considerable manual work is the extraction of data from supermarket products. As shown in Fig. 1, different types of textual information are commonly detected over the product packaging images, such as ingredients or nutritional facts. In this example, we can see how the textual content of the regions plays a differential role. Thus, the fusion of textual representations with appearance-based techniques is a promising alternative to traditional computer vision approaches.

In this paper, we present a novel method for transforming textual information into an image-based representation named text-map. This concept allows a seamless integration into standard CNNs. Our approach does not modify the internal architecture, because it only extends the inputs of the network, which is an advantage for adapting it to different problems, including our use case of item coding in market studies. The novelty of our solution with respect to standard state-of-the-art methods based only on visual appearance is discussed in Sect. 2. Besides, we describe in detail our CNN proposal jointly with the associated concept of text-maps in Sect. 3. After that, we present several experiments and results that validate our proposal in Sect. 4. Finally, we highlight the main conclusions derived from this work and some future research lines in Sect. 5.

2 Related Work

The state of the art in computer vision during the last decade cannot be understood without considering the great influence of deep learning. In this regard, CNNs have helped to disseminate the application of deep learning in image processing, where seminal works considerably improved the performance of previous image classification approaches [12]. Moreover, CNNs have exhibited successful performance in other traditional problems related to image processing, such as image retrieval [1], semantic segmentation [26] or object detection [27].

In the case of market studies, computer vision has recently been demonstrated to be a useful tool for varied tasks focused on recognizing merchandise displayed on the shelves of a supermarket [22, 23]. Besides, the use case of item coding for nutrition information has also been studied in works such as [19, 32], where the authors define approaches for classifying images of different types of food and calculating their calories, instead of processing the text regions related to nutrition data over the product packages. In addition, there are some other works where packaging images are directly used [13, 28], but they only consider standard nutritional facts tables and apply traditional computer vision techniques, which are not as accurate as a CNN-based solution.

The extraction of textual data from product packaging images usually requires effective text processing techniques [14, 17, 29]. Text recognition based on Optical Character Recognition (OCR) is one of the most common methods for extracting words from images in a high variety of computer vision frameworks, such as the Google Cloud Vision API. Apart from this, the detection of regions with text is another problem typically studied in the state of the art. In this sense, recent approaches have built upon CNN-based techniques for differentiating text regions [18, 21]. However, these solutions only discern between regions with and without text; they do not distinguish among different categories of text regions, as required in our use case for ingredients and nutritional facts regions in item coding.

Object detection based on visual appearance is also a related field where CNNs have provided a notable improvement with respect to traditional techniques, as demonstrated by methods such as Faster R-CNN [25], YOLO [24] or SSD [16]. Unfortunately, these CNN architectures are based only on appearance information, so they are not accurate for detecting regions with different types of text, which is an essential requirement in problems such as item coding. In these cases, a CNN architecture that combines visual appearance and textual information for region detection can be an interesting solution.

In the image recognition literature, it is common to see works that combine data from different sources, where each one provides complementary information to create a richer input. For instance, some examples are focused on the combination of RGB with depth maps [2, 5]. Besides, other approaches build upon the fusion of RGB with raw text data in varied use cases [3, 21]. Our proposal for item coding is based on the fusion of RGB image information with visual representations of the semantic meaning of the processed text.

3 Our Method for Combining Appearance and Textual Information in a CNN

The key to our approach for region detection among different textual categories resides in the generation of text-maps. Then, it is essential to describe their main characteristics and how they are fed into a CNN jointly with the RGB image information.

3.1 Generation of Text-Maps for Injecting Textual Information

We define a text-map as a visual representation of the original image where the zones that contain words are colored with different intensities depending on the relevance of each word. More specifically, the algorithm colors each text zone retrieved from a standard OCR engine according to the probability that the text belongs to a certain category of interest.

Within our item coding system, the categories of interest are ingredients and nutritional facts, as depicted in the examples presented in Fig. 2. For instance, a zone that contains the word milk will have a high probability of belonging to ingredients and will be brightly colored in its respective text-map. Similarly, the word protein will be brightly colored in the text-map corresponding to nutritional facts.

Fig. 2. Stages for obtaining the text-maps required for detecting the regions of interest. (1) Input images. (2) OCR extraction. (3) Text-maps extraction. (4) Output detection. (Color figure online)

In the specific use case of our item coding approach, we use text-maps composed of 3 channels. Each channel encodes the relevance of each word based on a different metric, computed as follows (a code sketch of this rasterization is given after the list):

  • Red channel: A word detected by the OCR is colored more brightly in this channel according to a rate, defined as the number of occurrences of the word in the ground-truth regions divided by its total number of occurrences over the whole image. The computation of this channel is analogous for ingredients and nutritional facts.

  • Green channel: It highlights punctuation signs such as commas or parentheses in the case of ingredients, because ingredients are usually separated by these symbols. For nutritional facts, this channel focuses on the numerical values of nutrients and related symbols (e.g., %).

  • Blue channel: The rates for this channel are computed using predefined dictionaries of ingredients and nutritional facts. These dictionaries are previously generated from the ground-truth data and contain sets of words that typically appear in products as ingredients or nutritional facts. To obtain the rate for a specific word detected by the OCR, the Levenshtein distance [15] is computed between the analyzed word and the words in the respective dictionary.
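
A minimal sketch of this rasterization is shown below, assuming the OCR engine returns each word with a pixel bounding box and that three per-channel scoring functions like the ones described above are available; all helper names are illustrative and do not correspond to our exact implementation.

```python
import difflib
import numpy as np

def build_text_map(image_shape, ocr_words, score_fns):
    """Rasterize a 3-channel text-map for one category (e.g., ingredients).

    ocr_words : list of (text, (x0, y0, x1, y1)) pairs from any OCR engine.
    score_fns : three callables (one per channel) mapping a word to a
                relevance rate in [0, 1], e.g. a ground-truth occurrence
                rate, a punctuation/number indicator and a dictionary rate.
    """
    h, w = image_shape[:2]
    text_map = np.zeros((h, w, 3), dtype=np.uint8)
    for text, (x0, y0, x1, y1) in ocr_words:
        for c, score_fn in enumerate(score_fns):
            intensity = int(255 * score_fn(text))  # brighter = more relevant
            text_map[y0:y1, x0:x1, c] = np.maximum(text_map[y0:y1, x0:x1, c],
                                                   intensity)
    return text_map

def dictionary_rate(word, dictionary):
    """Illustrative blue-channel scorer: normalized similarity to a dictionary,
    used here as a stand-in for the Levenshtein-distance rate described above."""
    return max(difflib.SequenceMatcher(None, word.lower(), entry).ratio()
               for entry in dictionary)
```

The same function can be reused with ingredient-oriented or nutrition-oriented scoring functions to obtain the text-map of each category of interest.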

Fig. 3. Proposed CNN with RGB channels plus text-map channels. The presented example is focused on nutritional facts, but it is analogous for ingredients. (Color figure online)

3.2 Design of the Proposed CNN Approach

After generating a text-map, it is injected into the applied CNN architecture jointly with the original RGB image to detect its specific text regions of interest. Typically, standard CNN architectures receive 3 RGB channels as input. In this regard, the architecture proposed in our item coding system receives the 3 RGB channels plus 3 text-map channels, i.e., 6 channels in total, as represented in Fig. 3 for the nutritional facts case (it is analogous for ingredients).
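
As an illustration, the 6-channel input can be assembled by simply stacking the image and its text-map along the channel axis; the sketch below assumes a PyTorch-style NCHW tensor layout and is not tied to a particular detection framework.

```python
import numpy as np
import torch

def make_network_input(rgb_image, text_map):
    """Stack an HxWx3 RGB image and its HxWx3 text-map into a 1x6xHxW tensor."""
    stacked = np.concatenate([rgb_image, text_map], axis=2).astype(np.float32) / 255.0
    return torch.from_numpy(stacked).permute(2, 0, 1).unsqueeze(0)  # NCHW, batch of one
```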

The core of our CNN model can use any standard backbone network. In the case of our item coding approach, ResNet [10] exhibited a satisfactory performance as the CNN backbone for the applied architecture. Obviously, the CNN model must be trained with previously labeled data to perform properly. In this sense, the training of our models starts from pre-trained weights obtained from ImageNet [4], which is a robust dataset commonly applied in some of the most representative proposals in the state of the art [12]. Besides, several works have demonstrated the great transferability of CNN features across different datasets and problems, as studied in [30]. This transferability is also expected for learning from text-map information, because mid-level image representations learned on large-scale datasets can be efficiently transferred to other visual recognition tasks, as evidenced in [20].
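
One common way to reuse ImageNet pre-trained weights with a 6-channel input is to widen the first convolution of the backbone and initialize the extra text-map filters from the pre-trained RGB ones. The sketch below shows this idea for a torchvision ResNet-101; it is an assumption about how such an adaptation could be done, not a description of our exact training code, and the integration of the backbone into the full detector is omitted.

```python
import torch
import torch.nn as nn
import torchvision

def resnet101_with_text_map_input():
    """ResNet-101 backbone whose first conv accepts RGB + text-map channels."""
    model = torchvision.models.resnet101(weights="IMAGENET1K_V1")
    old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
    new_conv = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight                            # keep RGB filters
        new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)  # init text-map filters
    model.conv1 = new_conv
    return model
```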

Additionally, it must be noted that, after predicting the text regions of interest with the described CNN proposal, our system easily obtains the resulting ingredients and nutritional facts by post-processing the OCR previously computed for the text-maps within these predicted regions. However, this step is out of the scope of this paper, and we prefer to focus on region detection among different text categories, which is our main contribution to the topic of language and vision fusion.

4 Experiments and Results

In this section, we present the main experiments carried out to evaluate our region detection method based on combined visual and textual information. The goal of these tests is to validate the improvements in performance provided by our CNN model with respect to state-of-the-art approaches for region detection based only on visual appearance. In this case, the reported experiments are focused on the item coding system for supermarket products described throughout this paper.

4.1 Dataset for Item Coding

The acquisition of images and manually annotated data for evaluating our automated item coding system is a costly process. Due to this, there are no large datasets publicly available for these tasks. We found recent public datasets developed for the identification of merchandise in supermarkets, such as the D2S dataset [6]. Unfortunately, this dataset is not suitable for our item coding tests because of the long distance between camera and products, so ingredients and nutritional facts cannot be visually distinguished and no labeling is available for them. Therefore, we have used our own labeled data from Nielsen Brandbank to train and evaluate our automated item coding solution. This dataset is composed of more than 10,000 annotated images for training and 2,000 images for validation and test.

4.2 Training and Hyperparameters

To train our CNN model, we set up an architecture based on Faster R-CNN with ResNet-101 as backbone, using the combination of image and text-map channels as input data. The following main hyperparameters are applied during 10 epochs: learning rate = \(1\cdot 10^{-5}\), weight decay = \(1\cdot 10^{-6}\), dropout keep prob. = 0.8, batch size = 1, Adam optimizer [11] and Xavier initialization [8]. We compare our solution against a model trained with a standard Faster R-CNN based only on appearance. Analogous hyperparameters are used in this case to perform a fair comparison with respect to our approach.
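
For reference, a minimal training loop with the optimizer-related settings above could look as follows, assuming a torchvision-style detector that returns a dictionary of losses in training mode; the dropout and Xavier initialization settings belong to the detector definition and are omitted here.

```python
import torch

def train_detector(model, train_loader, num_epochs=10):
    """Training loop with the optimizer hyperparameters reported above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-6)
    model.train()
    for _ in range(num_epochs):
        for images, targets in train_loader:    # batch size = 1
            loss_dict = model(images, targets)  # detector returns its losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```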

4.3 Quantitative Results

Our experiments are mainly focused on precision and recall results for ingredients and nutritional facts detection. In Table 1, it can be seen how our CNN model based on text-maps clearly outperforms standard Faster R-CNN. Concretely, our approach increases precision and recall by 42 and 33 points, respectively. Besides, we improve total accuracy by 38 points, where total accuracy is calculated as the number of true positives divided by the sum of true positives, false positives and false negatives. According to these results, the improvements given by our solution for region detection among different textual categories are demonstrated.
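
These metrics can be computed directly from the detection counts; the small helper below makes the definitions explicit (it is illustrative, not the evaluation code used for Table 1).

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and the total accuracy reported in Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    total_accuracy = tp / (tp + fp + fn)  # TP / (TP + FP + FN)
    return precision, recall, total_accuracy
```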

Table 1. Standard Faster R-CNN vs. our model with text-maps. A confidence threshold of 0.7 is considered in these results, which is the value used in our final system.
Fig. 4. Curves of precision and recall values over different confidence thresholds in the evaluations carried out for standard Faster R-CNN (3 channels) vs. our model with text-maps (6 channels). We present results for ingredients and nutritional facts.

It is important to point out that the results reported in Table 1 are calculated by considering a confidence threshold of 0.7 to discern between valid and invalid predictions. In our final item coding system, we also use this value to obtain an adequate precision with a minimum impact on recall. The goal is to have a low number of false positive detections without increasing the amount of false negatives. Within this context, we used the precision and recall curves depicted in Fig. 4 to choose 0.7 as our preferred value for the confidence threshold. In these curves, precision and recall are computed using different confidence thresholds between 0 and 1, with the aim of obtaining a proper perspective to fit the confidence threshold in the final system. As can be seen, these curves compare standard Faster R-CNN with our CNN model based on text-maps. In this regard, the precision and recall curves of our approach reach higher values along the different confidence thresholds for both the ingredients and the nutritional facts cases.
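
Such curves can be produced by sweeping the confidence threshold over the set of detections; the sketch below assumes that each detection has already been matched against the ground truth (e.g., by IoU) and is only meant to illustrate the procedure.

```python
import numpy as np

def precision_recall_curve(detections, num_gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Precision/recall at several confidence thresholds, as in Fig. 4.

    detections : list of (confidence, is_true_positive) pairs.
    num_gt     : number of ground-truth regions of the evaluated category.
    """
    curve = []
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_gt - tp
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / (tp + fn) if num_gt > 0 else 1.0
        curve.append((t, precision, recall))
    return curve
```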

4.4 Qualitative Results

Apart from the quantitative results, we also depict some qualitative visual results in Fig. 5. In this example, many false positives are wrongly detected by the standard Faster R-CNN model due to the similarity of the visual appearance of the different types of text. However, our CNN model based on visual and textual information is able to correctly predict the ingredients region without producing false positives. These results again evidence the suitability of our CNN solution for item coding problems.

Fig. 5. A visual example of the detection of ingredients regions using a model trained with a standard Faster R-CNN vs. our approach. False negatives are marked in fuchsia, false positives in blue and true positives in green. The confidences given by the networks are depicted in the upper-left corner of the detected bounding boxes. (Color figure online)

5 Conclusions and Future Works

The application of textual information as a mechanism to structure and reason about visual perception has been demonstrated throughout this paper, where we have described how to take advantage of visual representations derived from the semantic meaning of text.

From this point of view, our innovative CNN model enriched with text-maps has evidenced its effectiveness in detecting different categories of text, especially with respect to state-of-the-art solutions based only on visual appearance (e.g., Faster R-CNN). We presented results associated with our specific use case of item coding in market studies, but the concept of text-maps is applicable to other problems focused on the detection of regions with different textual categories.

In future work, we plan to extend our system to detect other text regions of interest for item coding, such as storage information or cooking instructions, among others. Moreover, the proposed technique for generating text-maps and the number of channels could be adapted to other text region detection challenges in future research. Within this context, single shot scene text retrieval [9] or visual question answering [7] are some examples of active research topics that could benefit from a model based on text-maps.