1 Introduction

Text plays an important role in our lives. Imagine a world without text, in which, for example, no books, newspapers, signboards, restaurant menus, text messages on smartphones or program source code exist, or they exist in a completely different form; we would rediscover not only the necessity of text but also the importance of reading and interpreting it. Although only human beings have been endowed with the ability to read and interpret text, researchers have long struggled to enable computers to read text as well.

Focusing on camera-captured text and scene text, some pioneering works were presented in the 1990s [21]. Since then, increasing attention has been paid to recognizing scene text. Table 1 shows the remarkable recent progress of scene text recognition techniques. In the table, most of the reported accuracies of the latest methods exceed 90 % on major benchmark datasets. However, does this mean these methods are powerful enough to read the variety of texts in the real environment? Many people would agree that the answer is no. Text images contained in these datasets are far easier than those encountered in the real environment. In the real environment, scene text is more diverse; for example, texts with various designs, styles and shapes appear under many different illuminations and are captured from a variety of angles and distances. In this regard, there is a big gap between the scene texts contained in these existing datasets and those observed in the real environment.

Table 1. Recent improvement of recognition performance in scene text recognition tasks. Based on Table 1 of [1], this table summarizes the recognition accuracies (in percent) of recent methods on representative benchmark datasets in chronological order. “50,” “1k” and “50k” represent lexicon sizes. “Full” and “None” represent recognition with all per-image lexicon words and without a lexicon, respectively.

In this paper, to fill this gap, we present a new dataset named Downtown Osaka Scene Text Dataset (DOST dataset for short) that preserves scene texts observed in the real environment as they are. The dataset contains videos (sequential images) captured in shopping streets in downtown Osaka with an omnidirectional camera equipped with five horizontal cameras and one upward camera, shown in Fig. 1. In total, 30 image sequences (five shopping streets times six cameras) consisting of 783,150 images were captured. Among them, 27 image sequences consisting of 32,147 images were manually ground truthed. As a result, 935,601 text regions were obtained, consisting of 797,919 legible and 137,682 illegible text regions. The legible regions contained 2,808,340 characters. Since the images were captured in Japan, they contain many Japanese texts. However, out of all 797,919 legible text regions, 283,940 consist of only alphabets and digits. These legible text regions contained 1,138,091 non-Japanese characters. Because of the features mentioned above, we can say that DOST dataset preserves scene text in the wild. Figures 3, 4, 5 and 6 show examples of ground truthed captured images and segmented words contained in DOST dataset. Since the image sequences were captured with an omnidirectional camera and are continuous in time, a single word was captured many times from multiple view angles. The DOST dataset was evaluated using two existing text detection methods and one powerful commercial end-to-end scene text recognition system to measure its difficulty and quality in comparison with existing datasets.

Fig. 1.

Point Grey Ladybug3, an omnidirectional camera, captures six images at once with its five horizontal cameras and one upward camera. A panoramic view can be created from the six images.

Fig. 2.

Equipment used for capturing.

Fig. 3.

Samples of ground truthed captured images. The four images on this page are selected from the ground truthed images. Bounding boxes represent word regions, and the texts next to the bounding boxes are the text annotations.

Fig. 4.

Samples of ground truthed captured images (continued). The four images on this page are selected from the ground truthed images. Bounding boxes represent word regions, and the texts next to the bounding boxes are the text annotations.

Fig. 5.

Samples of ground truthed captured images (continued). The four images on this page are selected from the ground truthed images. Bounding boxes represent word regions, and the texts next to the bounding boxes are the text annotations.

Fig. 6.

Samples of segmented words contained in DOST dataset. An underscore (“_”) indicates one or more partially occluded characters.

2 Unique Features of DOST Dataset

Features of existing datasets are summarized in Table 2. The major differences between DOST dataset and existing datasets include the following.

  1.

    DOST dataset contains only real images. Unlike MJSynth [22] and SynthText [23], which aim at training better classifiers, DOST dataset aims at the evaluation of scene text detection/recognition methods.

  2.

    The images were captured completely without intention. In this regard, the most similar dataset is the one dedicated to ICDAR 2015 Robust Reading Competition Challenge 4, “incidental scene text.” It is regarded as not intentionally captured because its images were captured with Google Glass without the wearer having taken any prior action to cause text to appear in the field of view or to improve its positioning or quality in the frame. DOST dataset is completely free from such intention, including even the intention implied by the face direction of the user wearing Google Glass.

  3.

    The dataset is a video dataset (the images are consecutive in time). Video datasets already exist: the 2013 and 2015 editions of the ICDAR Robust Reading Competition (RRC) Challenge 3 datasets [5, 24] consist of sequential images. The biggest difference is that DOST dataset was captured with an omnidirectional camera. Another difference is that DOST dataset contains Japanese text while the ICDAR RRC datasets consist of Latin text. Another video dataset, YVT [25], contains YouTube videos; some texts in that dataset are not scene texts but captions.

  4.

    DOST dataset contains multiple word images of a single word taken from different view angles.

  5.

    The scale of DOST dataset is large. In the following discussion, let us exclude the synthesized datasets and SVHN, which consists of digits. Though the total number of ground truthed images in DOST dataset (32,147) is not very large (almost half of the largest dataset, COCO-Text), the number of word regions (935,601 in total, consisting of 797,919 legible and 137,682 illegible regions) is very large (4.6 times larger than the second largest dataset, COCO-Text). This is because the image size is relatively large (\(1,200 \times 1,600\) pixels) and the images were captured in shopping streets where a lot of text exists. DOST dataset is also the largest in terms of the number of unique word sequences, exceeding the second largest, the ICDAR2015 Challenge 3 dataset, by a factor of 6.3.

Another feature of DOST dataset is that it was manually ground truthed by students. The reason we did not use a crowdsourcing service such as Amazon Mechanical Turk is that most workers cannot read Japanese text.

Yet another feature of DOST dataset is that it contains many Latin characters, even though the images were captured in Japan. The number of characters per category and examples of Japanese characters and symbols are shown in Fig. 7. Kanji (i.e., Chinese characters) are logograms. Katakana and Hiragana are syllabaries derived from Kanji. Though symbols were originally not intended to be ground truthed, some were actually ground truthed. They include frequently used iteration marks such as “々,” which represent a duplicated character. In the future, symbols other than the iteration marks will be discarded by rigorously applying the ground truthing policy.
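
For readers who wish to reproduce such per-category counts from the transcriptions, the following minimal sketch bins characters by standard Unicode block ranges. It is a sketch under our own assumptions (category names, binning rules and the treatment of spaces are illustrative), not the tooling actually used to build DOST.

```python
# Minimal sketch: bin ground-truth transcriptions into character categories
# similar to those of Fig. 7 using standard Unicode block ranges.
# Category names and rules are illustrative assumptions, not DOST tooling.
from collections import Counter

def char_category(ch: str) -> str:
    code = ord(ch)
    if ch.isdigit():
        return "digit"
    if ("a" <= ch <= "z") or ("A" <= ch <= "Z"):
        return "alphabet"
    if 0x3040 <= code <= 0x309F:      # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:      # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs (Kanji)
        return "kanji"
    return "symbol"                   # everything else, incl. iteration marks

def count_categories(transcriptions):
    counts = Counter()
    for text in transcriptions:
        for ch in text:
            if ch == " ":             # spaces stand for occluded characters
                continue
            counts[char_category(ch)] += 1
    return counts

# Example: count_categories(["Bar lona", "たこ焼き"]) yields
# 7 alphabet characters, 3 hiragana and 1 kanji.
```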

Table 2. Summary of publicly available datasets. “Video?” indicates whether the images are consecutive in time. “Real?” indicates whether the dataset consists of real images only (Yes) or not (No; note that captions are regarded as synthesized). #Image represents the total number of images (for a video dataset, the total number of frames). #Word represents the number of ground truthed word regions. For a video dataset, #WS represents the number of word sequences that do not consist of only “don’t care” regions.
Fig. 7.

Number of characters per category and examples of Japanese characters and symbols.

3 Construction of DOST Dataset

DOST dataset was constructed through the following procedure.

  1.

    Image capture

    Scene images were captured with an omnidirectional camera, a Point Grey Ladybug3, consisting of five horizontal cameras and one upward camera, shown in Fig. 1. It was set up on a cart, shown in Fig. 2, together with a laptop computer and a car battery. A pair of students walked along each shopping street pushing the cart. Images were captured at 6.5 fps in uncompressed mode. The resolution of each captured image was \(1,200 \times 1,600\) pixels. Lens distortion of the captured images was rectified with software provided by the camera vendor. This process was completed in 2012. Table 3 summarizes where, for how long and how many images we captured.

  2.

    Ground truthing

    Selected sequences were ground truthed by hand, unlike the COCO-Text dataset [29], which used existing scene text detection/recognition methods. The reason we did not use such methods was that the scene texts contained in our images were too difficult for them. We developed a ground truthing tool, shown in Fig. 8, to make the process efficient. Similar to LabelMe Video [30], it has a functionality to transfer text information (text labels) in a frame to neighboring frames using a homography. However, objects in the scene do not lie on a single plane as a homography assumes. Hence, following the homography computation, more precise word positions were determined by sliding-window template matching (a minimal sketch of this transfer-and-refine step is given after this list). Table 4 shows the distribution of sequence lengths. Each image was checked at least twice by different persons: once for ground truthing and once for confirmation. When the ground truthing policy was updated, existing ground truths were updated during the confirmation pass. We spent more than 1,500 man-hours on this process.

  3.

    Privacy preservation

    Since the captured images preserve real scenes in shopping streets, we could not avoid capturing passersby. To avoid privacy violations, we blurred the face regions of passersby. At first, we used the Amazon Mechanical Turk service. Later, however, we decided to assign this task to our students as well, so as to ensure quality with less management effort.
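
As a concrete illustration of the transfer-and-refine step mentioned in the ground truthing item above, the OpenCV sketch below propagates a word box from one frame to the next with a feature-based homography and then refines it by template matching. The function name, the feature choice (ORB) and the search margin are our own illustrative assumptions; the actual tool may differ in detail.

```python
# Sketch of the label-transfer step: a word box is propagated from frame t
# to frame t+1 via a homography estimated from feature matches, then refined
# by sliding-window template matching because scene text is generally not on
# the single plane a homography assumes. Names and parameters are illustrative.
import cv2
import numpy as np

def transfer_box(prev_gray, next_gray, box_pts, margin=20):
    """box_pts: 4x2 float32 array of the word's corner points in prev_gray."""
    # 1) Estimate a global homography between consecutive frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # 2) Warp the previous box into the next frame as an initial guess.
    warped = cv2.perspectiveTransform(box_pts.reshape(-1, 1, 2), H).reshape(-1, 2)

    # 3) Refine around the initial guess by template matching.
    x, y, w, h = cv2.boundingRect(box_pts.astype(np.int32))
    template = prev_gray[y:y + h, x:x + w]
    wx, wy, ww, wh = cv2.boundingRect(warped.astype(np.int32))
    x0, y0 = max(wx - margin, 0), max(wy - margin, 0)
    search = next_gray[y0:y0 + wh + 2 * margin, x0:x0 + ww + 2 * margin]
    if search.shape[0] >= h and search.shape[1] >= w:
        res = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, (dx, dy) = cv2.minMaxLoc(res)
        return np.float32([[x0 + dx, y0 + dy], [x0 + dx + w, y0 + dy],
                           [x0 + dx + w, y0 + dy + h], [x0 + dx, y0 + dy + h]])
    return warped
```

In practice the transferred box would still be adjusted by the annotator, which is consistent with the manual confirmation pass described above.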

Fig. 8.

Ground truthing software that can transfer text information (label) to neighboring frames.

Table 3. Place, time length (in hours) and the number of images captured.
Table 4. Distribution of lengths of image sequences.

4 Ground Truthing Policy

The ground truthing policy of DOST dataset is largely shared with that of the 2013 and 2015 editions of the ICDAR Robust Reading Competition Challenge 3 datasets [5, 24]. Since DOST dataset contains not only Latin but also Japanese text, in addition to the ground truthing policy for Latin script, we defined one for Japanese text. The ground truthing policy of DOST dataset is summarized below (a hypothetical annotation record illustrating these fields is sketched after the list).

  1.

    Basic unit

    A bounding box is created for each basic unit such as a word. In Latin text, a word region delimited by spaces is a basic unit. On the other hand, a Japanese sentence is written without spaces between words or grammatical units. Hence, as the basic unit of a Japanese sentence, we use the bunsetsu, which is the smallest unit of words that sounds natural in a spoken sentence. A proper noun is not divided.

    There is an exception: if the quality of the text is “low,” multiple texts of low quality are covered by a single bounding box (see “Transcription” below).

  2.

    Partial occlusion and out of frame

    Even if the region of a basic unit is partially occluded or partially out of the frame, it is regarded as a single basic unit without division.

  3.

    Bounding box

    To cope with perspective distortion, the bounding box of a basic unit is represented by four independent corner points.

  4.

    Transcription

    The transcription of a basic unit region consists of its visible characters. If a basic unit region is partially occluded or partially out of the frame, the visible characters are transcribed and the invisible character(s) are represented by a space. For example, suppose there is a segmented word region of “Barcelona” but “ce” is occluded; then the transcription should be “Bar lona.” In Fig. 6, an underscore represents such a space.

  5.

    ID

    The same ID is assigned to the sequence of a basic unit as long as it can be traced within the frame. An exception is the case where a basic unit completely disappears because it goes out of the frame; in such a case, even if it appears again, a different ID is assigned to the new occurrence.

  6.

    Quality

    One of “high,” “medium” or “low” is assigned to each basic unit based on subjective evaluation. Basic units labeled “high” or “medium” are regarded as legible. Annotators were allowed to enlarge the image to check whether a unit is legible. Basic units labeled “low” are regarded as “don’t care” regions: even if a text detection method detects such a basic unit, it is not regarded as a detection failure.

  7.

    Language

    Either “Latin” or “Japanese” is assigned to each basic unit. A basic unit consisting of only alphabets and digits is labeled “Latin.” A basic unit containing at least one character that is neither an alphabet letter nor a digit is labeled “Japanese.” This is useful for performing experiments using only Latin text.
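
To make the policy concrete, a single basic-unit annotation could be represented by a record like the following. The field names and values are purely hypothetical and only illustrate the attributes defined above; this is not the actual DOST file format.

```python
# Purely hypothetical record illustrating the fields defined by the policy
# above (four corner points, transcription with spaces for occluded
# characters, track ID, quality and language). NOT the actual DOST format.
annotation = {
    "frame": 1234,
    "id": 42,                          # kept while the unit remains traceable
    "points": [(310, 415), (478, 402), (481, 460), (313, 472)],  # 4 corner points
    "transcription": "Bar lona",       # "ce" occluded, replaced by a space
    "quality": "medium",               # "high"/"medium" = legible, "low" = don't care
    "language": "Latin",               # only alphabets and digits -> "Latin"
}

def is_dont_care(ann):
    # "low"-quality units are ignored when scoring detection results.
    return ann["quality"] == "low"
```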

5 Comparison of Datasets

The difficulty of major datasets was compared using two detection methods and one end-to-end recognition method. To reduce the computational burden, for some datasets a part of the data was randomly sampled and used in the experiment. The datasets compared and how they were processed are described below.

  1.

    ICDAR2003 [4]

    All (258) images in the training set were used in the experiment.

  2.

    ICDAR2013 (Challenge 2) [5]

    All (229) images in the training set were used.

  3.

    ICDAR2015 (Challenge 3) [24]

    Images were sampled once every 30 frames from 10 of the 24 training videos. As a result, 207 images were selected.

  4.

    ICDAR2015 (Challenge 4) [24]

    All (1,000) images in the training set of the “End to End” task (Task 4.4) of ICDAR 2015 Robust Reading Competition Challenge 4 were used.

  5.

    SVT [3]

    All (350) images in both training and test sets were used.

  6.

    YVT [25]

    Images were sampled once every 30 frames from all (30) videos. As a result, 420 images were selected.

  7.

    COCO-Text [29]

    300 images were randomly sampled from those containing words annotated as English, legible and machine printed (referred to as target words). The 300 images contained 2,403 target words as well as words that do not satisfy these conditions (non-target words). The non-target words were treated as “don’t care” regions.

  8.

    DOST (this paper)

    Images were sampled once every 30 frames from all ground truthed sequences. As a result, 1,075 images were selected.

  9.

    DOST Latin (this paper)

    This setting evaluates DOST dataset as a Latin scene text dataset containing only alphabets and digits. For text detection and recognition, the same images as in “DOST” above were used. In evaluation, words containing characters other than alphabets and digits were treated as “don’t care” regions. Thus, even if Japanese texts are detected, this does not affect the result.

Two detection methods were used for the evaluation. One was the scene text detection method included in OpenCV version 3.0, which is based on Neumann et al. [31]. The other was the method of Matsuda et al. [32]; its source code was privately provided to us by courtesy of the authors. In addition, the Google Vision API was used as a powerful commercial end-to-end recognition system. The API allows the language of the texts to be designated. Only for “DOST” did we designate Japanese; in this mode, English texts can also be detected and recognized, although the accuracies are expected to be lower. For the other datasets, including “DOST Latin,” we designated English.
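
As an illustration of how such an end-to-end run with a language designation might look with the current Google Cloud Vision Python client, consider the sketch below, which passes a language hint (“ja” for DOST, “en” for the other datasets). The client library, field names and call shape are assumptions on our side; the experiments in this paper used the API as offered at the time, which may have differed in detail.

```python
# Hedged sketch of an end-to-end run with the Google Cloud Vision Python
# client, passing a language hint as described above. This reflects the
# current client library, not necessarily the API version used in the paper.
from google.cloud import vision

def recognize(path, language="ja"):
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(
        image=image,
        image_context={"language_hints": [language]},
    )
    # The first annotation is typically the full text; the rest are word-level.
    words = []
    for ann in response.text_annotations[1:]:
        box = [(v.x, v.y) for v in ann.bounding_poly.vertices]
        words.append((ann.description, box))
    return words
```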

Table 5. Detection and recognition results on the selected datasets. Evaluation criteria are recall (R), precision (P) and F-measure (F), given in percent.

For performance evaluation, we used the same evaluation criteria regardless of the dataset. For both the text detection and the end-to-end word recognition tasks, we followed the evaluation criteria used in the “incidental scene text” challenge (Challenge 4) of the ICDAR 2015 Robust Reading Competition. That is, for the scene text detection task, based on a single Intersection-over-Union (IoU) criterion with a threshold of 50 %, a detected bounding box was regarded as correct if it overlapped a ground truth bounding box by more than 50 %. Recall and precision were calculated by the following equations.

$$\begin{aligned} \mathrm {Recall}&= \frac{\text {Number of correctly detected bounding boxes}}{\text {Number of bounding boxes in ground truth}} \end{aligned}$$
(1)
$$\begin{aligned} \mathrm {Precision}&= \frac{\text {Number of correctly detected bounding boxes}}{\text {Number of detected bounding boxes}} \end{aligned}$$
(2)

Then, the F-measure was calculated as the harmonic mean of precision and recall. For the end-to-end word recognition task, a detected bounding box was regarded as correct if it satisfied the condition of the scene text detection task and, in addition, the estimated transcription was completely correct. Recall, precision and F-measure were calculated in the same way as for the detection task.
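
The criteria above can be summarized in a short scoring routine: a detection is correct when its IoU with an unmatched ground truth box exceeds the 0.5 threshold, “don’t care” regions are ignored, and recall, precision and F-measure follow Eqs. (1) and (2). The sketch below is simplified (axis-aligned boxes, greedy one-to-one matching); the official Robust Reading Competition scripts operate on quadrilaterals and handle further corner cases.

```python
# Simplified sketch of the detection scoring described above: IoU >= 0.5,
# one-to-one matching, "don't care" ground truth excluded from the counts.
# Boxes are axis-aligned (x1, y1, x2, y2) for brevity.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def evaluate(detections, ground_truths, dont_cares, thr=0.5):
    matched_gt, tp, fp = set(), 0, 0
    for det in detections:
        if any(iou(det, dc) >= thr for dc in dont_cares):
            continue                   # detections on "don't care" regions are ignored
        hit = next((i for i, gt in enumerate(ground_truths)
                    if i not in matched_gt and iou(det, gt) >= thr), None)
        if hit is None:
            fp += 1
        else:
            matched_gt.add(hit)
            tp += 1
    recall = tp / float(len(ground_truths)) if ground_truths else 0.0
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f
```

For the end-to-end task, the same routine applies with the additional condition that the estimated transcription matches the ground truth exactly.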

The results are summarized in Table 5. As can be seen, the results on “DOST” and “DOST Latin” were far worse than those on the other datasets. This indicates that DOST dataset, which reflects the real environment, is more challenging than the major benchmark datasets.

6 Conclusion

Although many publicly available scene text datasets already exist, none of them was intentionally constructed to reflect the real environment. Hence, even though scene text detection/recognition methods achieve high accuracies on these existing major benchmark datasets, it has not been possible to evaluate how good they are for practical use. To address this problem, we presented a new scene text dataset named Downtown Osaka Scene Text Dataset (DOST dataset for short). Unlike most existing datasets, which consist of intentionally captured scene images, DOST dataset consists of uncontrolled scene images; the use of an omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Since the dataset preserves real scenes containing text as they are, its contents are, in other words, scene texts in the wild. Through the evaluation conducted in this paper to assess its difficulty and quality in comparison with existing datasets, we demonstrated that DOST dataset is more challenging than the major benchmark datasets.