1 Introduction

Reading digital video clocks (also called time recognition) is an active research problem because the clock time plays a critical role in video event detection and event inference [1, 2, 7, 8, 13,14,15,16,17]. This paper considers the common case in which a digital clock has been superimposed on the video. Although current videos have a text channel that can store encoded clock or timestamp information, the algorithm presented in this paper does not need these encoded clocks or timestamps. Most sports and surveillance videos carry superimposed digital clocks or timestamps for various reasons, such as showing game-related time in sports videos or the recording time in surveillance videos. For example, the video clock in a soccer video indicates the game time elapsed at the current frame, whereas the reverse-running video clock in a basketball video indicates the game time remaining at the current frame. In surveillance videos and in sports videos recorded from TV programs by digital recorders, a superimposed digital clock or timestamp is one way to guard against malicious tampering with the encoded timestamp information stored in the text channel [1, 2, 7, 8, 13,14,15,16,17]. Hence, it is highly desirable to develop algorithms that read the superimposed digital video clocks independently of the clock or timestamp encoded in the text channel.

Reading digital video clocks is a special case of reading text from videos, which is a very challenging problem [3, 9, 11]. Recent algorithms for reading text are based on sliding-window scanning and deep networks, and are a kind of region-based method [4,5,6, 10,11,12]. This region-based method reaches the best performance (an accuracy of 83.3%) for object detection [12]. Its flow consists of generating candidate regions and then detecting the object within those regions, so its detection accuracy depends on the recall of the candidate-region step. Although region-based methods employ inexpensive features for the selective search of candidate regions, they still require considerable running time for the detection task. YOLO (You Only Look Once) [12] was proposed in 2016 to predict bounding boxes and class probabilities directly from full images in one evaluation. It was quickly applied to various problems because it is simple, fast, and accurate [12]. However, YOLO has not yet been applied to the problem of reading digital video clocks.

Researchers have been designing custom algorithms for reading clocks because no general algorithm achieves satisfactory performance for reading text from images or videos [1, 2, 7, 8, 14,15,16,17]. The problem of reading digital video clocks can be divided into two sub-problems: clock-area localization and clock-digit recognition. The first sub-problem is a special case of the general text (character) localization problem. The second, which arises once the text area has been localized, is a special case of the text recognition problem within the text area. Researchers have proposed a number of custom methods for these two sub-problems [1, 2, 7, 8, 14,15,16,17]. In the early years, several algorithms adopted an image processing approach to localize clock digits in video [1, 2]. These algorithms have low accuracy. Improved algorithms were then proposed [7, 8]. They use clock-digit periodicity to verify the localized characters, but they still rely on image processing to localize the clock-digit candidates. In particular, they find the character candidates by performing character segmentation and connected component analysis (CCA) on the detected clock board. They then monitor all character candidates to find the one whose color changes approximately once per second, a property called region periodicity. These algorithms are tedious and not robust because they depend on the error-prone processes of character segmentation and CCA.

In 2012, a pixel periodicity method was proposed in [14], together with a custom algorithm based on it that localizes clock digits without the tedious image processing components. That paper designed a set of functions to describe the second-pixel periodicity, and the heuristic algorithm using those functions achieved 100% accuracy on second-digit place localization. However, that algorithm is a custom heuristic rather than a neural network algorithm, and the thresholds in those functions were set manually rather than learned. The periodicity of the value change of the s-digit pixels disclosed in [14, 15] can be used to design an algorithm for reading clocks, but it is difficult to design features that represent this periodicity. Hence, a set of mathematical functions was designed to describe it. Additionally, the algorithm in [14, 15] only uses the periodicity of second-digit place pixels but the constancy of neighbouring second-digit place pixels.

With the advance of deep networks and the high performance of YOLO, this paper uses a set of neural networks, particularly YOLO, to replace the heuristic components of the algorithm presented in [14, 15], aiming to eliminate that algorithm's drawbacks. The general idea is to use the properties of working digital video clocks to customize deep networks, individually or connected together, for the tasks of reading digital video clocks. A convolutional neural network (CNN) is first used to identify the relatively constant pixels. Then, based on [14, 15], a frame-aligned pixel recognition network (PRN) is proposed to identify the s-digit pixels, whose color values change periodically, among the neighbours of the identified constancy pixels. Compared with the heuristic functions, this removes the need to set thresholds. More importantly, the parameters of deep networks have the potential to exploit the properties of digital video clocks better than the heuristic functions do. After the second-digit place is localized, the area that contains all the digits of the clock can be determined. The remaining task is to localize and recognize all the digits in this area. This paper proposes two YOLO-based procedures that mainly adopt the YOLO framework with customized deep networks. Thus, the two heuristic procedures in [14, 15], finding the bounding boxes of digits and recognizing the clock digits, are performed by neural networks.

The rest of the paper is organized as follows. Section 2 presents the technical details of the proposed algorithm for reading digital video clocks. Experimental results are presented in Sect. 3. Section 4 concludes the paper.

2 Two Phases of Deep Networks for Reading Clocks

2.1 Notations and Overview of the Proposed Algorithm

This paper divides the problem of reading a digital video clock into two tasks: clock area localization and clock digit recognition. The task of clock area localization is to find the area that contains all the digits of a clock; the task of clock digit recognition is to localize each digit and recognize it. For the first task, a phase of customized deep networks is proposed. It first uses a CNN-based procedure to identify the constancy pixels; it then localizes the s-digit place with the pixel recognition network (PRN) and YOLO [12], which uses the Clock-Digit Recognition Network (CDRN) as its first several layers. The CDRN is a clock digit classifier network based on the deep network proposed by LeCun in [4]. In this paper, the CDRN is the base of YOLO and is used to extract features of the digits in a video clock. The CDRN is trained on only 11 classes: the digit classes 0 to 9 and a non-digit class for the rest of the clock area. YOLO localizes the bounding box of the s-digit, and the clock area is then localized from this bounding box. Finally, within this clock area, the s-digit bounding box is slid with YOLO to localize and recognize the other digits.

Definition 1:

(s-digit, x-digit) In a video clock, the digit in the seconds place is called an s-digit; any digit representing tens of seconds, minutes, or tens of minutes is called an x-digit.

The algorithm for reading video clocks is described in Algorithm 1. It has two main phases: clock area localization and clock digit recognition.

figure a

2.2 A Phase of Deep Networks for Localizing the Clock Area

This section presents a phase of neural networks that localizes the clock area by making use of the properties of digital clocks.

Some Properties of S-Digit Pixels. Some properties of s-digit pixels are presented first so that the proposed methods can be understood. Figure 2 shows the pixel periodicity on the s-digit place. Some notations and concepts used in this paper follow [15], and the relevant formulae for computing the s-digit bounding box are presented here.

Let W and H be the width and the height of the images of a given video. Let B be the set of all pixels within an image. Let \(F _{i} \) be the considered frame. Then \(F _{i-R} \), \(F _{i-R+1} \), ... , \(F _{i-1} \) and \(F _{i} \), \(F _{i+1}\), ... , \(F _{i+R-1}\) are the R frames in the preceding second and the succeeding second, respectively. Let \(c(k,p)\) be the grey value of pixel p in frame \(F _{k} \). Then we have the following definitions.

Definition 2:

(Constancy Pixel) Let \(F _{k} \), for \(k=1\) to L, be L consecutive frames spanning at least 3 seconds. Pixel p is called a constancy pixel if it meets the following condition.

\( \left| c(k,p)-C _{1} \right| < \beta _{1} \) for \(k=1\) to L, where \(C _{1}= \dfrac{1}{L}\sum _{k=1}^{L}c(k,p) \) and \( \beta _{1} \) is a threshold.
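As an illustration only, Definition 2 can be evaluated directly with array operations. The sketch below is not the CNN-based procedure used in this paper; it simply applies the condition of Definition 2 to every pixel. The function name `constancy_mask`, the synthetic frames, and the value chosen for the threshold are assumptions made for the example.

```python
import numpy as np

def constancy_mask(frames: np.ndarray, beta1: float) -> np.ndarray:
    """Apply Definition 2 to every pixel.

    frames: array of shape (L, H, W) holding grey values c(k, p).
    Returns a boolean (H, W) mask that is True for constancy pixels.
    """
    frames = frames.astype(np.float64)
    c1 = frames.mean(axis=0)                     # C_1: temporal mean per pixel
    max_dev = np.abs(frames - c1).max(axis=0)    # largest |c(k,p) - C_1| over k
    return max_dev < beta1                       # condition must hold for all k

# Synthetic clip: 3 seconds at 25 fps, constant background, one changing pixel.
L, H, W = 75, 4, 4
frames = np.full((L, H, W), 100.0)
frames[:, 1, 2] = np.where((np.arange(L) // 25) % 2 == 0, 30.0, 220.0)
mask = constancy_mask(frames, beta1=10.0)        # beta1 = 10 is an assumed value
```

In this synthetic clip every pixel except the one at row 1, column 2 is reported as a constancy pixel.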

Definition 3:

(Constancy Adjacency Pixel) A non-constancy pixel (NCP) is called a constancy adjacency pixel (CAP) if it is a neighbour of a constancy pixel (CP), i.e. \(dist(NCP, CP_i)<\beta _{2}\) for some constancy pixel \(CP_i\), where \(\beta _{2}\) is a distance threshold.

We design a CNN-based procedure to identify the constancy pixels according to Definition 2. It uses the mean and the variance of the pixel values to identify the constancy pixels. Once the constancy pixels are known, all constancy adjacency pixels can be found according to Definition 3. Next, the PRN is used to find the s-digit pixels among the constancy adjacency pixels.

Finding the S-Digit Pixels with the Periodicity of S-Digit Pixels. We localize the pixels belonging to the s-digit place by finding pairs consisting of a constancy pixel and a pixel with periodicity. A sample of the periodic variation of the gray value of a second pixel is shown in Fig. 2.
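Definition 3 can be sketched as a small neighbourhood search over a constancy-pixel mask. The distance function is not specified in the text, so the sketch assumes the Chebyshev (chessboard) distance on the pixel grid; the function name `constancy_adjacency_mask` and the example mask are likewise hypothetical.

```python
import numpy as np

def constancy_adjacency_mask(cp_mask: np.ndarray, beta2: float) -> np.ndarray:
    """Mark non-constancy pixels that lie within distance < beta2 of a CP.

    Assumes Chebyshev distance; cp_mask is a boolean (H, W) array that is
    True for constancy pixels.
    """
    cp_mask = np.asarray(cp_mask, dtype=bool)
    r = int(np.ceil(beta2)) - 1          # largest integer distance below beta2
    h, w = cp_mask.shape
    padded = np.pad(cp_mask, r, mode="constant", constant_values=False)
    near_cp = np.zeros((h, w), dtype=bool)
    for dy in range(2 * r + 1):          # OR together all shifts within radius r
        for dx in range(2 * r + 1):
            near_cp |= padded[dy:dy + h, dx:dx + w]
    return near_cp & ~cp_mask            # keep only the non-constancy pixels

# Example: a 5x5 image whose central 3x3 block is non-constant.
cp = np.ones((5, 5), dtype=bool)
cp[1:4, 1:4] = False
cap = constancy_adjacency_mask(cp, beta2=2)   # beta2 = 2 is an assumed value
```

With this threshold the eight border pixels of the central block are constancy adjacency pixels, while the centre pixel is not because all of its neighbours are non-constant.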

Fig. 1.
figure 1

The digit in the digit box changes continuously over 10 s (the video frame rate is 25 fps); the red dot indicates one of the second pixels. (Color figure online)

Fig. 2.
figure 2

The upper figure shows the gray value map of the second pixel point (red dot) for 10 consecutive seconds, and the lower one shows the gray value map of the frame difference for 10 consecutive seconds. (Color figure online)

As shown in Fig. 1, at the transit frames of the s-digit, the change in the gray value of a second pixel is significantly larger than at other times. We therefore propose an efficient pixel recognition network based on frame alignment. We convert the frame sequence of length \(n\_seconds*25\) into a two-dimensional matrix with \(n\_seconds\) rows and 25 columns, so that the transit frames are aligned, as illustrated in Fig. 3. The structure of the CDRN is shown in Fig. 4.
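The frame alignment step can be illustrated with a reshape. The sketch below builds a synthetic pixel whose gray value jumps once per second at a fixed frame offset, takes the absolute frame difference, and aligns it into an \(n\_seconds\) by 25 matrix. The function name `align_pixel_sequence`, the synthetic signal, and the choice of feeding frame differences are assumptions for illustration; the exact PRN input is the one shown in Fig. 3.

```python
import numpy as np

def align_pixel_sequence(values, fps: int = 25) -> np.ndarray:
    """Reshape a per-frame value sequence into an (n_seconds, fps) matrix."""
    values = np.asarray(values, dtype=np.float64)
    n_seconds = len(values) // fps               # drop an incomplete last second
    return values[: n_seconds * fps].reshape(n_seconds, fps)

# Synthetic second pixel: 10 s at 25 fps, value toggles at frame offset 7.
k = np.arange(250)
grey = np.where(((k - 7) // 25) % 2 == 0, 200.0, 50.0)

# Absolute frame difference; the first frame has no predecessor, so it gets 0.
diff = np.abs(np.diff(grey, prepend=grey[0]))
aligned = align_pixel_sequence(diff)

# Column index holding the largest total change across all rows.
stripe_column = int(aligned.sum(axis=0).argmax())
```

Because the transitions recur every 25 frames, all of the change falls into a single column of the aligned matrix, which is the vertical-stripe pattern a convolutional network can pick up.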

Fig. 3.
figure 3

The structure of the PRN and its input data; k indicates the kernel size, s the stride, and n the number of convolutional layers.

Fig. 4.
figure 4

The structure of YOLO with CDRN. Layers 0–5 form the CDRN, and layers 6–12 are the rest of YOLO. The 9th layer combines the outputs of the 5th and 8th layers. In the 11th layer, the number of filters is 80 because each grid cell in YOLO predicts 5 boxes and each box has 16 parameters: 11 class probabilities, 4 coordinate parameters, and 1 confidence.

Why the PRN can recognize s-digit pixels well. At the transit frames, the difference in gray values is pronounced, whereas at other times the values are nearly constant. Through frame alignment, the one-dimensional pixel data stream is converted into a two-dimensional array that can be regarded as a gray image. In this gray image, the change of the second-pixel gray value is periodic, which places the larger values in adjacent columns. The gray image therefore shows vertical stripes, and the CNN-based pixel recognition network can learn these features. The experimental results show that the pixel recognition network generalizes to detecting certain periodic patterns and learns better than the compared methods.

S-Digit Localization: CNNs [4] and YOLO [12] are customized to obtain the bounding box of the s-digit. The CDRN is used for clock digit feature extraction inside the area identified by YOLO. As shown in Fig. 4, the structure of the CDRN is simpler than that of ResNet-50 and DarkNet-19 because the CDRN only recognizes clock digits.

Deciding the Clock Area: A procedure is designed to decide the clock area based on the preceding outcomes, such as the found s-digit place, and on the following two facts: (1) the digits in a clock area usually share the same color and bounding box size; (2) the pixels around the clock area are background and are therefore constant. Based on these two facts, we can localize the clock area from the s-digit bounding box.
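The filter count quoted in the caption of Fig. 4 follows from the per-box parameter count. The short check below reproduces that arithmetic; the helper name `yolo_output_filters` is hypothetical and not part of any YOLO implementation.

```python
def yolo_output_filters(boxes_per_cell: int, num_classes: int) -> int:
    """Filters in the detection layer: each box carries the class
    probabilities, 4 coordinates, and 1 confidence value."""
    return boxes_per_cell * (num_classes + 4 + 1)

# 5 boxes per grid cell and 11 classes (digits 0 to 9 plus non-digit).
filters = yolo_output_filters(boxes_per_cell=5, num_classes=11)
```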

2.3 Reading Clock Digits by a Phase of Deep Networks

A CDRN-based procedure is proposed to localize and recognize s-digits in the found clock area because traditional OCR cannot achieve satisfactory performance on this task.

Custom Networks for Localizing and Recognizing Clock Digits. After the bounding box of the s-digit has been found by YOLO with CDRN, we use a digit sequence to recognize the s-digit with the CDRN. This procedure is built on the following facts. If frame t is an s-digit transit frame, frames \(t+k *R+1\) to \(t+(k+1) *R\) show the same s-digit because the s-digit transits every R frames. Thus, the s-digit in frames \(t+k *R+1\) to \(t+(k+1) *R\) is k if the s-digit in the frames from t to \(t+R\) is “0”. In other words, according to clock knowledge, the s-digits in the frames from t to \(t+v *R\) form a periodically increasing digit sequence, assuming that the input clip is v seconds long (\(v < 10\)). Based on these facts, we use a 3-digit sequence CNN recognition procedure, denoted Procedure I, both to find the s-digit transit frames and to recognize the s-digits.
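The frame-to-digit relation stated above can be written as a small function. This is a sketch of the indexing only, not of Procedure I itself. The function name `expected_s_digit`, the `first_digit` and `step` parameters, and the wrap-around modulo 10 are assumptions added for illustration; `step=-1` is meant to model a reverse-running clock such as the one in basketball videos.

```python
def expected_s_digit(frame: int, transit_frame: int, fps: int = 25,
                     first_digit: int = 0, step: int = 1) -> int:
    """Expected s-digit at `frame`, given that `transit_frame` (t) is an
    s-digit transit frame and frames t+1 .. t+R show `first_digit`."""
    if frame <= transit_frame:
        raise ValueError("frame must come after the transit frame")
    k = (frame - transit_frame - 1) // fps   # frames t+kR+1 .. t+(k+1)R share k
    return (first_digit + step * k) % 10

t = 40                                       # hypothetical transit frame index
digits = [expected_s_digit(t + offset, t) for offset in (1, 25, 26, 50, 51)]
```

For the offsets listed, the expected digits are 0, 0, 1, 1 and 2, which matches the statement that frames \(t+k *R+1\) to \(t+(k+1) *R\) show the digit k when the sequence starts at “0”.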

figure b

Once the s-digit transit frames are known from Procedure I, the transit frames of all x-digits are known as well. Thus, for any x-digit we can take at least 50 frames showing the same digit from a 4-second clip (our videos run at 25 frames per second, so the clip has 100 frames, and an x-digit changes at most once within 4 seconds). Hence, an odd number of these 50 frames can be selected to recognize an x-digit in Procedure II.
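One plausible reason for selecting an odd number of frames is to combine the per-frame predictions by majority vote. The text does not state how the frames are combined in Procedure II, so the sketch below is an assumption, and the function name `vote_digit` and the sample predictions are hypothetical.

```python
from collections import Counter

def vote_digit(predictions: list[int]) -> int:
    """Return the most frequent per-frame prediction for one x-digit.

    An odd number of frames rules out a two-way tie; with more than two
    competing labels a tie is still possible, in which case the label
    seen first wins.
    """
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd number of frames")
    return Counter(predictions).most_common(1)[0][0]

# Five per-frame classifier outputs for the same x-digit, two of them wrong.
x_digit = vote_digit([3, 3, 8, 3, 5])
```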

3 Experimental Results

The algorithm for reading digital video clocks is implemented in C++. To evaluate the proposed algorithm, a dataset was built. It comprises 300 broadcast soccer clips and 300 broadcast basketball clips, each 15 seconds long. Each soccer clip has a single clock; each basketball clip has two clocks. All clocks in the clips are working clocks. The clips vary in digit color, digit background color, size, and font.

With suitable threshold parameters, the CPP method [15] can achieve good results, but these parameters are difficult to set. Our experimental data were collected with the CPP method and the provided threshold parameters.

3.1 Experiments on the S-Digit Pixel Identification

To verify the effectiveness of the Pixel Recognition Network (PRN), this paper compares it with two commonly used methods, namely SVM and FCN (fully connected network). The PRN is implemented in Caffe (C++), and its details are described in Sect. 2. The SVM uses libsvm (C++, svm_type=c_svc, kernel_type=rbf), and the FCN (layer=[125, 10, 2], activation=sigmoid) is implemented in Caffe (C++). The results are presented in Table 1.

Train: 20162 positive samples, 21003 negative samples

Test: 20143 positive samples, 20925 negative samples

Table 1. Comparison with SVM, and FCN for recognizing s-digit pixel in Test

From Table 1 we draw the following conclusions. First, the recall of the proposed method is generally higher than its precision because there are more non-s-digit pixels than s-digit pixels around the stable pixels. Second, in exploiting the periodicity of the s-digit pixels, the Pixel Recognition Network (PRN) gives the best results with little time consumption. In addition, the PRN can be generalized to detect certain periodic patterns.
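For reference, precision and recall for the s-digit pixel class are computed from the confusion counts in the usual way. The counts in the sketch below are made up purely to show the calculation and are not taken from Table 1; the function name `precision_recall` is hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only: many false positives lower precision below recall.
precision, recall = precision_recall(tp=90, fp=30, fn=10)
```

With these illustrative counts the precision is 0.75 and the recall is 0.9, the same ordering as the one discussed above.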

3.2 Experiments on Finding the S-Digit Bounding Box

From the s-digit pixels detected as in Sect. 2, we generate the s-digit region, and then use the Clock-Digit Recognition Network (CDRN) and YOLO to obtain the s-digit bounding box. Unlike the general YOLO detection framework, we use the CDRN as the backbone instead of commonly used networks such as VGG, ResNet, and DarkNet. The CDRN is an improvement on LeNet-5; its structure is simpler, as shown in Table 2, and it is more suitable for extracting features of video clock digits. The experiments show that the CDRN extracts the digit features of the video clock better.

In this step, we use the algorithm presented in [14, 15] with the best threshold parameters to collect more than 20,000 varied s-digit region images. The training set contains 10435 images and the test set contains 10779. We then convert the images to gray scale and enlarge them 8 times, which gives the s-digit regions a higher resolution. The results of localizing the s-digit bounding box are shown in Table 2.
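The preprocessing described above can be sketched as follows. The text does not specify the gray-scale conversion or the interpolation method, so the sketch assumes the common ITU-R BT.601 luma weights and nearest-neighbour enlargement; the function name `to_gray_and_upscale` and the input size are hypothetical.

```python
import numpy as np

def to_gray_and_upscale(rgb: np.ndarray, factor: int = 8) -> np.ndarray:
    """Convert an (H, W, 3) RGB image to gray and enlarge it `factor` times.

    Assumes BT.601 luma weights and nearest-neighbour repetition.
    """
    rgb = rgb.astype(np.float64)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    # Repeat every pixel `factor` times along both image axes.
    return np.repeat(np.repeat(gray, factor, axis=0), factor, axis=1)

region = np.zeros((12, 9, 3), dtype=np.uint8)   # hypothetical s-digit region
region[..., :] = (255, 255, 255)                # plain white patch
enlarged = to_gray_and_upscale(region)
```

A 12 by 9 region becomes a 96 by 72 gray image; a smoother interpolation could be substituted without changing the rest of the pipeline.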

Table 2. The result of the localizing s-digit bounding box in Test

From Table 2 we can draw the following conclusions. First, our method locates the bounding box of the s-digit more accurately and in the least time. Second, compared with ResNet-50, the Clock-Digit Recognition Network (CDRN) has a simpler structure and localizes the s-digit equally well. Third, the results indicate that the CDRN is more suitable for extracting clock digit features.

3.3 Experiments on Clock Digit Localization and Recognition in Clock Area

In this step, we use the algorithms in [14, 15] with the best threshold parameters to collect more than 20,000 varied clock area images. Next, we use the sliding CDRN to locate and recognize the digits in the clock area, as described in Procedure II (Sect. 2.3). The library is Caffe (C++), and the results are shown in Table 3. The accuracy is the ratio of correctly recognized digits to all digits in all clock areas.
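Sliding a box of the s-digit size over the clock area amounts to enumerating candidate windows, each of which would then be passed to the digit classifier. The sketch below only generates the windows; the function name `sliding_boxes`, the stride, and the coordinates are assumptions, and the classifier call is omitted.

```python
def sliding_boxes(area: tuple[int, int, int, int], box_w: int, box_h: int,
                  stride: int = 1) -> list[tuple[int, int, int, int]]:
    """Enumerate (x0, y0, x1, y1) windows of size box_w x box_h that fit
    inside `area`, itself given as (x0, y0, x1, y1) with exclusive x1, y1."""
    ax0, ay0, ax1, ay1 = area
    boxes = []
    for y in range(ay0, ay1 - box_h + 1, stride):
        for x in range(ax0, ax1 - box_w + 1, stride):
            boxes.append((x, y, x + box_w, y + box_h))
    return boxes

# Hypothetical clock area 10 px wide and 4 px high, s-digit box 4 x 4 px.
candidates = sliding_boxes(area=(0, 0, 10, 4), box_w=4, box_h=4, stride=2)
```

Because the digits in a clock area usually share one bounding box size, a single window size taken from the s-digit is enough for this enumeration.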

Table 3. The result of the recognition in clock area

4 Conclusions and Future Work

This paper has presented an algorithm for reading digital video clocks that eliminates the drawbacks of existing heuristic algorithms by using two phases of connected neural networks. The first phase localizes the clock area: it first finds the s-digit place and then expands it to obtain the clock area. The second phase adopts YOLO as its framework and uses deep networks, customized by making use of the properties of digital clocks, as the base of YOLO. The experimental results show that the proposed algorithm achieves high accuracy in second-digit localization and in reading all the digits of the clocks. This paper makes the following contributions. First, it proposes a pixel recognition network (based on frame alignment) to identify the periodic s-digit pixels. This is the first neural network that can identify individual pixels by making use of the periodicity of pixel values. Second, it proposes the first algorithm composed of a set of neural networks to localize and recognize s-digits and x-digits. Compared with the method that uses a set of functions to localize the s-digit place, it removes the need to set thresholds for those functions, and the trained deep networks have the potential to exploit the properties of digital video clocks better than the heuristic functions.

Two directions of future work can enhance the proposed algorithm. The first is to improve the algorithm design to achieve an accuracy of 100%, matching the accuracy level of the existing heuristic algorithms. The second is to further integrate the connected deep networks into a single deep network, just as YOLO localizes and recognizes objects in one pipeline.