1 Introduction

Vision is the most important way for humans to perceive the outside world. Common image datasets generated in daily life, such as Microsoft COCO [16] and Pascal VOC [5], have laid an important data foundation for deep-learning-based visual understanding methods. Earth remote sensing datasets represented by DOTA [24] and autonomous driving datasets represented by KITTI [7] have greatly promoted the development of perception algorithms in related fields.

Full-mark tools represented by LabelMe [22] have been widely used for data annotation in object detection, semantic segmentation, instance segmentation, and other visual understanding tasks. The annotated data makes it possible to learn a mapping between data and labels, and also provides a basis for evaluating algorithms on large-scale data.

The full-mark method annotates all object regions in an image from scratch: one simply loads a picture and marks it with a tool such as LabelMe. However, full marking is very redundant for large-scale data.

Take VOC as an example. For the detection task, it contains 11,540 images and 27,450 objects in 20 categories. Each image has at most \(1024 \times 1024\) pixels, so the entire dataset has fewer than 12,100,567,040 pixels. In contrast, the DOM-7M dataset of Chang'E-2 [1] contains 844 sub-frames of \(29055 \times 29055\) pixels each, a total of 712,498,913,100 pixels, about 59 times as many as VOC. This means that labeling lunar data is very costly (Fig. 1).
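The pixel-count comparison above can be reproduced with simple arithmetic. The figures come from the text; the sub-frames are assumed square here, which is what the quoted total pixel count implies.

```python
# Rough pixel-count comparison between Pascal VOC and the Chang'E-2 DOM-7M data.
# Sub-frame side length is assumed to be 29055 pixels in both dimensions.

voc_images = 11540
voc_pixels = voc_images * 1024 * 1024            # upper bound: each image < 1024 x 1024

change2_subframes = 844
change2_pixels = change2_subframes * 29055 * 29055

ratio = change2_pixels / voc_pixels
print(voc_pixels)        # 12100567040
print(change2_pixels)    # 712498913100
print(round(ratio, 1))   # 58.9
```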

Fig. 1. (a) and (d) are original images from VOC, (c) and (f) are proposals from the auxiliary method, (b) and (e) are the ground truth.

On the other hand, the semantic information on the lunar surface is less rich than in everyday scenes. According to the manifold hypothesis, high-dimensional data of the same type in nature concentrates near a low-dimensional manifold [13, 14]. Intuitively, in pedestrian detection, although each person's detection box differs with body shape, the position and shape of the boxes follow certain regularities. For example, the height is usually greater than the width, and the aspect ratio lies within a small interval [15]. However, such a data distribution is difficult to express by formal methods, and its features are often high-dimensional.

Therefore, based on these observations, the auxiliary marking task for detection is to obtain a function \(f(X)=Y\), where \(X \in R^{W \times H \times 3}\) is an RGB image and \(Y\) is a sequence of boxes \(y_i\). Each \(y_i = (x,y,w,h) \in R^4\) represents a rectangular box, namely the smallest rectangle containing all pixels of the object.
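The bounding-box definition above can be made concrete with a minimal sketch: given the pixel coordinates belonging to one object, \(y_i = (x,y,w,h)\) is the tightest axis-aligned rectangle containing them all. The function name below is illustrative, not from the paper.

```python
# Minimal sketch of the smallest-rectangle definition of a detection label.

def min_bounding_box(pixels):
    """pixels: iterable of (col, row) coordinates belonging to one object."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    x, y = min(xs), min(ys)
    w = max(xs) - x + 1   # +1 so a single pixel has width/height 1
    h = max(ys) - y + 1
    return (x, y, w, h)

# A small L-shaped object occupying 4 pixels:
print(min_bounding_box([(3, 2), (3, 3), (4, 3), (5, 3)]))  # (3, 2, 3, 2)
```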

A common way to fit the function f is to use detection networks such as the R-CNN series [2, 8, 9]. These methods have achieved great success on the COCO dataset but require strong computing resources, while the marking task is usually performed on a desktop computer. Our method uses a lightweight network to provide real-time feedback using only a CPU.

In general, our method adds an auxiliary marking module on top of the existing full-mark method and adopts a human-friendly design for label organization. In our experiments, our method yields a 2.5× speedup on the general dataset and a 2.8× speedup on the lunar dataset.

2 Related Work

The auxiliary marking task can be approached from three aspects: the data source, the auxiliary method, and the marking tool. Our method was tested on both a general dataset and the Chang'E series dataset. The auxiliary method adopts object detection, and the marking tool adopts the full-mark method represented by LabelMe. These three parts are introduced below.

For marking the object detection task, researchers have proposed a large number of detection algorithms. Due to the limited length of this paper, we only introduce the detection algorithms most closely related to our work. Interested readers can refer to a more detailed survey [17].

2.1 Object Detection Algorithm

Object detection is a fundamental topic in computer vision research. It can be simply defined as finding the minimum horizontal rectangle that contains all pixels of an object; refinements such as polygon detection are beyond the scope of this article. In recent years, Deep Convolutional Neural Networks (DCNNs) have achieved great success in object detection. Detection models need to handle the diversity of object appearance and scale. The classic approaches are the two-stage methods represented by Faster R-CNN [20] and the one-stage methods represented by YOLO [19] and SSD [6, 18].

Two-Stage Method. Faster R-CNN achieves high-precision object detection through feature sharing and a region proposal network (RPN). In recent years, methods such as R-FCN [4], Cascade R-CNN [2], and MS R-CNN [11] have reached new heights on general datasets. However, these high-precision detectors are not suitable for auxiliary marking because of the heavy computation introduced by their multi-step and cascade designs. Therefore, our method adopts a lighter one-stage approach. With the development of network deployment and optimization tools such as ONNX [21] and TVM [3], using a two-stage method as an auxiliary network may become practical in the future.

One-Stage Method. Compared to two-stage methods, one-stage methods omit the proposal step and directly regress bounding boxes on the feature map. They have a great advantage in speed and are therefore often used on devices with limited computational resources, such as embedded machines. SSD is representative of this series: it directly predicts detection boxes and their categories without a proposal process, and uses the anchor mechanism to handle objects of different sizes. In recent years, researchers have proposed R-SSD [12] on the basis of SSD; its multi-feature fusion brings a notable improvement on small objects. This field is developing rapidly, but the basic idea of the SSD series has not fundamentally changed (Fig. 2).
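The anchor mechanism mentioned above can be illustrated with a short sketch: at every cell of a feature map, default boxes of several scales and aspect ratios are placed, and the network regresses offsets from these anchors rather than raw coordinates. The scale and ratio values below are examples only, not the ones used in the paper.

```python
# Illustrative sketch of SSD-style anchor (default box) generation.

def make_anchors(fmap_size, img_size, scale, ratios):
    step = img_size / fmap_size          # stride of one feature-map cell in pixels
    anchors = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step   # cell centre
            for r in ratios:
                w = scale * img_size * (r ** 0.5)          # wider box for r > 1
                h = scale * img_size / (r ** 0.5)
                anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(fmap_size=4, img_size=512, scale=0.2, ratios=(1.0, 2.0, 0.5))
print(len(anchors))  # 4 * 4 * 3 = 48 anchors
```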

Fig. 2. The three stages of the lunar image.

Lightweight Neural Network. Small networks are usually obtained in two ways: compressing a large model or directly training a small model. Model compression transforms a large network into one with fewer parameters and less computation through methods such as quantization, hashing, pruning, and vector coding. The second way is to train small models directly. The MobileNets [10] used in our method focus on optimizing latency while keeping the network small.

MobileNet is a lightweight and efficient network designed for mobile devices. Its basic architecture is streamlined, and it uses depthwise separable convolutions to build a lightweight neural network. Moreover, it has roughly two orders of magnitude fewer parameters than the original VGG [23]. Lightweight neural networks remain an important research direction, and proposals such as ShuffleNet [25] provide alternatives to MobileNet.
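The parameter saving of depthwise separable convolutions can be checked with back-of-the-envelope arithmetic: a standard k×k convolution costs k·k·C_in·C_out parameters, while the separable version pays k·k·C_in for the depthwise stage plus C_in·C_out for the 1×1 pointwise stage. The channel sizes below are illustrative.

```python
# Parameter comparison: standard 3x3 convolution vs. depthwise separable convolution.

def standard_conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    depthwise = k * k * c_in          # one kxk filter per input channel
    pointwise = c_in * c_out          # 1x1 convolution mixing channels
    return depthwise + pointwise

c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)     # 589824
sep = separable_conv_params(c_in, c_out)    # 67840
print(round(std / sep, 1))                  # 8.7, i.e. roughly 8-9x fewer parameters
```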

2.2 Full Mark Tool

LabelMe is a widely used full-mark tool. It supports pixel-level operations and uses JSON as its data exchange format. The binary stream of the edited image is embedded in the JSON data. A label is organized as a sequence of points, and the region enclosed by the closed polygon is the labeled area. This approach groups different data types together and provides good robustness across data types.
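The point-sequence organization described above can be sketched as follows. The record mimics LabelMe's "shapes" layout with a "label" and a "points" list, but it is constructed by hand for illustration; real LabelMe files carry additional fields such as the embedded image data.

```python
import json

# Sketch of reading a LabelMe-style annotation: each label is a polygon stored
# as a point sequence, and the enclosed region is the labelled area.

record = json.loads("""
{
  "shapes": [
    {"label": "crater", "points": [[10, 10], [40, 12], [38, 35], [12, 33]]}
  ]
}
""")

for shape in record["shapes"]:
    xs = [p[0] for p in shape["points"]]
    ys = [p[1] for p in shape["points"]]
    # tightest axis-aligned box around the polygon, useful as a detection label
    print(shape["label"], (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)))
```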

The full-mark operation is performed entirely by the marker and therefore inevitably depends on the marker: attention and efficiency decrease over a large number of repetitive markings.

Fig. 3. Architecture of our model.

3 Proposed Methods

The auxiliary marking task can be simplified to the process shown in Fig. 3. First, a proposal network produces suggestion boxes for a set of objects of interest; then the marker fine-tunes the results of the proposal network.

3.1 The Algorithm Processes

Let \(I \in R^{W \times H \times 3}\) be an RGB image. The manual marking time is as follows.

$$\begin{aligned} T_{completely} = Time(I, H_{completely}) \end{aligned}$$
(1)

\(H_{completely}\) denotes the process of completely manual marking, while for the auxiliary process the time can be expressed as

$$\begin{aligned} T_{proposal}=Time(P(I),H_{proposal}) \end{aligned}$$
(2)

Function P is the proposal network, and \(H_{proposal}\) is the process of adjusting the proposed object boxes. With consistent accuracy, the sum of the time of P(I) and the adjustment should be less than the time of completely manual marking.
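The criterion above can be illustrated numerically: auxiliary marking pays off when the proposal-network inference time plus the adjustment time is below the fully manual time. All numbers below are made up for illustration; they are not the measured values from the experiments.

```python
# Toy illustration of T_proposal = Time(P(I), H_proposal) < T_completely.

t_completely = 30.0      # seconds to mark one image from scratch (illustrative)
t_proposal_net = 0.5     # P(I): proposal-network inference on CPU (illustrative)
t_adjust = 11.0          # H_proposal: fine-tuning the proposed boxes (illustrative)

t_proposal = t_proposal_net + t_adjust
assert t_proposal < t_completely                  # auxiliary marking is worthwhile
print(round(t_completely / t_proposal, 2))        # speed-up multiplier, here 2.61
```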

3.2 Quantification

For such a pipeline process, we use a quantitative method to measure time. The completely manual pipeline is shown in Fig. 4, and the auxiliary pipeline in Fig. 5.

Fig. 4. Full-mark method pipeline.

The completely manual method can be quantized into the above pipeline. Each step is as follows.

  • A, The marking software loads images.

  • B, The marker makes a preliminary observation of the picture, following experimental principles of human-computer interaction.

  • C, Confirm the objects in the image and start marking.

  • D, Marking and classifying; this can be divided into several E, F processes.

  • E, Confirm the object boundary.

  • F, Assign a label to the confirmed boundary.

  • G, Save and move on to the next image.

It can be seen from actual operation that there is no way to parallelize pipeline stages in the traditional completely manual method. The pipeline of the auxiliary-mark process is shown in Fig. 5. Each step is as follows.

Fig. 5. Auxiliary-mark method pipeline.

  • 1, Time for the proposal network to generate recommendations.

  • 2, Fine-tune the suggested labels, e.g., correct mistaken objects.

  • 3, Confirm the objects in the image and complete the marking.

3.3 Proposal Network Design

The proposal network uses the SSD framework, but we replace the VGG backbone with MobileNet-v1, which has fewer parameters and can run in real time on a PC with only a CPU. In our experiments, training was run on a GPU server and testing on a CPU-only personal notebook.

The SSD algorithm feeds the feature blocks of each layer's feature map into a classifier and a regressor, and its main computational cost lies in the feature extraction part. In this task, the mAP has a non-linear relationship with the marking time of the auxiliary process; we demonstrate this in the experimental section.

3.4 Segmentation and Combine the Chang’E Image

The original image is about \(2 \times 10^5\) pixels in both length and width, which is too large to load. We layer it with a quadtree, where each node is a \(512 \times 512\)-pixel image, as in Fig. 6(b).

The field of view of each node is 4 times that of the next layer, so objects of different sizes appear at different levels. In Fig. 6(a), the bounding box has a robust size for lunar craters.
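The quadtree layering can be sketched as a recursive split: each level halves the side length until tiles reach the target size, so every node's field of view is 4× that of its children. The sizes are illustrative, and the image is represented only by its width and height.

```python
# Sketch of quadtree tiling for an image too large to load at once.

def quadtree_tiles(w, h, x=0, y=0, target=512):
    """Yield leaf tile rectangles (x, y, w, h) of a quadtree split."""
    if max(w, h) <= target:
        yield (x, y, w, h)
        return
    hw, hh = w // 2, h // 2
    # four quadrants; the second half takes the remainder for odd sizes
    for dx, dy, tw, th in ((0, 0, hw, hh), (hw, 0, w - hw, hh),
                           (0, hh, hw, h - hh), (hw, hh, w - hw, h - hh)):
        yield from quadtree_tiles(tw, th, x + dx, y + dy, target)

tiles = list(quadtree_tiles(2048, 2048))
print(len(tiles))   # 2048 -> 1024 -> 512: 4 * 4 = 16 leaf tiles
print(tiles[0])     # (0, 0, 512, 512)
```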

Fig. 6. (a) Combining the tiles into a complete image. (b) Using a quadtree to divide a large picture into small pictures.

4 Experiments

4.1 Pascal VOC Dataset

We comprehensively evaluate our method on PASCAL VOC 2007. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. Table 1 shows the marking time with the SSD-MobileNet proposal network.

In this experiment, three markers each annotated three sets of photos, and we averaged their times. The three sets are the completely set, the auxiliary set, and the observe set. In the completely set, the marker labels an image without any supplementary information. In the auxiliary set, the marker receives the output of the proposal network, which provides weak marks. In the observe set, the marker only observes the image without marking it.

Table 1. The time for different methods on VOC with SSD-MobileNet (mAP@0.5 = 75.4)

4.2 CCD Data of Chang’E Series

The Chang'E-1 mission is China's first lunar exploration mission. Chang'E-2 is based on the backup satellite of the lunar exploration project; after technical improvements, it serves as the pilot satellite of the second phase of the project and tests some of its key technologies.

Its main scientific goal is to use a CCD stereo camera to capture three-dimensional images of the lunar surface. The Chang'E-2 orbit is 100 km high, and at its best resolution the CCD stereo camera achieves a pixel spacing of 7 m.

Table 2 shows the improvement brought by the proposal network for marking Chang'E data.

Table 2. The time for different methods on Chang'E with SSD-MobileNet (mAP@0.5 = 71.0)

4.3 The Relationship Between mAP and Auxiliary Cost

In this experiment we explore the relationship between auxiliary networks of different mAP and the marking time. In Fig. 7, the x axis is the marking time and the y axis is the mAP of the different proposal networks.

Fig. 7. Marking time versus mAP.

5 Conclusion

In this paper, we proposed a useful proposal network for the image marking process and applied it to a common dataset and the Chang'E dataset. To solve the problem that the Chang'E CCD data is too large to load, we proposed a quadtree layering method. Marking speed increased on both the generic dataset and the private data. We also explored the relationship between marking time and mAP. Improving the network and the interaction method in the future are promising research directions.