1 Introduction

Vision is the most important way for humans to perceive the outside world. Common image datasets generated in daily life, such as Microsoft COCO [16] and Pascal VOC [5], have laid an important data foundation for deep-learning-based visual understanding methods. Earth remote sensing datasets represented by DOTA [24] and autonomous driving datasets represented by KITTI [7] have greatly promoted the development of perception algorithms in related fields.

Full-mark tools represented by LabelMe [22] have been widely used for data annotation in object detection, semantic segmentation, instance segmentation, and other visual understanding tasks. The annotated data makes it possible to learn a mapping between data and labels, and also provides a basis for evaluating algorithms on large-scale data.

The full-mark method annotates all object regions in an image from scratch: one simply loads a picture and marks it with a tool such as LabelMe. However, full marking is very redundant for large-scale data.

Take VOC as an example. For the detection task, it contains 11,540 images and 27,450 objects in 20 categories. Each image has at most \(1024 \times 1024\) pixels, so the entire dataset has fewer than 12,100,567,040 pixels. In contrast, the DOM-7M dataset of Chang'E-2 [1] contains 844 sub-frames of \(29055 \times 29055\) pixels each, a total of 712,498,913,100 pixels, about 59 times as many as VOC. This means that labeling lunar data is very costly (Fig. 1).
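The pixel-count comparison above can be reproduced with simple arithmetic. The figures come from the text; the sub-frames are assumed square here, which is what the quoted total pixel count implies.

```python
# Rough pixel-count comparison between Pascal VOC and the Chang'E-2 DOM-7M data.
# Sub-frame side length is assumed to be 29055 pixels in both dimensions.

voc_images = 11540
voc_pixels = voc_images * 1024 * 1024            # upper bound: each image < 1024 x 1024

change2_subframes = 844
change2_pixels = change2_subframes * 29055 * 29055

ratio = change2_pixels / voc_pixels
print(voc_pixels)        # 12100567040
print(change2_pixels)    # 712498913100
print(round(ratio, 1))   # 58.9
```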

Fig. 1. (a) and (d) are original images from VOC, (c) and (f) are proposals from the auxiliary method, (b) and (e) are the ground truth.

On the other hand, the semantic information on the lunar surface is less rich than in everyday scenes. According to the manifold hypothesis, high-dimensional data of the same type in nature concentrates near a low-dimensional manifold [13, 14]. Intuitively, in pedestrian detection, although each person's detection box differs with body shape, the position and shape of the boxes follow certain regularities. For example, the height is usually greater than the width, and the aspect ratio lies within a small interval [15]. However, such a data distribution is difficult to express by formal methods, and its features are often high-dimensional.

Therefore, based on these observations, the auxiliary marking task for detection is to obtain a function \(f(X)=Y\), where \(X \in R^{W \times H \times 3}\) is an RGB image and \(Y\) is a sequence of boxes \(y_i\). Each \(y_i = (x,y,w,h) \in R^4\) represents a rectangular box, namely the smallest rectangle containing all pixels of the object.
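The bounding-box definition above can be made concrete with a minimal sketch: given the pixel coordinates belonging to one object, \(y_i = (x,y,w,h)\) is the tightest axis-aligned rectangle containing them all. The function name below is illustrative, not from the paper.

```python
# Minimal sketch of the smallest-rectangle definition of a detection label.

def min_bounding_box(pixels):
    """pixels: iterable of (col, row) coordinates belonging to one object."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    x, y = min(xs), min(ys)
    w = max(xs) - x + 1   # +1 so a single pixel has width/height 1
    h = max(ys) - y + 1
    return (x, y, w, h)

# A small L-shaped object occupying 4 pixels:
print(min_bounding_box([(3, 2), (3, 3), (4, 3), (5, 3)]))  # (3, 2, 3, 2)
```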

A common way to fit the function f is to use detection networks such as the R-CNN series [2, 8, 9]. These methods have achieved great success on the COCO dataset but require strong computing resources, while the marking task is usually performed on a desktop computer. Our method uses a lightweight network to provide real-time feedback using only a CPU.

In general, our method adds an auxiliary marking module on top of the existing full-mark method and adopts a human-friendly design for label organization. In our experiments, our method yields a 2.5× speedup on the general dataset and a 2.8× speedup on the lunar dataset.

2 Related Work

The auxiliary marking task can be approached from three aspects: the data source, the auxiliary method, and the marking tool. Our method was tested on both a general dataset and the Chang'E series dataset. The auxiliary method adopts object detection, and the marking tool adopts the full-mark method represented by LabelMe. These three parts are introduced below.

For marking the object detection task, researchers have proposed a large number of detection algorithms. Due to the limited length of this paper, we only introduce the detection algorithms most closely related to our work. Interested readers can refer to a more detailed survey [17].

2.1 Object Detection Algorithm

Object detection is a fundamental topic in computer vision research. It can be simply defined as finding the minimum horizontal rectangle that contains all pixels of an object; refinements such as polygon detection are beyond the scope of this article. In recent years, Deep Convolutional Neural Networks (DCNNs) have achieved great success in object detection. Detection models need to handle the diversity of object appearance and scale. The classic approaches are the two-stage methods represented by Faster R-CNN [20] and the one-stage methods represented by YOLO [19] and SSD [6, 18].

Two-Stage Method. Faster R-CNN achieves high-precision object detection through feature sharing and a region proposal network (RPN). In recent years, methods such as R-FCN [4], Cascade R-CNN [2], and MS R-CNN [11] have reached new heights on general datasets. However, these high-precision detectors are not suitable for auxiliary marking because of the heavy computation introduced by their multi-step and cascade designs. Therefore, our method adopts a lighter one-stage approach. With the development of network deployment and optimization tools such as ONNX [21] and TVM [3], using a two-stage method as an auxiliary network may become practical in the future.

One-Stage Method. Compared to two-stage methods, one-stage methods omit the proposal step and directly regress bounding boxes on the feature map. They have a great advantage in speed and are therefore often used on devices with limited computational resources, such as embedded machines. SSD is representative of this series: it directly predicts detection boxes and their categories without a proposal process, and uses the anchor mechanism to handle objects of different sizes. In recent years, researchers have proposed R-SSD [12] on the basis of SSD; its multi-feature fusion brings a notable improvement on small objects. This field is developing rapidly, but the basic idea of the SSD series has not fundamentally changed (Fig. 2).
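The anchor mechanism mentioned above can be illustrated with a short sketch: at every cell of a feature map, default boxes of several scales and aspect ratios are placed, and the network regresses offsets from these anchors rather than raw coordinates. The scale and ratio values below are examples only, not the ones used in the paper.

```python
# Illustrative sketch of SSD-style anchor (default box) generation.

def make_anchors(fmap_size, img_size, scale, ratios):
    step = img_size / fmap_size          # stride of one feature-map cell in pixels
    anchors = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step   # cell centre
            for r in ratios:
                w = scale * img_size * (r ** 0.5)          # wider box for r > 1
                h = scale * img_size / (r ** 0.5)
                anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(fmap_size=4, img_size=512, scale=0.2, ratios=(1.0, 2.0, 0.5))
print(len(anchors))  # 4 * 4 * 3 = 48 anchors
```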

Fig. 2. The three stages of the lunar image.

Lightweight Neural Network. Small networks are usually obtained in two ways: compressing a large model or directly training a small model. Model compression transforms a large network into one with fewer parameters and less computation through methods such as quantization, hashing, pruning, and vector coding. The second way is to train small models directly. The MobileNets [10] used in our method focus on optimizing latency while keeping the network small.

MobileNet is a lightweight and efficient network designed for mobile devices. Its basic architecture is streamlined, and it uses depthwise separable convolutions to build a lightweight neural network. Moreover, it has roughly two orders of magnitude fewer parameters than the original VGG [23]. Lightweight neural networks remain an important research direction, and proposals such as ShuffleNet [25] provide alternatives to MobileNet.
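The parameter saving of depthwise separable convolutions can be checked with back-of-the-envelope arithmetic: a standard k×k convolution costs k·k·C_in·C_out parameters, while the separable version pays k·k·C_in for the depthwise stage plus C_in·C_out for the 1×1 pointwise stage. The channel sizes below are illustrative.

```python
# Parameter comparison: standard 3x3 convolution vs. depthwise separable convolution.

def standard_conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    depthwise = k * k * c_in          # one kxk filter per input channel
    pointwise = c_in * c_out          # 1x1 convolution mixing channels
    return depthwise + pointwise

c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)     # 589824
sep = separable_conv_params(c_in, c_out)    # 67840
print(round(std / sep, 1))                  # 8.7, i.e. roughly 8-9x fewer parameters
```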

2.2 Full Mark Tool

LabelMe is a widely used full-mark tool. It supports pixel-level operations and uses JSON as its data exchange format. The binary stream of the edited image is embedded in the JSON data. A label is organized as a sequence of points, and the region enclosed by the closed polygon is the labeled area. This approach groups different data types together and provides good robustness across data types.
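The point-sequence organization described above can be sketched as follows. The record mimics LabelMe's "shapes" layout with a "label" and a "points" list, but it is constructed by hand for illustration; real LabelMe files carry additional fields such as the embedded image data.

```python
import json

# Sketch of reading a LabelMe-style annotation: each label is a polygon stored
# as a point sequence, and the enclosed region is the labelled area.

record = json.loads("""
{
  "shapes": [
    {"label": "crater", "points": [[10, 10], [40, 12], [38, 35], [12, 33]]}
  ]
}
""")

for shape in record["shapes"]:
    xs = [p[0] for p in shape["points"]]
    ys = [p[1] for p in shape["points"]]
    # tightest axis-aligned box around the polygon, useful as a detection label
    print(shape["label"], (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)))
```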

The full-mark operation is performed entirely by the marker and therefore inevitably depends on the marker: attention and efficiency decrease over a large number of repetitive markings.

Fig. 3. Architecture of our model.

3 Proposed Methods

The auxiliary marking task can be simplified to the process shown in Fig. 3. First, a proposal network produces suggestion boxes for a set of objects of interest; then the marker fine-tunes the results of the proposal network.

3.1 The Algorithm Processes

Let \(I \in R^{W \times H \times 3}\) be an RGB image. The manual marking time is as follows.

$$\begin{aligned} T_{completely} = Time(I, H_{completely}) \end{aligned}$$
(1)

\(H_{completely}\) denotes the process of completely manual marking, while for the auxiliary process the time can be expressed as

$$\begin{aligned} T_{proposal}=Time(P(I),H_{proposal}) \end{aligned}$$
(2)

Function P is the proposal network, and \(H_{proposal}\) is the process of adjusting the proposed object boxes. With consistent accuracy, the sum of the time of P(I) and the adjustment should be less than the time of completely manual marking.
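The criterion above can be illustrated numerically: auxiliary marking pays off when the proposal-network inference time plus the adjustment time is below the fully manual time. All numbers below are made up for illustration; they are not the measured values from the experiments.

```python
# Toy illustration of T_proposal = Time(P(I), H_proposal) < T_completely.

t_completely = 30.0      # seconds to mark one image from scratch (illustrative)
t_proposal_net = 0.5     # P(I): proposal-network inference on CPU (illustrative)
t_adjust = 11.0          # H_proposal: fine-tuning the proposed boxes (illustrative)

t_proposal = t_proposal_net + t_adjust
assert t_proposal < t_completely                  # auxiliary marking is worthwhile
print(round(t_completely / t_proposal, 2))        # speed-up multiplier, here 2.61
```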

3.2 Quantification

For such a pipeline process, we use a quantitative method to measure time. The completely manual pipeline is shown in Fig. 4, and the auxiliary pipeline in Fig. 5.

Fig. 4. Full-mark method pipeline.

The completely manual method can be quantized into the above pipeline. Each step is as follows.

  • A, The marking software loads images.

  • B, The marker makes a preliminary observation of the picture, following experimental principles of human-computer interaction.

  • C, Confirm the objects in the image and start marking.

  • D, Marking and classifying; this can be divided into several E, F processes.

  • E, Confirm the object boundary.

  • F, Assign a label to the confirmed boundary.

  • G, Save and move on to the next image.

It can be seen from actual operation that there is no way to parallelize pipeline stages in the traditional completely manual method. The pipeline of the auxiliary-mark process is shown in Fig. 5. Each step is as follows.

Fig. 5. Auxiliary-mark method pipeline.

  • 1, Time for the proposal network to generate recommendations.

  • 2, Fine-tune the suggested labels, e.g., correct mistaken objects.

  • 3, Confirm the objects in the image and complete the marking.

3.3 Proposal Network Design

The proposal network uses the SSD framework, but we replace the VGG backbone with MobileNet-v1, which has fewer parameters and can run in real time on a PC with only a CPU. In our experiments, training was run on a GPU server and testing on a CPU-only personal notebook.

The SSD algorithm feeds the feature blocks of each layer's feature map into a classifier and a regressor, and its main computational cost lies in the feature extraction part. In this task, the mAP has a non-linear relationship with the marking time of the auxiliary process; we demonstrate this in the experimental section.

3.4 Segmentation and Combine the Chang’E Image

The original image is about \(2 \times 10^5\) pixels in both length and width, which is too large to load. We layer it with a quadtree, where each node is a \(512 \times 512\)-pixel image, as in Fig. 6(b).

The field of view of each node is 4 times that of the next layer, so objects of different sizes appear at different levels. In Fig. 6(a), the bounding box has a robust size for lunar craters.
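The quadtree layering can be sketched as a recursive split: each level halves the side length until tiles reach the target size, so every node's field of view is 4× that of its children. The sizes are illustrative, and the image is represented only by its width and height.

```python
# Sketch of quadtree tiling for an image too large to load at once.

def quadtree_tiles(w, h, x=0, y=0, target=512):
    """Yield leaf tile rectangles (x, y, w, h) of a quadtree split."""
    if max(w, h) <= target:
        yield (x, y, w, h)
        return
    hw, hh = w // 2, h // 2
    # four quadrants; the second half takes the remainder for odd sizes
    for dx, dy, tw, th in ((0, 0, hw, hh), (hw, 0, w - hw, hh),
                           (0, hh, hw, h - hh), (hw, hh, w - hw, h - hh)):
        yield from quadtree_tiles(tw, th, x + dx, y + dy, target)

tiles = list(quadtree_tiles(2048, 2048))
print(len(tiles))   # 2048 -> 1024 -> 512: 4 * 4 = 16 leaf tiles
print(tiles[0])     # (0, 0, 512, 512)
```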

Fig. 6. (a) Combining the tiles into a complete image. (b) Using a quadtree to divide a large picture into small pictures.

4 Experiments

4.1 Pascal VOC Dataset

We comprehensively evaluate our method on PASCAL VOC 2007. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. Table 1 shows the marking time with the SSD-MobileNet proposal network.

In this experiment, three markers each annotated three sets of photos, and we averaged their times. The three sets are the completely set, the auxiliary set, and the observe set. In the completely set, the marker labels an image without any supplementary information. In the auxiliary set, the marker receives the output of the proposal network, which provides weak marks. In the observe set, the marker only observes the image without marking it.

Table 1. The time for different methods on VOC with SSD-MobileNet (mAP@0.5 = 75.4)

4.2 CCD Data of Chang’E Series

The Chang'E-1 mission is China's first lunar exploration mission. Chang'E-2 is based on the backup satellite of the lunar exploration project; after technical improvements, it serves as the pilot satellite of the second phase of the project and tests some of its key technologies.

Its main scientific goal is to use a CCD stereo camera to capture three-dimensional images of the lunar surface. The Chang'E-2 orbit is 100 km high, and at its best resolution the CCD stereo camera achieves a pixel spacing of 7 m.

Table 2 shows the improvement brought by the proposal network for marking Chang'E data.

Table 2. The time for different methods on Chang'E with SSD-MobileNet (mAP@0.5 = 71.0)

4.3 The Relationship Between mAP and Auxiliary Cost

In this experiment we explore the relationship between auxiliary networks of different mAP and the marking time. In Fig. 7, the x axis is the marking time and the y axis is the mAP of the different proposal networks.

Fig. 7. Marking time versus mAP.

5 Conclusion

In this paper, we proposed a useful proposal network for the image marking process and applied it to a common dataset and the Chang'E dataset. To solve the problem that the Chang'E CCD data is too large to load, we proposed a quadtree layering method. Marking speed increased on both the generic dataset and the private data. We also explored the relationship between marking time and mAP. Improving the network and the interaction method in the future are promising research directions.