1 Introduction

With the rapid progress of computer vision, web services built on it are developing quickly as well. In the past few years, many companies have launched web services that handle computer vision tasks online. Most of these are commercial products and are therefore not open source for research purposes. In addition, most commercial software does not provide a training system and can only be deployed for inference: its models are pre-trained and its parameters are fixed. This causes problems and limitations in real commercial applications. For example, business models change rapidly while the collected datasets keep growing, which drives the need to refine deep learning and computer vision models. Such products have undoubtedly broadened the application of computer vision, but it is still necessary to push computer vision web services into the next phase.

In this paper, we focus on providing an open source, versatile framework for handling various types of computer vision tasks, including both classic tasks such as image segmentation and object detection and newly developed tasks such as image synthesis. At the same time, it enables model training, testing and deployment in easy-to-use steps. With CVTron Web, all phases of a computer vision deployment are connected.

CVTron Visualization is made of three modules:

  1. Model Library of commonly used computer vision tasks.

  2. Backend Serving for serving the training and inference system. It is also called CVTron-serve.

  3. Frontend User Interface for monitoring, testing and entering parameters. It is also called CVTron visualization.

Each module is an independent project. This separation makes the modules much easier to extend and modify.

2 Related Work

Although there are many web-based libraries and products for handling computer vision tasks, little work covers both training and inference. Clarifai and some newer services support specific training processes, but CVTron Visualization is more ambitious and focuses on providing an open source framework for both commercial and research use.

CVTron Visualization has two main advantages. First, it provides a full workflow for computer vision tasks, which includes training, inference and other utilities. Second, it is fully open source software, which addresses the following shortcomings of closed source software.

  1. Security Concerns. Deep learning based computer vision tasks usually need large amounts of data to train a suitable model. In most cases, the data that users upload is private and valuable and should be well protected. However, users cannot be sure that an uploaded dataset is used only for their own tasks, since they know little about the software itself. From a commercial perspective, security is a major concern, and some companies adopting computer vision algorithms prefer that models be used only offline or on a local network. With CVTron Visualization, users can deploy the system on a local area network, which largely addresses these security issues. The source code of CVTron Visualization is open and can also be audited.

  2. Rapid Progress. Computer vision is one of the most active research fields and has attracted great interest in recent years. Hundreds of new models appear each year in different areas, such as ICNet [1] in real-time image segmentation and the YOLO series [2] in object detection. Closed source software, however, mainly concentrates on specific tasks and does not follow the research trend. At the same time, many models are still in the research stage and cannot yet be used commercially. By combining research results with an application framework, CVTron Visualization provides an interface between the two. Users can implement their own models according to the documentation and put them online in a few simple steps.

  3. Price/Performance. In some fields, the training workload of computer vision models is intensive. Companies in these fields therefore prefer to buy their own GPU-equipped machines rather than use cloud services, which are much more expensive for such workloads. These companies can reduce their costs by deploying CVTron Web on their own machines. Additionally, the built-in web dashboard means that the convenience of the training process is not compromised.

Table 1. Difference between different computer vision services.

To sum up, we compare CVTron Web with other computer vision services in Table 1. The comparison shows that CVTron Web is more ambitious and gains many attractive features from being open source.

3 Architecture

CVTron Visualization is made of three components. In this paper, we mainly focus on the backend serving system and the frontend user interface.

3.1 Model Library

CVTron Web is based on the parent project CVTron, which is a collection of tools, utilities and pre-trained models for computer vision research and applications. Since this paper mainly focuses on the web services interface of CVTron, the model library is introduced only briefly.

The Model Library is a collection of commonly used deep learning based models such as DeepLab [3], YOLO [2] and Inception [4]. The design principles of the CVTron Model Library, which follow ChainerCV [7], are simple and easy to understand. They include:

  1. High Level API. The library provides several cohesive interfaces for calling. Handlers for different types of jobs are wrapped in an interface class. Each interface class provides a handler initialization method and a plug-and-play method for calling.

  2. Layer Based Implementation. The library implements many deep learning models, so its implementation principles are important. We adopt a layer based implementation: commonly used layers are implemented first and then assembled into models. The advantages of this approach include both reproducibility and compositionality.

  3. Reusable Utilities. Many utilities, such as logging and formatting, are used both in the library and in the Visualization toolkit. Given this wide usage, reusability is the most important concern in their design. Most of the utilities in the model library, including the logging tools and the image reader and writer, can be used directly in the backend server system.

3.2 Backend

The architecture of the backend is shown in Fig. 1. In this section, we introduce the design principles and extensions of the backend.

Fig. 1. Backend architecture.

Resource Based Endpoint Design. In recent years, RESTful (REpresentational State Transfer) services have emerged as a standard for backend design. REST assumes that every endpoint refers to a resource, so an endpoint should not contain verbs. In the CVTron backend, the resources are called classifier, detector, segmenter and so on.

Each task is assigned to a class that extends the Resolver class. In other words, the Resolver class is the parent of all task handlers such as classifier, detector and segmenter. Each class should implement the constructor and the get and post methods.

The constructor loads all the parameters needed to initialize an instance. When the client makes a get request, an instance of the specific class is initialized for further inference. In other words, a get request initializes the class, while a post request triggers the task.

Besides the classes extended from Resolver, there are static resource classes that provide device information, configuration and so on. They are mounted to the endpoints named device and node, which return the hardware information and the status of the node respectively.

For compatibility, this work also includes a version in the endpoint, as is common practice. Taking all of this into consideration, an endpoint looks like https://api.example.com/v1/classifier.
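The endpoint conventions above can be illustrated with a minimal, framework-free sketch. It is not the CVTron-serve implementation: only the class name Resolver, the resource names and the get/post semantics come from the text, while the method signatures, the stub Classifier, the ROUTES table and the dispatch helper are assumptions introduced for illustration.

```python
from typing import Any, Dict, Optional


class Resolver:
    """Parent of all task handlers (method signatures are hypothetical)."""

    def __init__(self, config: Optional[Dict[str, Any]] = None) -> None:
        # The constructor only stores the parameters needed for initialization.
        self.config = dict(config or {})
        self.model: Any = None

    def get(self) -> Dict[str, Any]:
        raise NotImplementedError

    def post(self, payload: bytes) -> Dict[str, Any]:
        raise NotImplementedError


class Classifier(Resolver):
    def get(self) -> Dict[str, Any]:
        # A real handler would load a trained model here; a stub stands in.
        self.model = lambda image: [{"label": "placeholder", "confidence": 1.0}]
        return {"resource": "classifier", "initialized": True}

    def post(self, payload: bytes) -> Dict[str, Any]:
        if self.model is None:
            raise RuntimeError("send a get request first to initialize the handler")
        return {"resource": "classifier", "result": self.model(payload)}


# Resource names (nouns, no verbs) mapped to handler instances.
ROUTES: Dict[str, Resolver] = {"classifier": Classifier()}


def dispatch(method: str, path: str, payload: bytes = b"") -> Dict[str, Any]:
    """Resolve a path of the form /v1/<resource> to a handler method."""
    version, resource = path.strip("/").split("/")[:2]
    if version != "v1":
        raise ValueError(f"unsupported API version: {version}")
    handler = ROUTES[resource]
    if method.upper() == "GET":
        return handler.get()
    if method.upper() == "POST":
        return handler.post(payload)
    raise ValueError(f"unsupported method: {method}")
```

In this sketch, a client would first call dispatch("GET", "/v1/classifier") to initialize the handler and then dispatch("POST", "/v1/classifier", image_bytes) to trigger inference. A detector or segmenter would be added by registering another Resolver subclass in ROUTES.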

Service Mount. In some cases, users may need to load multiple models for a specific task. Taking classification as an example, a company may need different models for identifying food images and scene images. To handle this, CVTron Web uses a unique id to identify each model and mounts each model at a different endpoint. To achieve this, CVTron-Serve introduces the following modules.

  1. Service Handler & Endpoint. To create a new endpoint for one of the supported tasks, the client first sends an initialization request. The server needs an endpoint and a handler to resolve this kind of request. Therefore, the Service endpoint is added on top of the basic endpoint design described above. It receives the post request from the client and returns the unique id and related information generated by the following mechanism.

  2. UUID for Identifying Services. To distinguish different services under the same task handler, the system needs a unique identifier for each service. Once the associated model has been initialized, it is assigned a unique identifier, which is then used for storage lookups and in the endpoint. The endpoint then looks like https://api.example.com/v1/classifier/5abc3a, where 5abc3a is the unique identifier. This kind of identification does not break the principle of endpoint design, since the identifier refers to a specific resource.

  3. Service Database. A database is not required but is strongly recommended. It is used to look up additional information from the unique identifier. For example, when the server initializes the model at an endpoint identified by a unique identifier, it needs to find the file path of that model, so the model file path is stored in the database. As long as this query works, the system can use any kind of database, including relational database systems and NoSQL systems such as MongoDB.
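The mounting mechanism can be sketched as follows, under stated assumptions. ServiceRegistry, its method names, the record fields and the example model paths are hypothetical, and an in-memory dictionary stands in for the service database. The identifier here is a full UUID4 hex string, whereas the example in the text uses a shorter id.

```python
import uuid
from typing import Dict


class ServiceRegistry:
    """In-memory stand-in for the service database (hypothetical)."""

    def __init__(self) -> None:
        # service id -> {"task": ..., "model_path": ...}
        self._records: Dict[str, Dict[str, str]] = {}

    def mount(self, task: str, model_path: str) -> Dict[str, str]:
        """Register a model under a task and return its id and endpoint."""
        service_id = uuid.uuid4().hex
        self._records[service_id] = {"task": task, "model_path": model_path}
        return {"id": service_id, "endpoint": f"/v1/{task}/{service_id}"}

    def model_path(self, task: str, service_id: str) -> str:
        """Look up the model file path for an identified service."""
        record = self._records[service_id]
        if record["task"] != task:
            raise KeyError(f"{service_id} is not mounted under {task}")
        return record["model_path"]


registry = ServiceRegistry()
food = registry.mount("classifier", "/models/food.ckpt")    # hypothetical path
scene = registry.mount("classifier", "/models/scene.ckpt")  # hypothetical path
```

Both models live under the classifier resource but resolve to different endpoints, so the food and scene classifiers from the example above can be served side by side. Replacing the dictionary with a relational or NoSQL store would only change the two lookup methods.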

Distributed Computing. As stated before, resources are mounted to endpoints, which means that one endpoint can correspond to several computing resources. These computing resources can be deployed on different machines to implement distributed computing.

Although there are distributed deployment frameworks for machine learning models, such as KubeFlow, CVTron Visualization keeps a minimalist implementation since the use cases are quite simple. The implementation includes the following essential parts:

  1. Status Indicating. Each computing node has an endpoint that reports the node status, which is one of OK, READY, OCCUPIED and DEAD. OK indicates that the node is working normally and can be initialized by a get request, while READY means that the node has been initialized and is ready for a post request and further inference. OCCUPIED means that the node is being used for either inference or training and cannot accept other incoming tasks. DEAD indicates that the node cannot be reached or is not working properly.

  2. Node Chain. If more than one node is available for handling tasks, the nodes are organized as a chain (Fig. 2). An incoming client request walks through this chain, searches for the first available node and assigns the job to it. In other words, the node chain works as a simple load balancer that always assigns jobs to the first available node. The node chain is not a requirement of CVTron visualization: users can also specify the node to use by providing its IP address and port.
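The first-available policy of the node chain can be sketched as below. The four status names come from the text; the Node type, the function names and the choice to treat both OK and READY as available are assumptions. In a deployed system the status would be fetched from each node's status endpoint rather than stored in a local field.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class NodeStatus(Enum):
    OK = "OK"              # working, can be initialized by a get request
    READY = "READY"        # initialized, ready for a post request
    OCCUPIED = "OCCUPIED"  # busy with inference or training
    DEAD = "DEAD"          # unreachable or not working properly


@dataclass
class Node:
    host: str
    port: int
    status: NodeStatus = NodeStatus.OK


def first_available(chain: Iterable[Node]) -> Optional[Node]:
    """Walk the chain in order and return the first node that can take a job."""
    for node in chain:
        if node.status in (NodeStatus.OK, NodeStatus.READY):
            return node
    return None


def assign(chain: Iterable[Node]) -> Optional[Node]:
    """Assign a job to the first available node and mark it as occupied."""
    node = first_available(chain)
    if node is not None:
        node.status = NodeStatus.OCCUPIED
    return node
```

Because the walk always starts at the head of the chain, earlier nodes receive jobs whenever they are free and later nodes are used only as overflow.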

Fig. 2. The node chain structure.

3.3 Frontend and User Interface

Single Page Application. CVTron Visualization adopts a single page application (SPA) design. An SPA is a web application that loads a single page and dynamically updates that page throughout the interaction between the user and the application. It contrasts with the multiple page application (MPA), which follows the traditional approach of rendering every change as a new page. MPAs require constant page reloading, which can harm the user experience. SPAs, in contrast, can be lighter and faster, since most resources are loaded once and the page is then updated during the interaction, which spans the whole life cycle of the application. For these reasons, CVTron Visualization adopts the SPA pattern. More specifically, CVTron Visualization can be separated into Components, Pages and Services.

Components. The components of CVTron Visualization are divided into basic components and high-level components. A basic component handles a single specific task and has zero dependencies. Basic components include the upload component, line chart component, image loading component, table component and so on. Multiple basic components constitute a high-level component. High-level components are meant to handle both training and inference tasks. In the training phase, almost every kind of task, including detection and classification, is composed of a similar group of basic components. In the inference phase, different tasks usually require different components for display. The inference phase therefore uses the following types of components.

Fig. 3. A classification example with image & table component.

  1. Image & Table. Several tasks show their result in a table. Taking classification as an example (as shown in Fig. 3), the backend returns pairs of confidence and label, which are well suited to a table. For this kind of task, CVTron Visualization therefore uses an image component for the original image and a table for the inference result.

  2. Image & Graph. Detection is one of the most important computer vision tasks and is widely used in security applications, UAVs and so on. For this kind of task, CVTron Visualization uses a canvas to draw the detection result. More specifically, it is the frontend, not the backend, that draws rectangles or circles on the detected objects, which is the main difference between this component and the Image & Image component.

  3. Image & Image. Many tasks are based on image-to-image relations. These include not only classic label map applications such as image segmentation but also pixel-to-pixel applications such as image synthesis. For this kind of task, the output image, whether a label map or a generated image, is processed by the backend. After processing, the backend provides a URL to the post-processed image, so the frontend only needs to load the two images in two places (as shown in Fig. 4).
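The choice among the three display types can be pictured as a dispatch on the shape of the backend response. The sketch below is written in Python for consistency with the other examples, although the real frontend is a web application. The response field names (predictions, boxes, image_url) are hypothetical, since the paper does not specify the response format.

```python
from enum import Enum
from typing import Any, Dict


class Display(Enum):
    IMAGE_TABLE = "image & table"
    IMAGE_GRAPH = "image & graph"
    IMAGE_IMAGE = "image & image"


def choose_display(response: Dict[str, Any]) -> Display:
    """Pick a display component from a backend response (field names assumed)."""
    if "image_url" in response:
        # The backend already rendered the result; just load the second image.
        return Display.IMAGE_IMAGE
    if "boxes" in response:
        # The frontend draws the shapes on a canvas over the original image.
        return Display.IMAGE_GRAPH
    if "predictions" in response:
        # Pairs of label and confidence go into a table.
        return Display.IMAGE_TABLE
    raise ValueError("unrecognized response shape")
```

Under these assumptions, a classification response maps to the table view, a detection response to the canvas view, and a segmentation or synthesis response to the side-by-side image view.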

Fig. 4. A segmentation example with image to image component.

Pages. A page of CVTron Visualization is built from several functional components, which cooperate to complete a series of related tasks such as configuring and triggering the training process. Following the design principles, there are four pages: the setting page, the hardware information page, the training page and the testing page. In the setting page, users can enter the IP address and port of the node they want to use, and can also select the language and other configurations. The information page displays information about the node, such as its status and hardware. The most important pages are the training page and the testing page. The training page provides interactive steps that lead users to launch a training task. It consists of the upload component, plain component, table component and line chart component, which use data visualization to track the training process. Users can choose how to observe the changes of various indicators during training (as shown in Fig. 6) according to their needs. The testing page mainly combines the upload component, inference component and display components. Similar to the training page, it guides users to choose their task and model and then returns the predicted results. Depending on the form of the final results, the main display component of the testing page can be image & table, image & graph or image & image, as mentioned above. In general, specific components make up functional pages that accomplish specific tasks; pages are the interfaces that are rendered to users and interact with them. The hierarchy of components and pages is illustrated in Fig. 5. For clarity, not all components and pages are included in Fig. 5.

Fig. 5. The hierarchy of pages and components.

Services. Services are the bridge between the backend server and the frontend interface. Three types of service handle all the tasks in CVTron Web: file upload, file upload with configurations, and simple query from the backend. File upload and file upload with configurations are wrapped in an HTTP post request, while the query service is wrapped in a get request.

  1. File Upload Service. Almost all computer vision tasks require the client to upload an image to the backend. The File Upload Service handles this upload and retrieves the inference result. When the server receives this kind of request, it reads the input image and predicts a corresponding result, which is immediately returned to the frontend.

  2. File Upload with Configuration Service. A training task usually needs a dataset file and many configurations. In the frontend, the configurations are formatted as a JSON object, which is uploaded to the backend server along with the dataset file. The backend then immediately returns the location of the training log files, which enables the frontend to keep requesting the log files during training and to render pages for monitoring.

  3. Query Service. As stated before, the frontend needs to query the log files during the training process. It also needs to query the device information, general configurations and so on. All these requests are wrapped in a get request to the corresponding endpoint. In response, the backend looks up the requested content and returns the information.
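The three service types can be summarized as request descriptions, sketched below without any network code. ServiceRequest, the function names, the form field name config and the /v1/ paths are assumptions for illustration. In particular, the paper does not state which endpoint receives training uploads, so the sketch reuses the task endpoint.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ServiceRequest:
    """Description of one HTTP request issued by the frontend (hypothetical)."""
    method: str
    path: str
    file_name: Optional[str] = None
    fields: Dict[str, str] = field(default_factory=dict)


def file_upload(task: str, image_name: str) -> ServiceRequest:
    # Inference: a post request carrying a single image.
    return ServiceRequest("POST", f"/v1/{task}", file_name=image_name)


def file_upload_with_config(
    task: str, dataset_name: str, config: Dict[str, Any]
) -> ServiceRequest:
    # Training: a post request carrying a dataset file plus a JSON configuration.
    payload = {"config": json.dumps(config, sort_keys=True)}
    return ServiceRequest("POST", f"/v1/{task}", file_name=dataset_name, fields=payload)


def query(resource: str) -> ServiceRequest:
    # Logs, device information and configuration: a plain get request.
    return ServiceRequest("GET", f"/v1/{resource}")
```

For example, file_upload_with_config("segmenter", "data.zip", {"epochs": 10, "lr": 0.001}) describes a training request, after which the frontend would repeatedly issue query requests for the log location returned by the backend.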

Fig. 6. The training page with a line chart showing the changes of MIOU.

4 Conclusion and Further Work

CVTron Web is a versatile and flexible framework for web based computer vision tasks. It consists of three modifiable components and can be easily extended or modified. It allows those with little programming or deep learning knowledge to build computer vision services easily, while still enabling those with solid knowledge to extend and modify it. In addition, it groups computer vision tasks into three types according to their output: image to table, image to graph and image to image. This grouping is helpful for the future implementation of other computer vision tasks.

The current system has some issues as well. The distributed computing system always assigns jobs to the first available node and is therefore not a true load balancing system.

In the future, we will extend the system to other tasks, such as GAN [5] and content-aware image resizing [6]. The evaluation of the training process is also a direction for further exploration. Last but not least, the implementation of distributed computing needs to be improved for commercial use.