1 Introduction

Measured, for example, by the interest and participation of industry at the annual NIPS conference (footnote 1), it is safe to say that deep learning [49] has successfully transitioned from pure research to application [32]. Major research challenges still exist, e.g. in the areas of model interpretability [39] and robustness [1], or general understanding [53] and stability [25, 67] of the learning process, to name a few. Yet another challenge is quickly becoming relevant: in the light of more than 180 deep learning publications per day in the last year (footnote 2), the growing number of deep learning engineers as well as prospective researchers in the field need to be educated on best practices and on what works and what doesn’t “in the wild”. This information is usually underrepresented in publications of a field that is very competitive and thus strives above all for novelty and benchmark-beating results [38]. Moreover, with a notable exception [20], the field lacks authoritative and detailed textbooks by leading representatives. Learners are thus left with preprints [37, 57], cookbooks [44], code (footnote 3) and older gems [28, 29, 58] to find much needed practical advice.

In this paper, we contribute to closing this gap between cutting-edge research and application in the wild by presenting case-based best practices. Based on a number of successful industry-academic research & development collaborations, we report what specifically enabled success in each case, alongside open challenges. The presented findings (a) come from real-world, business case-backed use cases beyond purely academic competitions; (b) deliberately go beyond what is usually reported in our research papers in terms of tips & tricks, thus complementing them with the stories behind the scenes; (c) also include what did not work despite contrary intuition; and (d) have been selected to be transferable as lessons learned to other use cases and application domains. The intended effect is twofold: more successful applications, and increased applied research in the areas of the remaining challenges.

We organize the main part of this paper by case studies to tell the story behind each undertaking. Per case, we briefly introduce the application as well as the specific (research) challenge behind it; sketch the solution (deferring details to other publications, as the final model architecture etc. is not the focus of this work); highlight which measures beyond textbook knowledge and published results were necessary to arrive at the solution; and show, wherever possible, examples of the arising difficulties to exemplify the challenges. Section 2 introduces a face matching application and the number of surrounding models needed to make it practically applicable. Likewise, Sect. 3 describes the additional amount of work needed to deploy a state-of-the-art machine learning system into the wider IT system landscape of an automated print media monitoring application. Section 4 discusses interpretability and class imbalance issues when applying deep learning to image-based industrial quality control. In Sect. 5, measures to cope with the instability of the training process of a complex model architecture for large-scale optical music recognition are presented, and the class imbalance problem makes a second appearance. Section 6 reports on practical ways to apply deep reinforcement learning to complex strategy game play with huge action and state spaces in non-stationary environments. Finally, Sect. 7 presents first results on comparing practical automated machine learning systems with the scientific state of the art, hinting at the usefulness of simple baseline experiments. Section 8 summarizes the lessons learned and gives an outlook on future work on deep learning in practice.

2 Face Matching

Designing, training and testing deep learning models for application in face recognition comes with all the well-known challenges, such as choosing the architecture, setting hyperparameters, creating a representative training/dev/test dataset, preventing bias or overfitting of the trained model, and more. Nevertheless, very good results have been reported in the literature [9, 42, 50]. Although the challenges under lab conditions are not to be taken lightly, a new set of difficulties emerges when deploying these models in a real product. Specifically, during development, it is known what to expect as input in the controlled environment. When the models are integrated in a product that is used “in the wild”, however, all kinds of input can reach the system, making it hard to maintain consistent and reliable predictions. In this section, we report on approaches to deal with related challenges in developing an actual face-ID verification product.

Fig. 1.

Schematic representation of a face matching application with ID detection, anti-spoofing and image quality assessment. For any pair of input images (selfie and ID document), the output is the match probability and the type of ID document, provided no anomaly or attack has been detected. Note that each box contains one or several deep learning (DL) models with various (convolutional) architectures.

Although the core functionality of such a product is to quantify the match between a person’s face and the photo on the given ID, more functionality is needed to make the system perform its task well, most of it hidden from the user. Thus, in addition to the actual face matching module, the final system contains at least the following machine learnable modules (see Fig. 1):

 

Image orientation detection:

When a user takes a photo of the ID on a flat surface using a mobile phone, the image orientation is often arbitrary. A deep learning method predicts the orientation angle, which is then used to rotate the image into the correct orientation.

Image quality assessment:

consists of an ensemble of analytical functions and deep learning models that test whether the photo quality is sufficient for a reliable match. It also guides the user in improving the picture-taking process in case of bad quality.

User action prediction:

uses deep learning to predict the action performed by the user in order to guide the system’s workflow, e.g. taking a selfie, presenting an ID, or doing something wrong during the sequence.

Anti-spoofing:

is an essential module that uses various methods to detect whether a person is showing their real face or trying to fool the system with a photo, video or mask. It consists of an ensemble of deep learning models.

For a commercial face-ID product, the anti-spoofing module is both the most crucial for success and technically the most challenging; the following discussion therefore focuses on anti-spoofing in practice. Face matching and recognition systems are vulnerable to spoofing attacks made with non-real faces, because they are not per se able to detect whether a face is “live” or “not live”, given only a single image as input in the worst case. If control over this input is out of the system’s reach, e.g. for product management reasons, it is easy to fool the face matching system by showing a photo of a face from a screen or printed on paper, a video or even a mask. To guard against such spoofing, a secure system needs to be able to perform liveness detection. We highlight the methods we use for this task in order to show the additional complexity of applying face recognition in a production environment compared to lab conditions.

Fig. 2.

Samples from the CASIA dataset [66], where photos 1, 2, and 3 on the left-hand side show a real face, photo 4 shows a replay attack from a digital screen, and photos 5 and 6 show replay attacks from print.

One of the key features of spoofed images is that they usually can be detected because of degraded image quality: when taking a photo of a photo, the quality deteriorates. However, with high quality cameras in modern mobile phones, looking at image quality only is not sufficient in the real world. How then can a spoof detector be designed that approves a real face from a low quality grainy underexposed photo taken by an old \(640 \times 480\) web cam, and rejects a replay attack using a photo from a retina display in front of a 4K video camera (compare Fig. 2)?

Most of the many spoofing detection methods proposed in the literature use hand-crafted features followed by shallow learning techniques, e.g. SVMs [18, 30, 34]. These techniques mainly focus on texture differences between real and spoofed images, differences in color space [7], Fourier spectra [30], or optical flow maps [6]. In more recent work, deep learning methods have been introduced [3, 31, 63, 64]. Most methods have in common that they attempt to be a one-size-fits-all solution, classifying all incoming cases with one method. This might be facilitated by the available datasets: to develop and evaluate anti-spoofing tools, datasets such as CASIA [66], MSU-USSA [43], and the Replay Attack Database [12] exist. Although these datasets are challenging, they turn out to be too easy compared to the input in a production environment.

The main differences between real cases and training examples from these benchmark databases are that the latter have been created with a low variety of hardware devices and only few different locations and light conditions. Moreover, the quality of images throughout the training sets is quite consistent, which does not reflect real input. In contrast, the images that the system receives “in the wild” exhibit the widest possible range of hardware and environmental conditions, making the anticipation of new cases difficult. Designing a single system that can classify all such cases with high accuracy therefore seems unrealistic.

We thus create an ensemble of experts, forming a final verdict from 3 independent predictions: the first method consists of 2 patch-based CNNs, one for low-resolution and one for high-resolution images. They operate on fixed-size tiles from the unscaled input image using a sliding window. This technique proves to be effective for both low- and high-quality input. The second method uses over 20 image quality measures as features combined with a classifier; it is still very effective when the input quality is low. The third method uses an RNN with LSTM cells to conduct a joint prediction over multiple frames (if available). It is effective in discriminating the micro movements of a real face from (simple) translations and rotations of a fake face, e.g. from a photo on paper or screen. All methods return a real vs. fake probability. The outputs of all 3 methods are fed as input features to a final decision tree classifier. This ensemble of deep learning models is experimentally determined to be much more accurate than any known method used individually.
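
As an illustration of this late fusion step, a minimal sketch is given below: assuming each expert already returns a real-vs-fake probability, a shallow decision tree is trained on the three scores. The synthetic scores merely stand in for the actual patch-CNN, quality-measure and LSTM experts.

```python
# Minimal sketch of the late-fusion idea: three expert probabilities fused
# by a shallow decision tree. The synthetic scores below are stand-ins only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)                            # 1 = real face, 0 = spoof attempt

# stand-in expert outputs: three noisy probabilities per sample
p_patch   = np.clip(y + rng.normal(0, 0.35, n), 0, 1)     # patch-based CNNs
p_quality = np.clip(y + rng.normal(0, 0.45, n), 0, 1)     # image-quality features + classifier
p_motion  = np.clip(y + rng.normal(0, 0.40, n), 0, 1)     # LSTM over frame sequences
X = np.stack([p_patch, p_quality, p_motion], axis=1)

# final verdict: a shallow decision tree over the three expert probabilities
fusion = DecisionTreeClassifier(max_depth=3, random_state=0)
fusion.fit(X[:800], y[:800])
print("held-out accuracy:", fusion.score(X[800:], y[800:]))
```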

Note that as attackers are inventive and quickly come up with new ways to fool the system, it is important to update the models with new data regularly and promptly.

Fig. 3.

Good (a) and bad (b) segmentations (blue lines denote crop marks) for realistic pages, depending on the freedom in the layout. Image (c) shows a non-article page that is excluded from automatic segmentation. (Color figure online)

3 Print Media Monitoring

Content-based print media monitoring serves the task of delivering cropped digital articles from printed newspapers to customers based on their pre-formulated information need (e.g., articles about their own coverage in the media). For this form of article-based information retrieval, it is necessary to segment tens of thousands of newspaper pages into articles daily. We successfully developed neural network-based models to learn how to segment pages into their constituting articles and described their details elsewhere [35, 57] (see example results in Fig. 3a–b). In this section, we present challenges faced and learnings gained from integrating a respective model into a production environment with strict performance and reliability requirements.

Exclusion of Non-article Pages. A common problem in print segmentation is special pages whose content does not represent articles in the usual sense, for example classified ads, readers’ letters, TV programs, share prices, or sports results (see Fig. 3c). Segmentation rules for such pages can be complicated, subjective, and of little value for general use cases. We thus utilize a random forest-based classifier on handcrafted features to detect such content and avoid feeding the respective pages to the general segmentation system, saving compute time.
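
A minimal sketch of such a page filter is given below, assuming a handful of handcrafted page-level features; the feature names and toy values are illustrative, not the production feature set.

```python
# Sketch of the non-article page filter: a random forest over handcrafted
# page-level features (illustrative features and toy training data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# each row: [text_density, n_text_blocks, mean_font_size, table_ratio, image_ratio]
X = np.array([
    [0.62, 14, 11.0, 0.05, 0.10],   # regular article page
    [0.91, 210, 6.5, 0.70, 0.01],   # share prices / sports results
    [0.80, 95, 7.0, 0.40, 0.02],    # classified ads
    [0.55, 9, 12.0, 0.02, 0.25],    # regular article page
])
y = np.array([0, 1, 1, 0])          # 1 = exclude from automatic segmentation

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def should_segment(page_features):
    """Only pages predicted as regular article pages are sent to the segmentation model."""
    return clf.predict([page_features])[0] == 0
```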

Model Management. One advantage of an existing manual segmentation pipeline is the abundance of high quality, labeled training data being produced daily. To utilize this constant flow of data, we have started implementing an online learning system [52] where results of the automatic segmentation can be corrected within the regular workflow of the segmentation process and fed back to the system as training data.

After training, an important business decision is the final configuration of a model, e.g. determining a good threshold for cuts to trade off precision against recall, or deciding how many different models should be used for the production system. We determined experimentally that it is more effective to train different models for different publishers: the same publisher often uses a similar layout even across different newspapers and magazines, while differences between publishers are considerable. To simplify the management of these different models, they are decoupled from the code, which is helpful for rapid development and experimentation.
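
How such a cut threshold could be chosen per publisher from validation data is sketched below; the scores, labels and the 95% precision requirement are assumptions for illustration, not the actual business constraint.

```python
# Sketch: choose a cut threshold from validation data by trading off
# precision against recall (toy scores and labels).
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: 1 where a cut (article boundary) really exists; scores: model confidences
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.3, 0.8, 0.65, 0.4, 0.2, 0.7, 0.55, 0.85, 0.1])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# example business decision: require at least 95% precision, then maximize recall
ok = precision[:-1] >= 0.95
best = np.argmax(recall[:-1] * ok) if ok.any() else np.argmax(precision[:-1])
print("chosen threshold:", thresholds[best])
```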

Fig. 4.

Architecture of the overall pipeline: the actual model is encapsulated in the “FCNN-based article segmentation” block. Several other systems are required to warrant full functionality: (a) the Proxy is responsible for controlling data input to and output from the segmentation model; (b) RabbitMQ controls the workflow as a message broker; (c) MongoDB stores all segmentation results and metrics; (d) the Lectorate UI visualizes results for human assessment and is used to create training data.

Technological Integration. For smooth development and operation of the neural network application, we have chosen a containerized microservices architecture [14] utilizing Docker [62] and RabbitMQ [26]. This decoupled architecture (see Fig. 4) brings several benefits, especially for machine learning applications: (a) a separation of concerns between research, ops and engineering tasks; (b) decoupling of models/data from code, allowing for rapid experimentation and high flexibility when deploying the individual components of the system; this is further improved by a modern devops pipeline consisting of continuous integration (CI), continuous deployment (CD), and automated testing; (c) infrastructure flexibility, as the entire pipeline can be deployed to an on-premise data center or to the cloud with little effort; furthermore, the use of Nvidia-docker [62] allows GPU computing to be utilized easily on any infrastructure; (d) precise control and monitoring of every component in the system, made easy by data streams that enable the injection and extraction of data such as streaming event arguments, log files, and metrics at any stage of the pipeline; and (e) easy scaling of the various components to fit different use cases (e.g. training, testing, experimenting, production), since every scenario requires a certain configuration of the system for optimal performance and resource utilization.
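
As a sketch of this decoupling, a minimal worker that consumes page-segmentation tasks from the message broker could look as follows; queue names and message format are assumptions, not the actual production setup.

```python
# Sketch of a segmentation worker behind the RabbitMQ broker (pika client);
# queue names and the JSON message format are illustrative assumptions.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="pages_in", durable=True)
channel.queue_declare(queue="segments_out", durable=True)

def on_page(ch, method, properties, body):
    task = json.loads(body)                              # e.g. {"page_id": ..., "image_uri": ...}
    result = {"page_id": task["page_id"], "cuts": []}    # placeholder for FCNN inference
    ch.basic_publish(exchange="", routing_key="segments_out", body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)                      # one page per worker at a time
channel.basic_consume(queue="pages_in", on_message_callback=on_page)
channel.start_consuming()
```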

4 Visual Quality Control

Manual inspection of medical products for in-body use like balloon catheters is time-consuming, tiring and thus error-prone; a semi-automatic solution with high precision is therefore sought. In this section, we present a case study of deep learning for visual quality control of industrial products. While this seems to be a standard use case for a CNN-based approach, the task differs in several interesting respects from standard image classification settings:

Fig. 5.

Balloon catheter images taken under different optical conditions, exposing (left to right) high reflections, low defect visibility, strong artifacts, and a good setup.

Data collection and labeling are among the most critical issues in most practical applications. Detectable defects in our case appear as small anomalies on the surface of transparent balloon catheters, such as scratches, inclusions or bubbles. Recognizing such defects on a thin, transparent and reflective plastic surface is visually challenging even for expert operators, who sometimes resort to a microscope to identify the defects manually. Thus, approx. 50% of the 2-year project duration was spent on finding and verifying the optimal optical settings for image acquisition. Figure 5 depicts the results of different optical configurations for such image acquisition. Finally, operators have to be trained to produce consistent labels usable for a machine learning system. In our experience, labeling quality rises if all involved parties have a basic understanding of the methods. This helps considerably to avoid errors such as labeling a defect only on the first image of a series of shots taken while rotating a balloon: while this is perfectly reasonable from a human perspective (once spotted, the human easily tracks the defect while the balloon moves), it is a no-go for the per-image application of a CNN.

Network and training design for practical applications faces challenges such as class imbalance, small data regimes, and use case-specific learning targets beyond standard classification settings that make non-standard loss functions necessary (see also Sect. 5). For instance, in the current application, we are looking for relatively small defects on technical images. Therefore, architectures proposed for large-scale natural image classification, such as AlexNet [27], GoogLeNet [59], ResNet [24] and modern variants, are not necessarily successful, and the respective architectures have to be adapted to learn the relevant task. Potential solutions for the class imbalance problem are, for example (see the sketch following this list):

  • Down-sampling the majority class

  • Up-sampling the minority class via image augmentation [13]

  • Using pre-trained networks and applying transfer learning [41]

  • Increasing the weight of the minority class in the optimization loss [8]

  • Generating synthetic data for the minority class using SMOTE [11] or GANs [21]
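
A minimal PyTorch sketch of two of these countermeasures, assuming a binary “ok” vs. “defect” setting with a heavily under-represented defect class; the synthetic tensors stand in for the real catheter images.

```python
# Sketch of (1) minority up-sampling and (2) a class-weighted loss in PyTorch;
# the synthetic tensors below merely stand in for the real catheter images.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# toy stand-in: 990 "ok" samples (class 0), 10 "defect" samples (class 1)
labels = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])
images = torch.randn(1000, 1, 64, 64)
dataset = TensorDataset(images, labels)

class_counts = torch.bincount(labels).float()

# (1) up-sample the minority class: draw samples inversely to class frequency
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# (2) increase the weight of the minority class in the optimization loss
loss_fn = nn.CrossEntropyLoss(weight=class_counts.sum() / (2.0 * class_counts))
```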

Selecting a data augmentation approach suitable for the task is a necessity for success. For instance, in the present case, axial scratches are more important than radial ones, as they can lead to the balloon tearing and subsequently remaining, potentially lethally, in a patient’s body. Thus, using \(90^{\circ }\) rotation for data augmentation could be fatal. Information like this is only gained in close collaboration with domain experts.
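
A correspondingly constrained augmentation pipeline could look as follows (a sketch assuming torchvision; the concrete transforms and magnitudes are illustrative). The important part is what is left out: rotations that would turn axial scratches into radial ones.

```python
# Sketch of a domain-aware augmentation pipeline (torchvision); note what is
# deliberately omitted: no 90-degree rotations that would swap axial and
# radial scratch orientations. Magnitudes are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                    # preserves axial vs. radial orientation
    transforms.RandomRotation(degrees=5),                 # small angular jitter only
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.ToTensor(),
])
```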

Fig. 6.
figure 6

Visualizing VGG19 feature responses: the first row contains two negative examples (healthy patients) and the second row two positives (containing anomalies). All depicted samples are correctly classified.

Interpretability of models received considerable attention recently, spurring hopes both of users for transparent decisions, and of experts for “debugging” the learning process. The latter might lead for instance to improved learning from few labeled examples through semantic understanding of the middle layers and intermediate representations in a network. Figure 6 illustrates some human-interpretable representations of the inner workings of a CNN on the recently published MUsculoskeletal RAdiographs (MURA) dataset [45] that we use here as a proxy for the balloon dataset. We used guided-backpropagation [56] and a standard VGG19 network [55] to visualize the feature responses, i.e. the part of the X-ray image on which the network focuses for its decision on “defect” (e.g., broken bone, foreign object) or “ok” (natural and healthy body part). It can be seen that the network mostly decides based on joints and detected defects, strengthening trust in its usefulness. We described elsewhere [2] that this visualization can be extended to an automatic defense against adversarial attacks [21] on deployed neural networks by thresholding the local spatial entropy [10] of the feature response. As Fig. 7 depicts, the focus of a model under attack widens considerably, suggesting that it “doesn’t know where to look” anymore.
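
The entropy-based check can be sketched as follows: the local spatial entropy of a saliency or feature-response map is computed over small windows and its mean is thresholded. Window size and threshold below are illustrative, not the values used in [2].

```python
# Sketch of a local-spatial-entropy check on a 2D saliency/feature-response
# map in [0, 1]; window size, bin count and threshold are illustrative.
import numpy as np

def local_spatial_entropy(feature_map, window=16, bins=16):
    h, w = feature_map.shape
    entropies = []
    for y in range(0, h - window + 1, window):
        for x in range(0, w - window + 1, window):
            patch = feature_map[y:y + window, x:x + window]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]
            entropies.append(-(p * np.log2(p)).sum())     # Shannon entropy per window
    return np.array(entropies)

saliency = np.random.rand(224, 224)          # stand-in for a guided-backprop map
score = local_spatial_entropy(saliency).mean()
is_suspicious = score > 3.0                  # widened focus -> higher entropy -> possible attack
```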

Fig. 7.

Input, feature response and local spatial entropy for clean and adversarial images, respectively. We used VGG19 to estimate predictions and the fast gradient sign method (FGSM) [21] to compute the adversarial perturbation.

5 Music Scanning

Optical music recognition (OMR) [46] is the process of translating an image of a page of sheet music into a machine-readable, structured format like MusicXML. Existing products exhibit a symbol recognition error rate that is an order of magnitude too high for automatic transcription under professional standards, but do not yet leverage the computer vision capabilities of deep learning. In this section, we therefore report on the implementation of a deep learning approach to detect and classify all musical symbols on a full page of written music in one go, and integrate our model into the open source system Audiveris (footnote 4) for the semantic reconstruction of the music. This enables products like digital music stands based on active sheets, as most of today’s music is stored in image-based PDF files or on paper.

We highlight four typical issues when applying deep learning techniques to practical OMR: (a) the absence of a comprehensive dataset; (b) the extreme class imbalance present in written music with respect to symbols; (c) the issues of state-of-the-art object detectors with music notation (many tiny and compound symbols on large images); and (d) the transfer from synthetic data to real world examples.

Fig. 8.

Symbol classes in DeepScores with their relative frequencies (red) in the dataset. (Color figure online)

Synthesizing Training Data. The notorious data hunger of deep learning has led to a strong dependence of results on large, well-annotated datasets, such as ImageNet [48] or PASCAL VOC [16]. For music object recognition, no such dataset has been readily available. Since labeling data by hand is not a feasible option, we put a one-year effort into synthesizing realistic (i.e., semantically and syntactically correct music notation) data and the corresponding labeling from renderings of publicly available MusicXML files, and recently open sourced the resulting DeepScores dataset [60].

Dealing with Imbalanced Data. While typical academic training datasets are nicely balanced [16, 48], this is rarely the case in datasets sourced from real world tasks. Music notation (and therefore DeepScores) shows an extreme class imbalance (see Fig. 8). For example, the most common class (note head black) contains more than 55% of the symbols in the entire dataset, and the top 10 classes contain more than 85% of the symbols. At the other extreme, there is a class which is present only once in the entire dataset, making its detection by pattern recognition methods nearly impossible (a “black swan” is no pattern). However, symbols that are rare are often of high importance in the specific pieces of music where they appear, so simply ignoring the rare symbols in the training data is not an option. A common way to address such imbalance is the use of a weighted loss function, as described in Sect. 4.

This is not enough in our case: first, the imbalance is so extreme that naively reweighing loss components leads to numerical instability; second, the signal of these rare symbols is so sparse that it will get lost in the noise of the stochastic gradient descent method [61], as many symbols will only be present in a tiny fraction of the mini batches. Our current answer to this problem is data synthesis [37], using a three-fold approach to synthesize image patches with rare symbols (cp. Fig. 8): (a) we locate rare symbols which are present at least 300 times in the dataset, and crop the parts containing those symbols including their local context (other symbols, staff lines etc.); (b) for rarer symbols, we locate a semantically similar but more common symbol in the dataset (based on some expert-devised notion of symbol similarity), replace this common symbol with the rare symbol and add the resulting page to the dataset. This way, synthesized sheets still have semantic sense, and the network can learn from syntactically correct context symbols. We then crop patches around the rare symbols similar to the previous approach; (c) for rare symbols without similar common symbols, we automatically “compose” music containing those symbols.

Then, during training, we augment each input page in a mini batch with 12 randomly selected synthesized crops of rare symbols (of size \(130 \times 80\) pixels) by putting them in the margins at the top of the page. This way, the neural network (in expectation) does not need to wait more than 10 iterations to see every class present in the dataset. Preliminary results show improvement, though more investigation is needed: overfitting on extremely rare symbols is still likely, and questions remain regarding how to integrate the concept of patches (in the margins) with the idea of a full page classifier that considers all context.
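
A minimal sketch of this margin augmentation, assuming greyscale pages as numpy arrays (white = 255) and a pool of pre-cropped \(130 \times 80\) rare-symbol patches; the corresponding ground-truth boxes would have to be added analogously.

```python
# Sketch: paste randomly selected rare-symbol crops into the top margin of a
# page (assumed greyscale numpy arrays; crop pool and page are stand-ins).
import numpy as np

def add_rare_symbol_crops(page, crop_pool, n_crops=12, crop_h=130, crop_w=80, rng=None):
    rng = rng or np.random.default_rng()
    page = page.copy()
    step = page.shape[1] // n_crops
    for i in range(n_crops):
        crop = crop_pool[rng.integers(len(crop_pool))]
        x = i * step
        width = min(crop_w, page.shape[1] - x)
        page[0:crop_h, x:x + width] = crop[:, :width]
        # note: the matching ground-truth boxes/classes must be added as well
    return page

page = np.full((2000, 1400), 255, dtype=np.uint8)                # blank stand-in page
crop_pool = [np.zeros((130, 80), dtype=np.uint8) for _ in range(50)]
augmented = add_rare_symbol_crops(page, crop_pool)
```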

Fig. 9.

Schematic of the Deep Watershed Detector model with three distinct output heads. N and M are the height and width of the input image, \(\#{\mathrm {classes}}\) denotes the number of symbols and \(\#{\mathrm {energy\_levels}}\) is a hyperparameter of the system.

Enabling and Stabilizing Training. We initially used state-of-the-art object detection models like Faster R-CNN [47] to attempt detection and classification of musical symbols on DeepScores. These algorithms are designed to work well on the prevalent datasets, which are characterized by low-resolution images containing a few large objects. In contrast, DeepScores consists of high-resolution musical sheets containing hundreds of very small objects, amounting to a very different problem [60]. This disconnect led to very poor out-of-the-box performance of said systems.

Region proposal-based systems scale badly with the number of objects present in a given image, by design. Hence, we designed the Deep Watershed Detector as an entirely new object detection system based on the deep watershed transform [4] and described it in detail elsewhere [61]. It detects raw musical symbols (e.g., not a compound note, but note head, stem and flag individually) in their context, with a full sheet music page as input. As depicted in Fig. 9, the underlying neural network architecture has three output heads on the last layer, each pertaining to a separate (pixel-wise) task: (a) predicting the underlying symbol’s class; (b) predicting the energy level (i.e., the degree of belonging of a given pixel location to an object center, also called “objectness”); and (c) predicting the bounding box of the object.

Initially, the training was unstable, and we observed that the network did not learn well if it was trained directly on the combined weighted loss. Therefore, we now train the network on each of the three tasks separately. We further observed that while the network is trained on bounding box prediction and classification, the energy level predictions get worse. To avoid this, the network is fine-tuned only for the energy level loss after being trained on all three tasks. Finally, the network is retrained on the combined task (the sum of all three losses, normalized by their respective running means) for a few thousand iterations, giving excellent results on common symbols.
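
The normalization in the final stage can be sketched as follows (PyTorch-style; the momentum and the example loss values are illustrative): each per-head loss is divided by its running mean so that no single task dominates the combined objective.

```python
# Sketch: combined loss with per-task running-mean normalization (PyTorch);
# momentum and example values are illustrative, not the actual settings.
import torch

running = {"class": 1.0, "energy": 1.0, "bbox": 1.0}
MOMENTUM = 0.99

def combined_loss(loss_class, loss_energy, loss_bbox):
    losses = {"class": loss_class, "energy": loss_energy, "bbox": loss_bbox}
    total = 0.0
    for name, value in losses.items():
        # update the running mean of this task's loss ...
        running[name] = MOMENTUM * running[name] + (1 - MOMENTUM) * float(value)
        # ... and let each task contribute a term of roughly O(1) to the sum
        total = total + value / running[name]
    return total

# in the training loop the three values come from the three output heads;
# requires_grad=True only makes this standalone example backpropagatable
loss = combined_loss(torch.tensor(2.3, requires_grad=True),
                     torch.tensor(0.04, requires_grad=True),
                     torch.tensor(11.0, requires_grad=True))
loss.backward()
```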

Fig. 10.

Top: part of a synthesized image from DeepScores; middle: the same part, printed on old paper and photographed using a cell phone; bottom: the same image, automatically retrofitted (based on the dark green lines) to the original image coordinates for ground truth matching (ground truth overlaid in neon green boxes). (Color figure online)

Generalizing to Real-World Data. The basic assumption in machine learning that training and test data stem from the same distribution is often violated in field applications. In the present case, domain adaptation is crucial: our training set consists of synthetic sheets created by LilyPond scripts [60], while the final product will work on scans or photographs of printed sheet music. These test pictures can have a wide variety of impairments, such as bad printer quality, torn or stained paper etc. While some work has been published on the topic of domain transfer [19], the results are unsatisfactory. The core idea to address this problem here is transfer learning [65]: the neural network shall learn the core task in the full complexity of music notation from the synthetic dataset (symbols in context due to full page input), and use a much smaller dataset to adapt to the real-world distributions of lighting, printing and defects.

We construct this post-training dataset by carefully choosing several hundred representative musical sheets, printing them with different types of printers on different types of paper, and finally scanning or photographing them. We then use OpenCV’s BFMatcher to align these images with the original musical sheets, so that all the ground truth annotation of the original musical sheet can be reused for the real-world images (see Fig. 10). This way, we get annotated real-looking images “for free” whose statistics are much closer to real-world images than those of DeepScores. With careful tuning of the hyperparameters (especially the regularization coefficient), we get promising, but not perfect, results during the inference stage.
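
A sketch of this alignment step, assuming ORB keypoints, a brute-force matcher and a RANSAC homography; file names and thresholds are illustrative and the production pipeline may differ in detail.

```python
# Sketch: align a photographed sheet with its synthetic rendering so that the
# original ground-truth boxes can be reused (ORB + BFMatcher + homography).
import cv2
import numpy as np

synthetic = cv2.imread("rendered_page.png", cv2.IMREAD_GRAYSCALE)   # LilyPond rendering
photo = cv2.imread("photographed_page.jpg", cv2.IMREAD_GRAYSCALE)   # real-world capture

orb = cv2.ORB_create(nfeatures=5000)
kp1, des1 = orb.detectAndCompute(photo, None)
kp2, des2 = orb.detectAndCompute(synthetic, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# warp the photo into the coordinate frame of the synthetic page
h, w = synthetic.shape
aligned = cv2.warpPerspective(photo, H, (w, h))
```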

6 Game Playing

In this case study, deep reinforcement learning (DRL) is applied to an agent in a multi-player business simulation video game with steadily increasing complexity, comparable to StarCraft or SimCity. The agent is expected to compete with human players in this environment, i.e. to continuously adapt its strategy to challenge evolving opponents. Thus, the agent is required to mimic somewhat general intelligent behavior by transferring knowledge to an increasingly complex environment and adapting its behavior and strategies in a non-stationary, multi-agent environment with large action and state spaces. DRL is a general paradigm, theoretically able to learn any complex task in (almost) any environment. In this section, we share our experiences with applying DRL to the above described competitive environment. Specifically, the performance of a value-based algorithm using Deep Q-Networks (DQN) [36] is compared to a policy gradient method called PPO [51].

Dealing with Competitive Environments. In recent years, astounding results have been achieved by applying DRL in gaming environments. Examples are Atari games [36] and AlphaGo [54], where agents learn human or superhuman performance purely from scratch. In both examples, the environments are either stationary or, if an evolving opponent is present, it does not act simultaneously in the environment; instead, actions are taken in turns. In our environment, multiple evolving players act simultaneously, making changes to the environment that cannot be explained solely by changes in the agent’s own policy. Thus, the environment is perceived as non-stationary from the agent’s perspective, resulting in stability issues in RL [33]. Another source of complexity in our setting is the huge action and state space (see below). In our experiments, we observed that DQN had difficulties learning successful control policies as soon as the environment became more complex in this respect, even without the non-stationarity induced by opponents. On the other hand, PPO’s performance is generally less sensitive to increasing state and action spaces. The impact of non-stationarity on these algorithms is the subject of ongoing work.

Reward Shaping. An obvious reward choice is the current score of the game (or its gain). Yet, in the given environment, scoring and thus any reward based on it is sparse, since it depends on a long sequence of correct actions on the operational, tactical and strategic level. As any rollout of the agent without scoring does not contribute any gain in knowledge, the learning curve is initially flat. To avoid this initial phase of no information gain, intermediate rewards are given to individual actions, leading to faster learning progress with both DQN and PPO.
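
Reward shaping can be implemented as a thin wrapper around the environment, as sketched below; the environment interface and the intermediate-reward table are placeholders for the hand-designed bonuses used in the game.

```python
# Sketch of reward shaping as a thin environment wrapper; the environment
# interface and the bonus table are placeholders, not the actual game logic.
INTERMEDIATE_REWARDS = {
    "build_unit": 0.1,           # illustrative action -> bonus mapping
    "collect_resource": 0.05,
    "invalid_action": -0.02,
}

class ShapedRewardEnv:
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # the sparse score-based reward is complemented by small intermediate
        # rewards for individual actions to avoid a flat initial learning curve
        bonus = INTERMEDIATE_REWARDS.get(info.get("action_name"), 0.0)
        return obs, reward + bonus, done, info
```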

Additionally, it is not sufficient for the agent to find a control policy eventually; it is crucial to find a good policy quickly, as training times are very long anyhow. Usually, comparable agents for learning complex behaviors in competitive environments are trained using self-play [5], i.e., the agents are always trained with “equally good” competitors to be able to succeed eventually. In our setting, self-play is not a straightforward first option, for several reasons: first, to jump-start learning, it is easier in our setting to play without an opponent first and only learn the art of competition later, once a stable ability to act has been reached; second, different from other settings, our agents should be entertaining for human opponents, not necessarily winning. It is thus not desirable to learn completely new strategies that are successful yet frustrating to human opponents. Therefore, we will investigate self-play only after stable initializations from (scripted) human opponents on different levels.

Fig. 11.

Heuristic encoding of actions to prevent combinatorial explosion.

Complex State and Action Spaces. Taking the screen frame (i.e., pixels) as input to the control policy is not applicable in our case. First, the policy’s input needs to be independent of rendering and thus of hardware, game settings, game version etc. Furthermore, a current frame does not satisfy the Markov property, since attributes like “I own item x” are not necessarily visible in it. Instead, some attributes need to be concluded from past experiences. Thus, the state space needs to be encoded into sufficient features, a task we approach with manual pre-engineering.

Next, a post-engineering approach helps to decrease the learning time in the case of DQN by removing unnecessary actions from consideration as follows: in principle, RL algorithms explore any theoretically possible state-action pair in the environment, i.e., any mathematically possible decision in the Markov Decision Process (MDP). In our environment, the available actions depend on the currently available in-game resources of the player, i.e., on the current state. Exploring currently impossible regions in the action space is therefore not efficient and is prevented by a post-engineered decision logic built to block these actions from being selected. This reduces the size of the action space per time step considerably. These rules were crucial for producing first satisfying learning results in our environment using DQN in a stationary setting of the game. However, when training the agent with PPO, hand-engineered rules were not necessary for proper learning.
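
The post-engineered decision logic can be sketched as action masking: Q-values of actions that are impossible in the current state are blocked before the (epsilon-)greedy choice. The resource-based availability check below is a placeholder for the actual game logic.

```python
# Sketch of masking impossible actions before the (epsilon-)greedy choice;
# COSTS and the resource check are placeholders for the actual game rules.
import numpy as np

COSTS = np.array([0, 10, 50, 200])        # illustrative per-action resource costs

def legal_action_mask(state, n_actions):
    """1 for actions affordable with the current in-game resources, else 0."""
    return np.array([state["resources"] >= COSTS[a] for a in range(n_actions)], dtype=float)

def select_action(q_values, state, epsilon=0.05, rng=np.random.default_rng()):
    mask = legal_action_mask(state, len(q_values))
    if rng.random() < epsilon:                        # explore only among legal actions
        return int(rng.choice(np.flatnonzero(mask)))
    masked_q = np.where(mask > 0, q_values, -np.inf)  # block impossible actions
    return int(np.argmax(masked_q))

state = {"resources": 60}                             # can afford actions 0, 1, 2 but not 3
print(select_action(np.array([0.2, 1.3, 0.9, 2.5]), state))
```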

The major problem, however, is the huge action and state space, as it leads to ever longer training times and thus long development cycles. It results from the fact that one single action in our environment might consist of a sequence of sub-decisions. Think e.g. of an action called “attack” in the game of StarCraft, answering the question of WHAT to do (see Fig. 11). It is incompletely defined as long as it does not state WHICH opponent is to be attacked using WHICH unit. In other words, each action itself requires a number of different decisions, chosen from different subcategories. To avoid the combinatorial explosion of all possible completely defined actions, we perform another post-processing step on the resource management: WHICH unit to use against WHICH type of enemy, for example, is hard-coded into heuristic rules.

This case study is work in progress, but it already becomes evident that the combination of the complexity of the task (i.e., acting simultaneously on the operational, tactical and strategic level with exponentially increasing time horizons, as well as a huge state and action space) and the non-stationary environment prevents successful end-to-end learning as in “Pong from pixels” (footnote 5). Rather, it takes manual pre- and post-engineering to arrive at a first agent that learns, and it does so better with policy-based rather than DQN-based algorithms. A next step will explore an explicitly hierarchical learner to cope with the combinatorial explosion of the action space on the three time scales (operational/tactical/strategic) without using hard-coded rules, but instead factorizing the action space into subcategories.

7 Automated Machine Learning

One of the challenging tasks in applying machine learning successfully is to select a suitable algorithm and set of hyperparameters for a given dataset. Recent research in automated machine learning [17, 40] and the respective academic challenges [22] aim precisely at finding a solution to this problem for sets of practically relevant use cases. The respective Combined Algorithm Selection and Hyperparameter (CASH) optimization problem is defined as finding the best algorithm \(A^*\) and set of hyperparameters \(\lambda_*\) with respect to an arbitrary cross-validation loss \(\mathscr{L}\) as follows:

$$\begin{aligned} A^*, \lambda_* = \mathop{\hbox{argmin}}\limits_{A \in \mathscr{A},\, \lambda \in \Lambda_A} \frac{1}{K} \sum_{i=1}^{K} \mathscr{L}(A_\lambda, D_{train}^{(i)}, D_{valid}^{(i)}) \end{aligned}$$

where \(\mathscr{A}\) is a set of algorithms, \(\Lambda_A\) is the set of hyperparameters per algorithm A (together they form the hypothesis space), K is the number of cross-validation folds, and \(D_{train}^{(i)}, D_{valid}^{(i)}\) are the training and validation splits of the given dataset. In this section, we compare two methods from the scientific state of the art (one using Bayesian optimization, the other genetic programming) with a commercial automated machine learning prototype based on random search.
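
To make the objective concrete, the following minimal sketch solves the CASH problem with plain random search (the strategy used by the commercial prototype described below); the algorithm pool, parameter ranges and dataset are illustrative only.

```python
# Sketch: the CASH objective approximated by random search over a small,
# illustrative algorithm/hyperparameter pool (K = 5 cross-validation folds).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for a challenge dataset
rng = np.random.default_rng(0)

def sample_candidate():
    """Draw one (algorithm, hyperparameter) pair from the joint search space."""
    choice = rng.integers(3)
    if choice == 0:
        return RandomForestClassifier(n_estimators=int(rng.integers(10, 500)))
    if choice == 1:
        return SVC(C=10 ** rng.uniform(-3, 3), gamma=10 ** rng.uniform(-4, 1))
    return LogisticRegression(C=10 ** rng.uniform(-3, 3), max_iter=1000)

best_model, best_loss = None, np.inf
for _ in range(25):                           # budget: number of trained candidates
    model = sample_candidate()
    loss = 1.0 - cross_val_score(model, X, y, cv=5).mean()
    if loss < best_loss:
        best_model, best_loss = model, loss
print(best_model, "cv loss:", round(best_loss, 4))
```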

Scientific State-of-the-Art. Auto-sklearn [17] is the most successful automated machine learning framework in past competitions [23]. The algorithm starts by extracting meta-features from the given dataset and finds models that perform well on similar datasets (according to the meta-features) in a fixed pool of stored successful machine learning endeavors. Auto-sklearn then performs meta-learning by initializing a set of model candidates with the model and hyperparameter choices of the k nearest neighbors in dataset space; subsequently, it optimizes their hyperparameters and feature preprocessing pipelines using Bayesian optimization. Finally, an ensemble of the optimized models is built using a greedy search. On the other hand, the Tree-based Pipeline Optimization Tool (TPOT) [40] is a toolbox based on genetic programming. The algorithm starts with random initial configurations including feature preprocessing, feature selection and a supervised classifier. At every step, the top 20% of models are retained and randomly modified to generate offspring. The offspring compete with their parents, and the winning models proceed to the next iteration of the algorithm.

Commercial Prototype. The Data Science Machine (DSM) is currently used in-house for data science projects by a business partner. It uses random sampling of the solution space for optimization. The machine learning algorithms in this system are leveraged from Microsoft Azure and scikit-learn, and can be user-enhanced. DSM can be deployed in the cloud, on-premise, or standalone. The pipeline of DSM includes data preparation, feature reduction, automatic model optimization, evaluation and final ensemble creation. The question is: can it prevail against much more sophisticated systems even at this early stage of development?

Table 1. Comparison of different automated machine learning algorithms.

Evaluation is performed using the protocol of the AutoML challenge [22] for comparability, confined to a subset of ten datasets that is processable by the current DSM prototype (i.e., non-sparse, non-big). It spans the tasks of regression, binary and multi-class classification. For applicability, we constrain the time budget of the searches to the time required by DSM to train 100 models using random algorithm selection. A performance comparison is given in Table 1, suggesting that Bayesian optimization and genetic programming are superior to random search. However, random parameter search led to reasonably good models and useful results as well (also in commercial practice). This suggests room for improvement in actual meta-learning.

8 Conclusions

Does deep learning work in the wild, in business and industry? In the light of the presented case studies, a better question is: what does it take to make it work? Apparently, the challenges are different from those in academic competitions: instead of a given task and a known (but still arbitrarily challenging) environment, defined by data and evaluation metric, real-world applications are characterized by (a) data quality and quantity issues; and (b) unprecedented (thus: unclear) learning targets. This reflects the different nature of the problems: competitions provide a controlled but unexplored environment to facilitate the discovery of new methods; real-world tasks, on the other hand, build on the knowledge of a zoo of methods (network architectures, training methods) to solve a specific, yet not formally specified task, thereby enhancing the method zoo in return in case of success. The following lessons learned can be drawn from our six case studies (section numbers in parentheses refer to the respective details):

 

Data:

acquisition usually needs much more time than expected (Sect. 4), yet is the basis for all subsequent success (Sect. 5). Class imbalance and covariate shift are the norm (Sects. 2, 4, 5).

Understanding:

of what has been learned and how decisions emerge helps both the user and the developer of neural networks to build trust and improve quality (Sects. 4, 5). Operators and business owners need a basic understanding of the methods used in order to produce usable ground truth and provide relevant subject matter expertise (Sect. 4).

Deployment:

should include online learning (Sect. 3) and might involve building up dozens of other machine learning models (Sects. 2, 3) to flank the original core part.

Loss/reward shaping:

is usually necessary to enable learning of very complex target functions in the first place (Sects. 5, 6). This includes encoding expert knowledge manually into the model architecture or training setup (Sects. 4, 6), and handling special cases separately (Sect. 3) using some automatic pre-classification.

Simple baselines:

do a good job of determining the feasibility as well as the potential of the task at hand when final datasets or novel methods are not yet available (Sects. 4, 7). Increasing the complexity of methods and (toy) tasks in small increments helps monitor progress, which is important to effectively debug failure cases (Sect. 6).

Specialized models:

for identifiable sub-problems increase the accuracy of production systems over all-in-one solutions (Sects. 2, 3), and ensembles of experts help where no single method reaches adequate performance (Sect. 2).

Best practices are straightforward to extract on the general level (“plan enough resources for data acquisition”), yet quickly get very specific when broken down to technicalities (“prefer policy-based RL given that ...”). An overarching scheme seems to be that the challenges in real-world tasks need similar amounts of creativity and knowledge to get solved as fundamental research tasks, suggesting they need similar development methodologies on top of proper engineering and business planning.

We identified specific areas for future applied research: (a) anti-spoofing for face verification; (b) the class imbalance problem in OMR; and (c) the slow learning and poor performance of RL agents in non-stationary environments with large action and state spaces. The latter is partially addressed by new challenges like Dota 2 (footnote 6), Pommerman or VizDoom (footnote 7), which, however, do not address hierarchical actions, for example. Generally, future work should include (d) making deep learning more sample efficient to cope with smaller training sets (e.g. by one-shot learning, data or label generation [15], or architecture learning); (e) finding suitable architectures and loss designs to cope with the complexity of real-world tasks; and (f) improving the stability of training and the robustness of predictions, along with (g) the interpretability of neural nets.