Abstract
With the rise of machine learning in production, we need to talk about operational data science. The talk introduces a pragmatic method for measuring the response quality of a recommendation service. To that end, a definition of a successful response is introduced and guidelines for capturing the rate of successful responses are presented.
Several changes can happen during the serving phase of a model that negatively affect the quality of the algorithmic response. A few examples are:
• The model is updated and the new version is inferior to the previous one.
• The latest deployment of the stack that processes the request and serves the model contains a bug.
• Changes in the infrastructure lead to performance loss. An example in an e-commerce setting is switching to a different microservice to obtain article metadata used for filtering the recommendations.
• The input data changes. Typical reasons are a client application that releases a bug (e.g., lowercasing a case-sensitive identifier) or changes a feature in a way that affects the data distribution, such as allowing all users to use the product cart instead of only logged-in users. If the change is not detected, training data and serving data diverge (a minimal check is sketched below).
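To illustrate the last point, here is a minimal sketch, not the approach presented in the talk, of how such a train/serve divergence could be flagged for a single categorical input field; the names and the threshold are hypothetical.

    # A minimal sketch of flagging train/serve divergence for one categorical
    # input field, e.g. an identifier whose casing a client accidentally changed.
    # All names and the 10% threshold are hypothetical.
    from collections import Counter

    def share_of_unknown_values(serving_values, training_vocabulary):
        """Fraction of serving-time values never seen during training."""
        counts = Counter(serving_values)
        unknown = sum(n for value, n in counts.items() if value not in training_vocabulary)
        return unknown / max(sum(counts.values()), 1)

    # Example: a client starts lowercasing a case-sensitive identifier.
    training_vocabulary = {"SKU-A1", "SKU-B2", "SKU-C3"}
    serving_values = ["sku-a1", "SKU-B2", "sku-c3", "sku-a1"]
    if share_of_unknown_values(serving_values, training_vocabulary) > 0.1:
        print("ALERT: serving inputs diverge from the training data")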
Current monitoring solutions mostly focus on whether a request completes without errors and on the request latency. This means the examples above would be hard to detect, even though the response quality is significantly degraded, sometimes permanently.
Beyond not detecting the changes above, it can be argued that current monitoring practices are not sufficient to capture the performance of a recommender system, or any other data-driven service, in a meaningful way. We might, for instance, have returned popular articles as a fallback where personalized recommendations were requested. That response should be recorded as unsuccessful.
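A minimal sketch of that idea, with hypothetical field names: a fallback response completes without errors, but it is not counted as successful.

    # A fallback to popular articles returns HTTP 200 and low latency,
    # yet it should not count as a successful personalized response.
    def is_successful(response):
        """Successful only if personalized recommendations were returned."""
        return bool(response.get("recommendations")) and not response.get("is_fallback", False)

    responses = [
        {"recommendations": ["article-1", "article-2"], "is_fallback": False},
        {"recommendations": ["bestseller-1"], "is_fallback": True},  # counted as unsuccessful
    ]
    success_rate = sum(is_successful(r) for r in responses) / len(responses)
    print(f"success rate: {success_rate:.0%}")  # 50%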
A new paradigm for measuring response quality should fulfil the following criteria:
• comparability across models
• simple and understandable metrics
• real-time collection of measurements
• actionable alerting on problems (a collection sketch follows this list)
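One possible way to collect such measurements in real time is a pair of counters labelled per model and use case; the use of Prometheus client counters and the metric names below are assumptions for illustration, not the setup described in the talk.

    # Hypothetical real-time collection: count all responses and successful
    # responses per model and use case.
    from prometheus_client import Counter

    RESPONSES_TOTAL = Counter(
        "reco_responses_total", "All recommendation responses", ["model", "use_case"]
    )
    RESPONSES_SUCCESSFUL = Counter(
        "reco_responses_successful_total", "Successful recommendation responses", ["model", "use_case"]
    )

    def record_response(model, use_case, successful):
        # One counter pair per model and use case keeps the success rate
        # simple, comparable across models, and easy to alert on.
        RESPONSES_TOTAL.labels(model=model, use_case=use_case).inc()
        if successful:
            RESPONSES_SUCCESSFUL.labels(model=model, use_case=use_case).inc()

Alerting can then be a threshold on the ratio of the two counters over a time window, which keeps the signal simple and comparable across models.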
The response quality is defined as an approximation of how well the response fits the defined business and modelling case. The goal is to bridge the gap between the metrics used during model training and technical monitoring metrics. Ideally, we would like to obtain Service Level Objectives (SLOs) [1] that contain this quality aspect and can be discussed with the different client applications based on their business cases, e.g., "85% of the order confirmation emails contain personalized recommendations based on the purchase."
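To make the email example concrete, a minimal sketch of evaluating such an SLO over a window of responses; the 85% target comes from the example above, while the field names and the window are hypothetical.

    # Evaluate the example SLO over a window of order confirmation emails.
    SLO_TARGET = 0.85

    def slo_met(email_responses):
        """True if the share of emails containing personalized
        recommendations meets the 85% objective."""
        if not email_responses:
            return False
        personalized = sum(1 for r in email_responses if r["personalized"])
        return personalized / len(email_responses) >= SLO_TARGET

    window = [{"personalized": True}] * 90 + [{"personalized": False}] * 10
    print(slo_met(window))  # True: 90% >= 85%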
A case study will illustrate how algorithmic monitoring was introduced in the recommendation team at Zalando. Zalando is one of Europe's largest fashion retailers, and multiple recommendation algorithms serve many of its online and offline use cases. You will see several examples of how the monitoring helped to identify bugs or diagnose quality problems.