1 Introduction

The number of evaluation effectiveness metrics in information access tasks is very large, and growing: in Information Retrieval (IR) alone, more than 100 effectiveness metrics exist, not taking into account the user oriented and Web-oriented ones (Amigó et al. 2014); in clustering various accuracy measures are used, even in official experimental initiatives and sometimes with undesirable properties (Amigó et al. 2009); in filtering the situation is analogous (Amigó et al. 2011); and of course, when considering other tasks the number of used metrics grows even more.

Being the evaluation scenario so rich and complex, it is not surprising that attempts have been made to understand, model, and formalize it. More in detail, researchers have defined formal properties (or axioms, or constraints) that must be satisfied by metrics. This has happened both in the early years (Van Rijsbergen 1974; Bollmann 1984) and more recently, with a renewed interest and several studies published in the last 5 years or so (Amigó et al. 2009, 2011, 2013, 2014, 2015; Moffat 2013; Busin and Mizzaro 2013; Maddalena and Mizzaro 2014; Ferrante et al. 2015; Sebastiani 2015; Ferrante et al. 2017). All these studies have in common a formal attitude: to try to understand in a formal way properties of effectiveness metrics. This paper follows the path of these studies, and addresses the formalization of effectiveness evaluation.

As it will be detailed in the following, this paper differs from previous studies in several respects. First, we aim at a more general approach: we do not focus on a specific task only, as others have done, but we take into account several information access tasks at the same time and we provide a uniform account. In this respect, one issue is that the number of tasks in information access is very large, and their variety is quite high: researchers build systems, for example, to classify tweets, cluster terms, retrieve documents, filter news, recommend movies, summarize texts, etc. To be able to provide a general and systematic account, we make two choices. First, we focus on the tasks that can be modeled by the assignment of a value to each item; most of the above mentioned examples match this description (the only exception being summarization). Second, we make an abstraction effort and we distill these many existing tasks into just four: Classification (assigning a category to each document), Clustering (organising documents into groups), Ranking (sorting documents), and Quantitation (assigning a numeric value to each document). In an attempt to avoid confusion, we try to use a specific terminology: we use the term abstract task to refer to the latter four only (i.e., Classification, Clustering, Ranking, and Quantitation) and we use the term task to refer to all the former ones. Although the terminology is somehow different from what commonly used in the literature, we believe that the small effort is worthwhile, and it allows us to precisely define the four abstract tasks we are addressing.

A second difference from previous studies is that we ground on measurement theory, that provides several useful notions including the measurement scale. We are not by any means the first ones to use measurement theory as a tool. However, we use it in a way that is different from previous work, and that is important to state upfront also to avoid confusion. It is probably intuitive to directly link the notion of metric with measurement: a metric measures system accuracy/effectiveness. This is the approach of the seminal work by Van Rijsbergen (1974) and also of more recent proposals by Ferrante et al. (2017, 2019). However, we do not do so, and in our approach measurement theory is used in a different way: our starting point is the fact that both system outputs and gold standards can be seen as assignments of values to documents, for all four abstract tasks. For example, the output of a classification system can be seen as an assignment of values on the nominal scale type; a rank of documents (a typical search engine output) can be seen as an assignment of values to documents, to be interpreted on the ordinal scale type (i.e., considering only the rank induced by the assigned values); and so on. Thus, measurement theory allows us to model both system outputs and gold standards as assignments of values as well as to state a direct relationship between the abstract tasks and the scale types.

Then, coming to the third difference from previous studies, in our framework an evaluation metric compares two assignments of numbers to documents: one provided by a system, and another one provided by human assessments. We define the effectiveness/accuracy of the system as the closeness of the two assignments. In other terms, measurement theory allows us to distinguish the above four abstract tasks on the basis of the scale used when assigning the values. For example, the nominal scale is used for classification and clustering (in which the task can be modeled as assigning values whose only important properties are equalities and inequalities), the ordinal scale for ranking (in which the order of the values is the important property), and the interval and ratio scales for quantitation (in which differences and ratios are important, respectively).

But we need to add a notion of closeness, that measurement theory does not include. Indeed, we will define two different kinds of closeness: this will allow us to take into account all four abstract tasks, and the related metrics, by just varying two parameters (measurement scale type and kind of closeness). The need for two different kinds of closeness will be detailed in the following, but can be immediately understood at an intuitive level by observing that for both categorization and clustering the scale type is nominal, but the two are clearly different: in the former the values used in the assignments are important (otherwise a misclassification occurs), whereas in the latter the equality and inequality relations are important, so any measurement which is equivalent for the nominal scale type (another notion provided by measurement theory) is adequate.

So, to summarize, we have three aims in this paper. First, to provide a general definition of metric valid for different abstract tasks such as classification, clustering, ranking, or quantitation (i.e., value prediction). This definition is based on measurement theory, but measurement theory is not enough and we also need to capture the concept of closeness, or proximity; we define two kinds of closeness. Second, to make explicit a correspondence between the four information access abstract tasks and a two dimensional space defined by: (i) the family of the metric (defined on the basis of two different approaches to closeness, based on values and equivalence), and (ii) the scale type to be used to analyse the correspondence between system and gold. Third, to state some theorems showing how our general definitions and axioms specialise into the metric properties defined in the literature for each particular information access task. Thus, the theoretical limitations of specific metrics that have been derived from the basic axioms in the literature can also be derived in our framework.

This paper is structured as follows. In Sect. 2 we extensively survey the previous attempts to formalize metrics properties across various abstract tasks. We then turn to defining our framework, which is based on measurement theory. The reader can find in Appendix A a basic background in measurement theory, a well settled discipline that provides the foundations and tools for our formalization. We include basic definitions and some examples. We ground on measurement theory to generalize the notion of evaluation metric in terms of closeness at specific scale types, as discussed in Sect. 3, where two kinds of closeness are defined. Section 4 focuses on effectiveness metrics: we distinguish two families of metrics; for each of the two we provide a formal definition and state some properties as axioms. In Sects. 5 to 7 we exploit the framework: we state some theorems, that formally capture both general metric properties already proposed in the literature and properties of specific metrics, and that can be derived as particular cases from the definitions and axioms presented in our framework (proofs are in Appendix B). We finally show the generality of our framework by applying it to novel metrics and tasks in Sect. 8. Section 9 summarizes the main results, discusses consequences, assumptions and limits of this study, and sketches future work.

2 Related work

Several authors have proposed formal properties of metrics by defining axiomatics focused on a specific task. Terminology needs some clarification. Different authors have used “properties”, “constraints”, or “axioms”, and there is even some debate on which term is correct. In this paper we privilege the last term, although sometimes we also use as synonym one of previous two, especially when describing previous work (that we do usually by using the same terminology as in the original). As already mentioned, we distinguish between the concepts of information access task and abstract task. The former is related with the user context and goals. Examples of tasks could be: searching web pages for generating a report about a topic, recommending products for online sales, spam filtering, sentiment analysis over tweets for reputation analysis, novelty detection in news, etc. All these tasks share the common characteristic that they consists of organising information items (web pages, mails, etc.). In this paper we refer to information items as documents.Footnote 1 An abstract task can be seen as an attempt of formalising the tasks and/or their basic components, and it is related with the characteristics of system outputs and goldstandards (i.e., human judgments/assessments, also called simply golds). We focus on the four abstract tasks listed above (Categorization, Clustering, Ranking, and Quantitation), and we survey the main approaches, grouped by abstract task. We also note some details that are useful in the following of the paper. We call basic axioms those common to different authors and somehow related to the abstract task but independent from the task. Tables 1, 2, and 3 list the main properties and can be a useful reference; some of the notes in the tables are described in Sect. 6.

2.1 Classification axiomatics

Some authors group classification metrics according to their properties. For instance, Ferri et al. (2009) discriminated between probabilistic measures (which consider the deviation from the true probability of errors) and measures based on a qualitative understanding of errors (which focus on the idea of utility). The authors do not offer a formal distinction between probabilistic and qualitative measures.

Table 1 The main classification axioms proposed in the literature, with some notes (discussed in more detail in the text) and their correspondence with the axioms and theorems in our framework (presented in the following of the paper)

More recently, Sebastiani (2015) proposed eight axioms. The first one is the Strict Monotonicity axiom: it states that, given two classification outputs such that they only differ on one decision, i.e., the category of a document, then if one of them is correct on that document, it must be reflected in an increase in its metric score. This idea is also captured by Sokolova’s (2006) properties (see below). Sebastiani proved that the traditional F-measure (based on Precision and Recall) does not satisfy this property, as it fails when components of the contingency matrix have zero value. A similar problem is identified for the metric Lam% (Qi et al. 2010). Note, however, that zero values in components of the contingency matrix represent in general a very particular situation. We provide a slight generalization of MON into a Generalized Strict Monotonicity Axiom (GMON) in the following sections.

Sebastiani’s second axiom, Continuous Differentiability, states that the evaluation measure must be continuous and differentiable over the true positives and true negatives. According to the author, measures fail to satisfy this axiom, again, in the case of zero values in the contingency matrix. Something similar happens with his third and fourth axioms, Strong Definiteness and Weak Definiteness, which state that the measures must be definable under any gold or system output. One might argue that the Strict Monotonicity axiom subsumes these two axioms, because the metric score of every system output must be definable in order to produce a score increase in the Strict Monotonicity conditions.

The fifth axiom (Fixed Range) sets a restriction about the measure value range. The sixth and seventh axioms (Robustness to Chance and Robustness to Imbalance) are related to the idea of probabilistic measures proposed by Ferri et al. (2009), and state that random or trivial classifiers must achieve the same score regardless the goldstandard. Measures such as Accuracy, Utility or F-measure (Precision and Recall), although widely adopted, do not satisfy this property. The reason is that actually, there are situations and user contexts in which not every trivial or random classifier has the same effectiveness. For instance, putting randomly just a few mails in the spam directory is less problematic for the user than putting most of mails. The F-measure tackles this aspect by returning a fixed precision for any random output while increasing recall when returning most of e-mails to the user. The eighth and last axiom, Symmetry, “enforces the notion that the evaluation measure should be invariant with respect to switching the roles of the class and its complement”. Thus, replacing the positive samples by negative ones in both the system output and the gold produces the same classification score. However, again, this axiom is task-dependent and, therefore, not every metric is designed to satisfy it. For instance, class oriented metrics, such as the F-measure combination of Precision and Recall, do not satisfy it. Utility metrics assign a utility weight to each class, so they do not satisfy it. A system correctly labeling as spam 8 out of 10 spam messages would be useful to the user, even if not perfect, but a system correctly labeling as non-spam 8 out of 10 non-spam messages would probably be unacceptable. For this reason, non symmetric metrics such as class oriented metrics are employed in these tasks.

In an earlier paper, Sokolova (2006) proposed a formal categorisation according to a set of properties based on the invariance of measures under a change in the contingency matrix (TP, TN, FP, FN, i.e., True Positive, True Negative, False Positive, and False Negative, respectively).

These properties have a correspondence with axioms proposed by other authors. The invariance under the swap of TP with TN and FP with FN corresponds to Sebastiani’s Symmetry axiom. The second property is the invariance under the change in TN when all other matrix entries remain the same. This property characterises the class-oriented measures (Precision and Recall). If the measure is not sensitive to one of the components, then increasing the amount of returned documents when the document classification is random, can be always beneficial. This property is complementary to the sixth and seventh Sebastiani’s axioms; thus, this property is also task-dependent. The next property is the invariance under the change in FP when all other matrix entries remain the same. The non-invariance is necessary if the measure satisfies the Strict Monotonicity axiom. The fourth and last property is the invariance under the classification scaling. According to the author, it is only satisfied by Precision, which is a partial measure that does not satisfy the Strict Monotonicity axiom.

In summary, we see that some axioms and properties are related, equivalent, or subsumed, and others are task-dependent. The only exception we find is the Strict Monotonicity axiom, which is common across all authors and is generally satisfied by most metrics. According to this analysis, we consider Strict Monotonicity as the unique commonly accepted basic axiom for the classification abstract task.

2.2 Clustering axiomatics

Dom (2001) proposed five formal desirable properties for clustering metrics. These were later extended to seven by Rosenberg and Hirschberg (2007). Basically, this axiomatics consists of stating a bijective correspondence between clusters and classes in the gold. It assumes a set of useful clusters with high correspondence with classes (peer to peer) and a set of noisy (small) clusters. The axioms state that increasing the amount of noisy clusters, splitting or joining useful clusters decrease the score. These properties are implicitly subsumed by the Generalized Homogeneity/Completeness axiom that we instroduce below, given that these movements require breaking correct relationships and increasing the amount of incorrect relationships.

Table 2 The main clustering axioms (same notation as in Table 1)

Meila (2003) proposed an entropy-based metric (Variation Information) and listed twelve desirable properties associated with it. Most of these properties are not directly related to the quality aspects captured by a metric, but rather to other intrinsic features such as the ability to scale or computational cost (Amigó et al. 2009). The exceptions are the properties 4 and 7, related with Cluster size versus Quantity (task-dependent), and properties 10 and 11, related with Completeness.

Amigó et al. (2009) proposed an axiomatics consisting of four constraints that, by focusing on extreme situations in which one system output should outperform another, capture the essence of previously proposed axiomatics (Dom 2001; Meila 2003):

  • Cluster Homogeneity: given a certain system output document distribution, splitting documents that do not belong to the same class must increase the output quality. This restriction was first proposed by Rosenberg and Hirschberg (2007). Although it seems a very basic constraint, measures based on editing distance do not satisfy it (Amigó et al. 2009).

  • Cluster Completeness: the counterpart to the first constraint is that documents belonging to the same class should be in the same cluster. This intuition is also captured by Dom’s constraints. Measures based on set matching, such as Purity and Inverse Purity, do not satisfy this contraint.

  • Rag Bag: introducing disorder into a disordered cluster (rag bag) is less harmful than into a clean cluster. In general, all traditional measures fail to comply with this constraint. However, it can be considered task-dependent, for example, in an early alarm detection task, considering a few related messages in isolation could be crucial.

  • Cluster size versus quantity: a small error in a big cluster is preferable to a large number of small errors in small clusters. This constraint prevents the problem that measures based on counting pairs (Amigó et al. 2009; Meila 2003; Halkidi et al. 2001) overweight big clusters (these measures are sensitive to the combinatory explosion of pairs in big clusters, and fail on this constraint). Although this principle is shared by several authors (Amigó et al. 2009; Dom 2001; Meila 2003), we can consider a task in which this axiom is not mandatory, as one could be interested in penalizing errors in large clusters more than multiple errors in small clusters.

Cluster Homogeneity and Cluster Completeness constraints can be generalized into a single one: given two identical clustering outputs with the exception that the second output contains correct clustering relationships that do not appear in the first output, or it does not contain incorrect clustering relationships that appear in the first system, then the metric must strictly increase. In the following sections we formalise this as Generalized Homogeneity/Completeness (GHC).

In summary, we can conclude that the basic axioms which are shared by different analyses are Completeness and Homogeneity which can be generalized into a unique axiom.

2.3 Ranking axiomatics

Most of the work on ranking axiomatics has been developed in the context of IR (as discussed in the first part of this section). There has been some discussion about ordinal classification too (as discussed in the last part). It is however important to understand that, by being focused on those concrete tasks, researchers have proposed, if not taken for granted, some properties, that do not apply for the abstract task of ranking. For example, it might sound surprising but top-heaviness is not mandatory for the abstract task of ranking. It is possible to imagine concrete ranking tasks that do not reward correctness in earlier rank positions more than in later ones: to provide a simple example, if one has to evaluate an approximate algorithm alphabetically sorting an array, Kendall’s Tau correlation would be a reasonable measure, although it is not top-heavy — and indeed it would not make any sense to reward the sorting algorithms that are more correct for “A” than for “Z”.

In one of the early works on formalizing IR evaluation, Van Rijsbergen (1974) already suggested to use measurement theory (as we do extensively in this paper). However, that seminal work does not state formal properties for simple evaluation measures, but for the combination of them (e.g., the well known F-measure), using the Conjoint Measurement Theory. Other authors tried to exploit measurement theory and to define a notion of similarity between measurements when formalizing IR (Busin and Mizzaro 2013; Maddalena and Mizzaro 2014; Ferrante et al. 2015). This paper extends that idea, but using measurement theory to state definitions and axioms for evaluation measures equally valid for different information access abstract tasks, not just IR.

Table 3 The main ranking axioms (same notation as in Table 1)

Moffat (2013) listed seven properties that IR metrics should satisfy. The first one is Boundedness. It is not about quality; it requires the existence of a bounded range of scores. The second one is called Monotonicity and states that if a ranking of length k is extended so that \(k+1\) elements are included, the metric value never decreases. This second property is task-dependent: in some situations, reducing the size of the returned document list could be useful, for instance if it contains only irrelevant documents. In addition, according to the author, this property gets in contradiction with the next one, the Convergence property, that reflects the basic principle that relevant documents must occur above irrelevant documents in the system output ranking. The property states that swapping two documents in the ranking in concordance with the relevance judgements strictly increases the metric scores. The fourth property, Top-weightedness, explicits that in IR the first positions in the rank are the most important ones. As anticipated at the beginning of this section, this constraint, although widely accepted, is task-dependent, because it is related with the cost of exploring the ranking produced by the system, i.e., the probability that the user actually explores each ranking position. Usually it is reasonable to assume that this probability decreases by going down the rank, but one might imagine situations where this is not true and a very persistent user explores all the ranking positions. The fifth one is referred as Localisation: a metric value at a given rank position k should depend only on the documents in the first k positions. This axiom can be applied only to metrics that assume a certain deepness threshold as input parameter. That is, it is not a general axiom. The sixth property, Completeness, states that the metric must be definable even when there are no relevant documents in the collection. Finally, Realisability states that the maximal score can be achieved even when there exists only one relevant document in the collection. This property is related with the normalisation of scores across test cases.

Ferrante et al. (2015) proposed two axioms: Replacement (replacing an irrelevant document in the output ranking with a relevant one increases the score) and Swapping (swapping two documents in the ranking output in concordance with the gold annotation relative relevance increases the score). In fact, if we assume that documents out of the output ranking are located together in an additional last ranking position, then Replacement is subsumed by Swapping. In addition, Swapping is equivalent to Convergence.

Amigó et al. (2013) proposed five axioms. The first one (Priority (PRI)) states again that swapping documents in a correct way increases the score. It is similar to Ferrante’s Swapping axiom, and equivalent to it, although somehow more relaxed since it requires a metric score increase only when the ranking position of the swapped documents are contiguous. The next three axioms are related with assumptions about the user task: deeper positions in the ranking are less likely to be explored (Deepness axiom); there is an area at the top of the ranking which is always explored by the user (Closeness Threshold axiom); and there is an area deep enough that is never explored by the user (Deepness Threshold axiom). The last two axioms are formalised in terms of comparing one relevant document in the first position, n relevant documents in the 2n first ranking positions, and a huge amount of relevant documents after a huge amount of irrelevant documents. The fifth axiom states that given a ranking containing only irrelevant items, the shorter the ranking the higher the metric score (Confidence axiom). This axiom is also task-dependent. We could think that reducing the ranking length, instead of avoiding user effort, adds uncertainty about the possibility of finding relevant documents at lower ranks.

In summary, according to our analysis, most of the axioms and properties are task-dependent. The basic axioms, common to all studies, are related with the correctness of priority relationships: the Swapping constraint proposed by both Ferrante et al. (2015) and Moffat (2013), and its relaxed version, the Priority axiom proposed by Amigó et al. (2013).

We also briefly mention the task of assigning items to categories which have an order, named Ordinal Classification: is a sort of mixture between classification and ranking. It is a quite popular task in some situations: besides the common assignment of “stars” in several Web reviews sites, let us consider the polarity detection task, in which text fragments must be classified in terms of sentiment analysis according to a few categories such as “Very positive”, “Positive”, “Neutral”, “Negative” and “Very negative”. The task is defined in an ordinal manner, but the perfect system should return exactly the same values, so it is not a ranking problem like traditional IR. Then, some evaluation campaigns have applied a classification oriented metric, such as in Barbieri et al. (2016). In other campaigns, systems were evaluated with ranking metrics, such as in Amigó et al. (2013); in other ones, different tasks were defined in concordance with different metrics (ranking or classification), such as in Rosenthal et al. (2017).

We can see a similar situation in semantic textual similarity (Agirre et al. 2015) in which text pairs must be categorised according to a few classes (high / average similarity, etc.): the organizers used Pearson correlation. The prediction of stars in product recommendation also fits into this case: sorting products in a correct way is desirable, as well as assigning the correct amount of stars. In fact, recommender systems were initially evaluated in terms of accuracy, before the community began to work under ranking based metrics. We can find a few studies in the literature that analysed the behavior of some traditional metrics in this task such as Mean Average Error, Mean Squared Error, linear correlation, or Accuracy with n (Gaudette and Japkowicz 2009), proposed a method to make Ordinal Classification metrics robust to imbalance (Baccianella et al. 2009), and analysed the suitablity of traditional metrics by means of a particular case and proposed the Ordinal Classification Index metric (Cardoso and Sousa 2011). However, differently from the previous abstract tasks, these studies do not define properties to be satisfied, and in general there is not a clear choice of the metric to be used when predicting a few labels that keep an ordinal relationship.

2.4 Quantitation axiomatics

Quantitation,Footnote 2 i.e., the abstract task of assigning numeric values to documents, is perhaps less common, but it is not only theoretical and some examples can be found. In the Semantic Textual Similarity task at SemEval-2016 (Agirre et al. 2016), systems are asked to return a numeric value predicting the similarity between two snippet of texts. In some sentiment polarity detection tasks, the absolute polarity values returned by systems are compared with the reference values. In both cases the stated goal consists of maximising the linear correlation between system outputs and golds. The most frequent metric in these cases is the Pearson coefficient, although this choice is being criticised (Reimers et al. 2016) and might change in the future.

In other cases the goal consists of predicting the exact values. One example is the proposal to evaluate information retrieval systems not only on the basis of the rank of the retrieved documents, but on the basis of the numeric relevance values assigned to document. With this approach, and assuming a continuous notion of relevance, Della Mea and Mizzaro (2004) proposed ADM (Average Distance Measure). More recently, magnitude estimation has been proposed as a technique to gather relevance assessment on a ratio scale (Maddalena et al. 2017). One might even claim that there is a trend to go beyond the classical (category relevance and ranking retrieval) situation, although we are not aware of any attempt to capture the properties of metrics for quantitation. Therefore we include this abstract task in our analysis.

3 Measurement theory and closeness

Measurement theory is briefly recalled in Sects. 3.1 and 3.2 (for more details see Appendix A), where we also discuss the notion of closeness. We then outline the structure of our framework in Sect. 3.3. Sections 3.43.6 provide some definitions, and Sect. 3.7 provides an example (and the reader might find it useful to go back to Sects. 3.43.6 after having read it).

3.1 Measurement theory

Appendix A recalls some of the basic concepts of measurement theory, that we assume as known in the following: assignment, measurement, scale types and permissible transformation functions, equivalence, and meaningfulness (see Definitions A.1A.4 and A.6). The reader familiar with measurement theory can probably just skim the appendix to get acquainted with our notation, or maybe even skip it and refer to specific parts when needed.

Briefly, measurement theory studies the properties of value assignments to objects like, for instance, temperature, height, and distance. At the core of measurement theory there is the notion of scale type. The classical scale types are nominal, ordinal, interval, and ratio. At each scale type, some relationships between values make sense and others are not meaningful. For instance, at the nominal scale type only equality and inequality of values can be taken into account, whereas considering the ratio or the interval between values makes no sense (e.g., “red” divided by “green”). For each scale type, a set of permissible transformation functions is defined. These, when applied to a measurement, determine which are the measurements that, though different, are equivalent, i.e., carry the same meaning. For example, for the ordinal scale type, value assignments that order objects in the same way are equivalent, and the set of permissible transformation functions are the strictly monotonically increasing functions.

This matches with our abstract tasks: in general, we can say that systems assign values to items (relevance, topics, categories, priority, etc.), and evaluation consists in comparing the system output assignment against the human annotated assignment (gold). As an example, suppose that we want to categorize some documents into “physics”, “biology”, and “social sciences”. This is a classification problem, and there are no ordinal or interval relationships between classes. The goal consists in maximizing the amount of equalities between system output values and gold values, and there is no consideration about the range or interval of errors. In other words, the predicted categories are either accurate or not. In other terms, classification problems can be mapped onto the nominal scale type. Similarly, ranking problems can be mapped onto the ordinal scale type, and quantitation problems onto the interval and ratio scales. However the situation is slightly more complex; this can be seen when considering that clustering can be mapped onto the nominal scale type as well, but it is an intrinsically different problem from classification. Anticipating what we will discuss in detail in the following, we model this difference on the basis of the kind of closeness that can be defined between two measurements: for classification we seek for equal measurements; for clustering we seek for equivalent measurements. We will also discuss that it is not even necessary to assume that system outputs and golds are measurements; assignment is enough.

Incidentally, we remark that by exploiting the basic concepts of measurement theory, like scale types and meaningful statements, it would be possible to directly derive some consequences on effectiveness metrics. For example, the well known Mean Reciprocal Rank (MRR) metric is computed by considering the rank of the first relevant document, computing its reciprocal, and averaging these values. But by doing so, one is neither applying permissible transformation functions, nor deriving meaningful statements, for the scale type at hand, which is the ordinal one; a similar remark has been made by Fuhr (2018). Also the widely adopted Normalized Discounted Cumulative Gain (nDCG) metric transforms an ordinal relevance scale (e.g., Highly relevant, Relevant, Marginally relevant, Not relevant) into numerical gains which are on a ratio scale type, and this is intrinsically arbitrary.

But we believe that the consequences of measurement theory on evaluation are deeper; in the rest of the paper we discuss those more foundational aspects.

3.2 Closeness

Measurement theory focusses on equivalence rather than closeness. For instance, according to Suppes and Zinnes (1963) the two first fundamental problems in measurement theory are the Representation Theorem and Uniqueness. Both are related with finding homomorphisms between the empirical and numerical structures, thus studying which assignments are indeed measurements. The third problem is the Meaningfulness Problem (see Sect. A.4). There is nothing that discusses the similarity, or distance, or closeness of two assignments or measurements. However, the main goal of evaluation is to compare system outputs against gold standards or references, which are almost never equivalent. In fact, equivalence would mean perfect effectiveness, which is an extremely rare event in information access. So, a simple binary comparison (equivalent versus non equivalent) is not enough, and, in this sense, the concept of closeness between two assignments (and, consequently, measurements) needs to be added to measurement theory concepts to formalise the evaluation scenario. As we will see in Sect. 4, this will allow to define the notion of evaluation metrics. Going back to the classification problem of the example in Sect. 3.1, this means that we need to model the closeness between system output values and gold values at the nominal scale type.

As another example, suppose that we are interested in grouping documents correctly. That is, the system must assign to documents about physics the same value, but different from biology documents. In this case, the document tags generated by the system (assigned values) are not necessarily equal to the category identifiers provided by humans in the gold. For this, we can exploit the notion of equivalence in measurement theory. That is, two assignments are equivalent at the nominal scale type if they keep the same equality relationships between objects. For instance, the assignments (\(A=1\), \(B=1\), \(C=2\)) and (\(A=3\), \(B=3\), \(C=1\)) are equivalent at the nominal scale type, given that, in both cases, \(A=B\ne C\) is true. This matches with our clustering problem. In terms of measurement theory we can say that the goal of clustering evaluation is to quantify how close is the system output to be equivalent to the gold at the nominal scale type.

Let us finally consider the abstract task of sorting documents according to their relevance, the classical document ranking problem. In this case, the evaluation must consider the ordinal relationships between assigned values, while the interval or ratio between assigned values is not important. That is, documents must be sorted in the same way as in the gold, i.e., relevant documents earlier than irrelevant documents. In terms of measurement theory, the goal of ranking is generating a relevance assignment equivalent to the gold. We can interpret document ranking as an ordinal equivalence oriented problem.

We state two general definitions of evaluation metric, value-oriented and equivalence-oriented, and we prove that, depending on the scale type, existing metrics and their desirable properties defined in the literature for each abstract task fit into the corresponding definition. But then, more situations appear spontaneously in the proposed model. For instance, in polarity detection, (positive, neutral, negative) categories present an ordinal relationship, while the intervals between polarity levels are undefined. However, unlike in ranking problems, predicting the specific polarity category of documents must be rewarded. This is an Ordinal Classification problem (see Sect. 2.3), but there is not a consensus about how this problem must be formalized. Our framework identifies it as a value-oriented ordinal problem, fitting in our definition.

The proposed model goes further, and includes abstract tasks at interval and ratio scales. The model also allows us to analyze evaluation metrics, and to understand when they can be used appropriately. For example, we prove in this paper that Pearson coefficient fits into our equivalence-oriented metric definition at the interval scale type, the popular cosine distance fits into the equivalence-oriented metric definition at the ratio scale type, and the commonly used error rate or mean average error fit into the value-oriented metric definition at the ratio or interval scale type.

3.3 Structure of the framework

By exploiting some basic results of measurement theory, as well as the notion of closeness, we now turn to defining our framework. Figure 1 sketches the overall structure of the framework and can be a useful reference in the following. We ground on some definitions from measurement theory (orange boxes with dotted borders); we have briefly recalled them above but the reader is referred to the corresponding Definitions A.1A.6 in Appendix A. Then, we analyse the notion of proximity, or closeness (dashed green boxes). We start from a simple but general definition of closeness (Definition 1) and we particularize it for the four classical scale types (Definition 2). On this basis, we define two kinds of closeness measures for assignments, depending on whether they focus on value matching (Definition 3) or on measurement equivalence (Definition 4). These definitions can be particularised for any scale type: nominal, ordinal, interval, or ratio. Then, moving to the yellow boxes, we define system outputs and golds as assignments of numeric values to items (Definition 5); let us remark that this is done for compatibility with measurement theory and without loss of generality as long as we consider the abstract tasks. Finally, the definition of metric in general is directly derived from the notion of closeness (Definition 6), as well as the the two specific definitions of value- and equivalence-oriented metrics (Definitions 7 and 8).

Fig. 1
figure 1

Overall structure of the framework: from basic definitions of measurement theory, through closeness, to the two families of metrics

3.4 Defining closeness for a scale type

In this subsection, we start from a simple but general definition of closeness between single values, and then we propose a definition dependent on the scale types. Exploiting these definitions, a more intuitive description of closeness for the scale types is then derived in Lemma 1; this will be used to define closeness between assignments in the next subsections.

A first important remark is that it is possible to define different distance functions, and therefore, multiple closeness notions. There is some unavoidable arbitrariness. A second remark is that the closeness between values also depends on the scale type of reference. For the nominal scale type, two values are similar when they are equal (e.g., since \(3\ne 4=4\), 4 is more similar to 4 than 3). For the ordinal scale type, we observe relative closeness when values are in sequence (e.g., since \(3<4<5\), 3 is more similar to 4 than to 5). For the interval scale type, when the absolute difference is lower (e.g., since \(|3-5|<|1000-5|\), 3 is more similar to 5 than 1000). Notice that due to the subsumption effect across scale types (Formulas (30) and (31)), each assertion is valid for the other higher scale types (e.g., if \(3\ne 4=4\) then \(3<4\le 4\) and \(|3-4|>|4-4|\)).

For our framework a simple definition of closeness is sufficient, based on a distance between two values \(x ,y \in {{\,{{\mathbb {R}}}\,}}\) computed as \(|x-y|\). Moreover, we are interested in comparing closeness values, not in precise closeness values. Therefore we do not define a function returning a closeness value, but simply the following relationship.

Definition 1

(Closeness) Let \(x, y, r \in {{\,{{\mathbb {R}}}\,}}\). We say that x is (strictly) closer to a reference value r than y, and we write \({x} \mathrel {{\preccurlyeq }^{r}} {y}\) (\({x} \mathrel {{\prec }^{r}} {y}\) for the strict case), if and only if:Footnote 3

$$\begin{aligned} {x} \mathrel {{\preccurlyeq }^{r}} {y}&\Longleftrightarrow |r-x|\le |r-y| \end{aligned}$$
(1)
$$\begin{aligned} {x} \mathrel {{\prec }^{r}} {y}&\Longleftrightarrow {x} \mathrel {{\preccurlyeq }^{r}} {y}\wedge \lnot ({y} \mathrel {{\preccurlyeq }^{r}} {x}) \Longleftrightarrow |r-x|<|r-y| . \end{aligned}$$
(2)

We now particularize closeness for a certain scale type. We refer to the four classic scale types, i.e., Nominal (denoted with \(\mathtt {N}\) from now on), Ordinal (\(\mathtt {O}\)), Interval (\(\mathtt {I}\)), and Ratio (\(\mathtt {R}\)). There is a natural order on the scale types, going from the lowest scale type \(\mathtt {N}\) to the highest scale type \(\mathtt {R}\). This order is derived from the inclusion chain of the permissible transformation functions of the four scale types: if \({\mathcal {F}}_{\mathtt {T}}\) denotes the set of permissible transformation functions for the scale type \(\mathtt {T}\), then \({\mathcal {F}}_{\mathtt {R}}\subset {\mathcal {F}}_{\mathtt {I}}\subset {\mathcal {F}}_{\mathtt {O}}\subset {\mathcal {F}}_{\mathtt {N}}\). This allows us to write \(\mathtt {N}< \mathtt {O}< \mathtt {I} < \mathtt {R}\) and to speak of higher and lower scale types accordingly. See Appendix A for further details.

Definition 2

(Closeness for a scale type) Let \(x, y, r \in {{\,{{\mathbb {R}}}\,}}\), and \({\mathcal {F}}_{\mathtt {T}}\) the set of permissible transformation functions for the scale type \(\mathtt {T}\). We say that x is closer to a reference r than yfor a certain scale type\(\mathtt {T}\), and we write \({x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y}\) if and only if it is closer for at least one permissible transformation function in \({\mathcal {F}}_{\mathtt {T}}\):

$$\begin{aligned} {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y}&\Longleftrightarrow \exists f \in {\mathcal {F}}_{\mathtt {T}}\left( {f(x)} \mathrel {{\preccurlyeq }^{f(r)}} {f(y)} \right) . \end{aligned}$$
(3)

The associated strict relationship is:

$$\begin{aligned} {x} \mathrel {\prec ^{r}_{\mathtt {T}}} {y}&\Longleftrightarrow {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y} \wedge \lnot \left( {y} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {x}\right) . \end{aligned}$$
(4)

Given a fixed reference r, non-strict closeness is a binary relationship that satisfies reflexivity (\(\forall x \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {x} \right)\)), transitivity (\(\forall x,y,z \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y} \wedge {y} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {z} \Longrightarrow {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {z} \right)\)), and is a connex relation (\(\forall x,y \left( {x} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {y} \vee {y} \mathrel {{\preccurlyeq ^{r}_{\mathtt {T}}}} {x}\right)\)). Non-strict closeness is not a total order since it is not anti-symmetric, as two different values can be equally non-strict close to the reference. The strict closeness relationship is irreflexive, transitive, asymmetric (\(\forall x,y \left( {x} \mathrel {\prec ^{r}_{\mathtt {T}}} {y} \Longrightarrow \lnot ( {y} \mathrel {\prec ^{r}_{\mathtt {T}}} {x})\right)\)), and acyclic.

Table 4 illustrates the behavior of these definition by analyzing whether the value v is non-strictly or strictly closer to the reference 0 than 1 for any scale type \(\mathtt {T}\) (i.e., \({v} \mathrel {{\preccurlyeq ^{0}_{\mathtt {T}}}} {1}\) and \({v} \mathrel {\prec ^{0}_{\mathtt {T}}} {1}\)). Each column is associated with a value v, and each row with non-strict or strict closeness under different scale types. In other terms, we ask whether a given value v is closer to 0 than 1 for a given scale type; the upper and lower parts of the table discuss strict and non-strict closeness, respectively. For instance, looking at the first row, all values are non-strictly closer (actually, equally closer) to 0 than 1 for the nominal scale type, since a permissible transformation function, i.e., a bijective function, can be found that transforms any value into any other one, including 0; looking at the fourth row, only the zero value is strictly closer to 0 than 1 for the nominal scale type. For the ordinal scale type (second row), only 2 can not be non-strictly closer to 0 than 1 (because 2 is farther away to 0 than 1 for any monotonic transformation); looking at the fifth row, any value in [0, 1) is strictly closer. For the interval and ratio scale types, any value in \((-1,1)\) is strictly closer to 0 than 1. In general, the table reflects the subsumption of permissible transformation functions across scale types (Formulas (30) and (31)). The effect is that, when considering higher scale types, the number of strict closeness relationships increases, whereas the number of non-strict closeness relationships decreases.

Table 4 Examples for closeness and strict closeness at different scale types

This definition of (strict) closeness is general and valid for any scale type. We can instantiate it into the four scale types as shown by the following lemma; this form helps intuition and it will be useful to develop the formal proofs in the following (all the proofs are in Appendix B).

Lemma 1

(Closeness for the four scale types) Let\(x, y, r \in {{\,{{\mathbb {R}}}\,}}\). The valuexis (strictly) closer to a referencerthany for each scale type \(\mathtt {T}\)respectively if and only if the conditions in Table 5are satisfied.

Table 5 The conditions for closeness and strict closeness in Lemma 1

We now turn to closeness for assignments. We define two kinds of closeness (value-oriented and equivalence-oriented), whose meaning is discussed in the example in Sect. 3.7.

3.5 Value-oriented assignment closeness

On the the basis of value closeness, we define a first closeness notion between assignments for a certain scale type. Consistently with Definition A.1, we denote assignments with \(\omega\), \(\omega ^{\prime }\), \(\omega _i\), etc.

Definition 3

(Value-oriented assignment closeness) Given a set of objects \({\mathcal {D}}\), an assignment \(\omega\) is value-closer to a reference assignment \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) (we write \({\omega }\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\) and we speak of Value-Oriented Assignment Closeness) if and only if for every value (i.e., objects in \({\mathcal {D}}\)) \(\omega\) is closer to \(\rho\) than \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\):

$$\begin{aligned} {\omega }\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} ~~\Longleftrightarrow ~~ \forall d\in {{\mathcal {D}}} \left( {\omega (d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega ^{\prime }(d)} \right) . \end{aligned}$$
(5)

Moreover, we say that an assignment\(\omega\)is strictly value-closer to a reference assignment \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) if:

$$\begin{aligned} {\omega }\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \Longleftrightarrow {\omega }\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \wedge \lnot \left( {\omega ^{\prime }}\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega }\right) . \end{aligned}$$

Note that strict value-closeness can be expressed as:

$$\begin{aligned} {\omega }\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \Longleftrightarrow&\forall d\in {{\mathcal {D}}} \left( {\omega (d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega ^{\prime }(d)} \right) \wedge \lnot \left( \forall d\in {{\mathcal {D}}} \left( {\omega ^{\prime }(d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega (d)} \right) \right) \\ \Longleftrightarrow&\forall d\in {{\mathcal {D}}} \left( {\omega (d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega ^{\prime }(d)} \right) \wedge \exists d\in {{\mathcal {D}}} \left( \lnot \left( {\omega ^{\prime }(d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega (d)} \right) \right) , \end{aligned}$$

i.e., by applying (4) for any scale type \(\mathtt {T}\in \{\mathtt {N}, \mathtt {O},\mathtt {I},\mathtt {R}\}\),

$$\begin{aligned} {\omega }\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \Longleftrightarrow&\forall d\in {{\mathcal {D}}} \left( {\omega (d)} \mathrel {{\preccurlyeq ^{\rho (d)}_{\mathtt {T}}}} {\omega ^{\prime }(d)} \right) \wedge \exists d\in {{\mathcal {D}}} \left( {\omega (d)} \mathrel {\prec ^{\rho (d)}_{\mathtt {T}}} {\omega ^{\prime }(d)} \right) . \end{aligned}$$
(6)

We use “value-closer” (and not simply “closer”) because we now define another closeness notion for assignments, before discussing an example in Sect. 3.7.

3.6 Equivalence-oriented assignment closeness

We formalise the closeness between assignments in terms of their equivalence class, rather than value correspondence.

Definition 4

(Equivalence-oriented assignment closeness) Given a set of objects \({\mathcal {D}}\), an assignment \(\omega\) is equivalence-closer to a reference \(\rho\) than another assignment \(\omega ^{\prime }\) for the scale type \(\mathtt {T}\) (we write \({\omega }\mathrel {\sqsubseteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}\) and we speak of Equivalence-Oriented Assignment Closeness) if and only if for every assignment \(\omega _i^{\prime }\) in the equivalence class of \(\omega ^{\prime }\), there exists at least one assignment in the equivalence class of \(\omega\) that is value-closer to \(\rho\) for the scale type \(\mathtt {T}\):

$$\begin{aligned} {\omega }\mathrel {\sqsubseteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}&\Longleftrightarrow \forall \omega _i^{\prime } \in [\omega ^{\prime }]_\mathtt {T} \left( \exists \omega _i \in [\omega ]_\mathtt {T} ( {\omega _i}\mathrel {\trianglelefteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }_i})\right) . \end{aligned}$$
(7)

The strict closeness is analogous to Definition 3:

$$\begin{aligned} {\omega }\mathrel {\sqsubset ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }}&\Longleftrightarrow {\omega }\mathrel {\sqsubseteq ^{\rho }_{\mathtt {T}}} {\omega ^{\prime }} \wedge \lnot \left( {\omega ^{\prime }}\mathrel {\sqsubseteq ^{\rho }_{\mathtt {T}}} {\omega }\right) . \end{aligned}$$
(8)

Two assignments are closer according to Definition 3 if they tend to assign similar values, and according to Definition 4 if there exists a permissible transformation function that makes them assign similar values. In general, Definition 3 is useful for tasks like “measure temperature using the Celsius scale”; Definition 4 for tasks like “measure temperature using a scale of the interval scale type”. Referring to our abstract tasks, as we will discuss in detail in the following, the former is useful for categorization, where the assigned values are important; the latter for clustering, where cluster labels can be changed without affecting the result. The two kinds of assignment closeness lead to two different families of metrics, as discussed in Sect. 4; we first provide an intuitive example.

3.7 An example

As an example, consider the situation in Table 6, representing some assignments of three objects in \({\mathcal {D}} = \{o_1, o_2, o_3 \}\): \(\rho\) is the reference assignment, and the \(\omega _i\) are some different assignments. Now, if the scale type is \(\mathtt {N}\) or \(\mathtt {O}\), all the assignments are equivalent. Therefore, these \(\omega _i\) are all equally close to the reference \(\rho\) in terms of equivalence-closeness.

Table 6 The example described in the text

However, the situation changes when looking at value-oriented closeness, since \(\omega _2\) and \(\omega _3\) are strictly value-closer to \(\rho\) than the other assignments, given that they achieve equality for objects \(o_2\) and \(o_3\). Thus, for the \({\mathtt {N}}\) scale type:Footnote 4

$$\begin{aligned} {\omega _2, \omega _3}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {N}}} {\omega _1, \omega _4} . \end{aligned}$$
(9)

If the scale type is \(\mathtt {O}\) the situation is slightly more complex. In addition to the previous relationships, now \(\omega _2\) is closer to \(\rho\) than \(\omega _3\) (\({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {O}}} {\omega _3}\)) since the value 11 is ordinal closer to 10 than 12. At interval (or ratio) scale type, in addition, \({\omega _4}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {I}}} {\omega _1}\), and therefore (where \(\mathtt {T}\) is \(\mathtt {I}\) or \(\mathtt {R}\))

$$\begin{aligned} {{\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega _3}}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {{\omega _4}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {T}}} {\omega _1}}. \end{aligned}$$
(10)

The meaningful statements of the interval scale type capture differences between assignments that are not captured at ordinal or nominal scale types.

When considering again equivalence-oriented assignment closeness, we obtain a different, and contradictory, outcome: \(\omega _1\) and \(\omega _4\) are closer to \(\rho\) than \(\omega _2\) and \(\omega _3\), i.e.,

$$\begin{aligned} {\omega _1, \omega _4}\mathrel {\sqsubset ^{\rho }_{\mathtt {I}}} {\omega _2, \omega _3}. \end{aligned}$$
(11)

To see why, consider that for any linear transformation applied to \(\omega _2\) or \(\omega _3\), we can define a transformation for \(\omega _1\) or \(\omega _4\) to make them closer to \(\rho\): for instance \(\omega _1^{\prime } = \omega _1 * 10\) and \(\omega _4^{\prime } = \omega _4 - 2\). Finally, if the scale type is \(\mathtt {R}\), \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _4\), i.e.,

$$\begin{aligned} {\omega _1}\mathrel {\sqsubset ^{\rho }_{\mathtt {R}}} {\omega _4}. \end{aligned}$$
(12)

It is clear that which assignment to prefer depends on the scale type, the abstract task, and whether value-oriented or equivalence-oriented closeness is used.

4 Metrics: two families, eight classes

We can now turn to analyse effectiveness metrics. As will be seen, the framework just presented, which grounds the notion of closeness in measurement theory, allows us to state general definitions and theorems in the next sections.

We first define some notation and provide a basic definition of metrics (in Sect. 4.1); on this basis, and exploiting the two kinds of closeness notions, namely value- and equivalence-oriented (Definition 3 in Sect. 3.5 and Definition 4 in Sect. 3.6, respectively), we distinguish between two families of metrics, namely value- and equivalence-oriented metrics (in Sects. 4.2 and 4.3, respectively). We then revisit the example of Sect. 3.7 (in Sect. 4.4) and classify the metrics into eight classes (in Sect. 4.5).

4.1 System outputs, gold standards, and metrics

The first step is to formalise system outputs and golds as assignments: they can be relevance judgments, assigned categories, etc. We use \(\alpha\) for golds and \(\sigma\) for system outputs (\(\alpha\) is a mnemonic for assessment, \(\sigma\) for system); thus \(\alpha , \sigma \in \varOmega\) (see Definition A.1).

Definition 5

(System output and gold) A system output \(\sigma\) or a gold \(\alpha\) is an assignment from a set of documents \({\mathcal {D}}\) to real numbers \({{\,{{\mathbb {R}}}\,}}\):

$$\begin{aligned} \sigma : {\mathcal {D}} \longrightarrow {{\,{{\mathbb {R}}}\,}}\text { and } \alpha : {\mathcal {D}} \longrightarrow {{\,{{\mathbb {R}}}\,}}. \end{aligned}$$

This approach is different from the classical one by van Rijsbergen (1981), who considered measuring retrieval effectiveness itself as a measurement, as well as from the recent proposal by Ferrante et al. (2017), who consider evaluation metrics as measurements and set out to determine whether an IR evaluation metric is a measure on an interval scale. We do not represent an effectiveness metric as a measurement. Our approach is more similar to the already cited work of Busin and Mizzaro (2013) and Maddalena and Mizzaro (2014), but with an important difference: that previous work modeled system outputs and golds as measurements, whereas in our approach they are simply assignments. This is an important simplification. We also remark that we are not the only ones to represent golds as assignments. For example, Ferrante et al. (2019, Section 4) write: “the ground-truth GT is a map which assigns a relevance degree \(rel \in REL\) to a document d with respect to a topic \(t\)”. We extend that approach by applying it to (i) system outputs and (ii) other abstract tasks beyond IR (Ranking).

On the basis of Definition 5 we can represent any system output and gold. For example, when human assessors judge relevance using the usual 4-level scale Highly relevant, Relevant, Marginally relevant, Not relevant, it is common to translate the judgments into the numeric values 3, 2, 1, and 0. Turning to system outputs, of course the Retrieval Status Values are an assignment; but any ranked list of retrieved documents can easily be converted to an assignment as well, for example using the reciprocal of the rank. If the abstract task is not Ranking but, say, Clustering, the gold and the system output will again be assignments of numbers to documents, with the natural convention that two documents have the same value if and only if they are in the same cluster, according to the gold and/or the system output; similarly for Classification.
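As an illustration (document names, judgments, and values are our own hypothetical choices), the following sketch encodes a graded-relevance gold, a ranked list converted to an assignment via reciprocal ranks, and a clustering output where only value equality matters:

```python
docs = ["d1", "d2", "d3", "d4"]

# Gold for Ranking: 4-level relevance judgments mapped to 3, 2, 1, 0.
levels = {"highly": 3, "relevant": 2, "marginally": 1, "not": 0}
alpha = {"d1": levels["highly"], "d2": levels["not"],
         "d3": levels["relevant"], "d4": levels["marginally"]}

# System output for Ranking: a ranked list converted to an assignment
# using the reciprocal of the rank (to be read on the ordinal scale type).
ranked = ["d3", "d1", "d4", "d2"]
sigma = {d: 1 / (i + 1) for i, d in enumerate(ranked)}

# System output for Clustering: equal values mean "same cluster";
# the actual numbers are irrelevant (nominal scale type).
clusters = {"d1": 7, "d2": 7, "d3": 42, "d4": 42}

print(alpha, sigma, clusters, sep="\n")
```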

On this basis we define a metric as a function that, given a system output \(\sigma\) and a gold \(\alpha\), returns a real value that depends on how close \(\sigma\) is to \(\alpha\).

Definition 6

(Metric) A metric is a function \({\mathcal {M}}: \varOmega ^2 \longrightarrow {{\,{{\mathbb {R}}}\,}}\).

We remark that some authors and metrics require a bounded codomain for \({\mathcal {M}}\): Moffat’s (2013) first property is Boundedness (see Sect. 2.3), and several metrics assume values in [0, 1]. However, this is not always the case: some metrics (that we will analyze in the following) such as Utility metrics (for classification) are unbounded and assume values in \((-\infty ,+\infty )\); other metrics, like DCG (for IR) and MAE (for quantitation), assume values in \([0,+\infty )\); and Pearson and Spearman assume values in \([-1, +1]\). We choose the most general codomain for two reasons: (i) we aim at a general framework, thus we do not want to exclude metrics, and to this end we focus on monotonicity and invariance properties; and (ii) normalizations can always be applied, though this seems a technical and minor issue that we leave for future work.

However, we have defined two notions of closeness; therefore we define two families of metrics, as follows.

4.2 First family: value-oriented metrics

Value-oriented metrics quantify to what extent a system output resembles the values assigned to items in the gold: the closer the system values are to the gold values, the higher the score the metric returns for the system. For instance, the spam/non-spam values assigned by a spam filter should match the values given by the human references. Given that the concept of closeness defined in the previous section depends on the scale type, the definition of a value-oriented metric also depends on the scale type. In addition, transforming both the system output and the gold by the same permissible transformation function should not affect the metric result. For instance, we can apply a bijective transformation from the value representing the category “spam” to the value representing the category “trash-messages” (a transformation that, being bijective, is in \({\mathcal {F}}_{\mathtt {N}}\), i.e., a permissible transformation function for the nominal scale type), as long as we apply it to both the system output and the gold.

We first define the following two properties.

Property 1

(VOI, value-oriented invariance) A metric \({\mathcal {M}}\) is value-oriented invariant for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system output \(\sigma\), both of them on the same set of objects \({\mathcal {D}}\), the metric value does not change by applying the same permissible transformation to both \(\alpha\) and \(\sigma\):

$$\begin{aligned} \forall \alpha , \sigma \in \varOmega ,~\forall f\in {\mathcal {F}}_{\mathtt {T}} \Big ( {\mathcal {M}} (\sigma ,\alpha )={\mathcal {M}} (f(\sigma ),f(\alpha )) \Big ). \end{aligned}$$
(13)

Property 2

(VOM, value-oriented monotonicity) A metric \({\mathcal {M}}\) is value-oriented monotonic for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system outputs \(\sigma\) and \(\sigma ^{\prime }\), all three of them on the same set of objects \({\mathcal {D}}\), it holds that if \(\sigma\) is value-closer to \(\alpha\) than \(\sigma ^{\prime }\), then the metric value for \(\sigma\) has to be higher than that for \(\sigma ^{\prime }\). In formulas:

$$\begin{aligned} \forall \alpha , \sigma , \sigma ^{\prime } \in \varOmega \Big ( {\sigma }\mathrel {\vartriangleleft ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} \Longrightarrow {\mathcal {M}} (\sigma ,\alpha )>{\mathcal {M}} (\sigma ^{\prime },\alpha ) \Big ). \end{aligned}$$
(14)

We can now specialize Definition 6 and formally define a value-oriented metric on the basis of value-oriented assignment closeness (Definition 3) as follows.

Definition 7

(Value-oriented metric) A value-oriented evaluation metric for the scale type \(\mathtt {T}\) is a metric that satisfies the two properties VOI and VOM for the scale type \(\mathtt {T}\).

Therefore, the metrics of this family need to satisfy just two basic properties: (i) invariance to permissible transformation functions (VOI), i.e., the metric value does not change when transforming both assignments in the same permissible way; and (ii) monotonicity with respect to value-oriented closeness (VOM), i.e., if an assignment is value-closer to the gold than another, then the former has a higher metric value. Note that the two properties VOI and VOM depend on the scale type: a given function could be a metric for a scale type \(\mathtt {T}\) and not be a metric for another scale type \(\mathtt {T'}\). When needed, we will make the scale type explicit by using it as a subscript for the properties (e.g., \(\hbox {VOM}_\mathtt {N}\)).

4.3 Second family: equivalence-oriented metrics

Directly comparing the values of system outputs to the values assigned by the gold is in some cases too strict. For instance, the purpose of search engines consists of offering the most relevant documents rather than quantifying the relevance of documents: the key point is that any assignment that keeps the ranking of documents in the search engine output is equally effective, that is, any assignment in the same equivalence class for the ordinal scale type.

The metrics of the second family, the equivalence-oriented metrics, are formally defined on the basis of the following two properties.

Property 3

(EOI, equivalence-oriented invariance) A metric \({\mathcal {M}}\) is equivalence-oriented invariant for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system output \(\sigma\), both of them on the same set of objects \({\mathcal {D}}\), the metric value does not change by applying any permissible transformation to \(\sigma\). In formulas:

$$\begin{aligned} \forall \alpha , \sigma \in \varOmega ,~\forall f\in {\mathcal {F}}_{\mathtt {T}} \Big ( {\mathcal {M}} (\sigma ,\alpha )={\mathcal {M}} (f(\sigma ),\alpha ) \Big ) . \end{aligned}$$
(15)

Property 4

(EOM, equivalence-oriented monotonicity) A metric \({\mathcal {M}}\) is equivalence-oriented monotonic for the scale type \(\mathtt {T}\) if, for any reference gold \(\alpha\) and system outputs \(\sigma\) and \(\sigma ^{\prime }\), all three of them on the same set of objects \({\mathcal {D}}\), it holds that if \(\sigma\) is equivalence-closer to \(\alpha\) than \(\sigma ^{\prime }\), then the metric value for \(\sigma\) has to be higher than that for \(\sigma ^{\prime }\). In formulas:

$$\begin{aligned} \forall \alpha , \sigma , \sigma ^{\prime } \in \varOmega \Big ( {\sigma }\mathrel {\sqsubset ^{\alpha }_{\mathtt {T}}} {\sigma ^{\prime }} \Longrightarrow {\mathcal {M}} (\sigma ,\alpha )>{\mathcal {M}} (\sigma ^{\prime },\alpha ) \Big ). \end{aligned}$$
(16)

We can now specialize Definition 6 again and define an equivalence-oriented metric on the basis of equivalence-oriented assignment closeness (Definition 4) as follows.

Definition 8

(Equivalence-oriented metric) An equivalence-oriented evaluation metric for the scale type \(\mathtt {T}\) is a metric that satisfies the two properties EOI and EOM for the scale type \(\mathtt {T}\).

Therefore, the metrics of this family, too, need to satisfy only two basic properties, namely invariance (EOI) and monotonicity (EOM), although these are defined in a slightly different way from the corresponding properties of the previous family of metrics (VOI and VOM): in EOI the function f is applied to \(\sigma\) only, and in EOM equivalence-oriented assignment closeness is used. As above, we will use the scale type as a subscript for the EOI and EOM properties when needed.
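The difference between the two invariance properties can also be checked mechanically. In the following illustrative sketch (our own, not from the paper), the agreement rate (Accuracy, analysed in Sect. 7.1) is invariant when the same bijective relabeling is applied to both assignments (the VOI requirement at \(\mathtt {N}\)), but not when the relabeling is applied to the system output only (the EOI requirement):

```python
def accuracy(sigma, alpha):
    """Fraction of documents on which the two assignments agree."""
    return sum(s == a for s, a in zip(sigma, alpha)) / len(sigma)

sigma = [0, 1, 2, 1, 0, 2, 2, 1]
alpha = [0, 1, 1, 1, 0, 2, 0, 2]

f = {0: 1, 1: 2, 2: 0}                  # a bijection: permissible for N

# VOI_N: transforming BOTH assignments leaves the metric unchanged.
assert accuracy([f[s] for s in sigma],
                [f[a] for a in alpha]) == accuracy(sigma, alpha)

# EOI_N would require invariance when f is applied to sigma ONLY;
# Accuracy does not satisfy it: the score changes.
print(accuracy(sigma, alpha))                    # 0.625
print(accuracy([f[s] for s in sigma], alpha))    # 0.25
```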

4.4 The example revisited

We provide some intuition by discussing again the example in Sect. 3.7 and Table 6: here \(\rho\) is the gold (\(\alpha\) using the notation of this section) and the \(\omega _i\) are the system outputs (\(\sigma _i\)). Let us first interpret the assignments as being on the \(\mathtt {I}\) scale type, and focus on \(\omega _1\) and \(\omega _2\). Since \(\omega _2\) is value-closer to \(\rho\) than \(\omega _1\) (\({\omega _2}\mathrel {\vartriangleleft ^{\rho }_{\mathtt {I}}} {\omega _1}\), see Formula (10)), a value-oriented metric will assign a higher value to \(\omega _2\) than to \(\omega _1\) (\({\mathcal {M}}(\omega _2,\rho )>{\mathcal {M}}(\omega _1,\rho )\)), because of the VOM property. Conversely, since \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _2\) (\({\omega _1}\mathrel {\sqsubset ^{\rho }_{\mathtt {I}}} {\omega _2}\), see Formula (11)), an equivalence-oriented metric will assign a higher value to \(\omega _1\) than to \(\omega _2\) (\({\mathcal {M}}(\omega _1,\rho )>{\mathcal {M}}(\omega _2,\rho )\)), because of the EOM property. Metrics of the first family reward \(\omega _2\) for “almost guessing” the correct \(\rho\) values; metrics of the second family reward \(\omega _1\) for better guessing the ratios between the intervals (i.e., the meaningful statements for \(\mathtt {I}\)) among the values in \(\rho\).

Indeed, \(\omega _1\)'s guess of the ratios of the intervals is not only better: it is perfect. This can also be seen considering the EOI property: if the values in \(\omega _1\) are multiplied by ten (a permissible transformation function for \(\mathtt {I}\)) we obtain exactly \(\rho\). Since it is not possible to do better than \(\omega _1\), an equivalence-oriented metric should assign to \(\omega _1\) a score at least as high as to any other assignment. Note that the above remarks also hold when replacing \(\omega _1\) with \(\omega _4\) (apart from changing the permissible transformation function), as \(\omega _4\) is equivalent to \(\omega _1\) (simply use the transformation \(f(x) = \frac{x -2}{10}\)) and to \(\rho\) (\(f(x) = x -2\)). This also means, again because of EOI, that \({\mathcal {M}}(\omega _1,\rho )={\mathcal {M}}(\omega _4,\rho )\).

Changing the scale type will in general change the situation. For example, if we now interpret the same assignments as being on the \(\mathtt {R}\) scale type, we get different outcomes. On the \(\mathtt {R}\) scale, \(\omega _4\) is no longer equivalent to \(\omega _1\) and \(\rho\): there is no function in \({\mathcal {F}}_{\mathtt {R}}\) that maps it into the other two. Indeed, since \(\omega _1\) is equivalence-closer to \(\rho\) than \(\omega _4\) (see Formula (12)), equivalence-oriented metrics for the ratio scale type will assign a higher value to \(\omega _1\) than to \(\omega _4\) (\({\mathcal {M}}(\omega _1,\rho )>{\mathcal {M}}(\omega _4,\rho )\)), because of the EOM property.

4.5 Eight classes of metrics

Our framework is now complete: we have provided two definitions, one for each metric family, i.e., value- and equivalence-oriented metrics. Both definitions can be applied at different scale types: nominal, ordinal, interval, or ratio. By combining the four scale types \(\mathtt {N, O, I, R}\) and the two families of metrics (value- and equivalence-oriented) we obtain the eight classes of metrics summarized in Table 7.

Table 7 The eight classes of metrics

In the following we prove some theorems that show that:

  • The basic axioms proposed in the literature for specific abstract tasks can be derived from the general metric definition, taking into account the particular combination of family and scale type. Notice that we have used the term basic axiom instead of axiom. The reason is that, as we have seen in Sect. 2, some axioms depend on the particular task, while other (basic) axioms are common to any task that fits into the corresponding abstract task.

  • Existing abstract tasks, and the corresponding metrics, actually fit into our classification. More specifically, we show that each information access abstract task (classification, clustering, etc.) corresponds to a metric class, i.e., a particular combination of metric family and scale type, and that the metrics falling into the same class along these two dimensions are those used in the literature for that abstract task.

  • The theoretical limitations of metrics that have been identified in the literature (i.e., metrics that do not satisfy basic axioms) can be explained also in the general framework proposed in this paper.

  • By exploring the classes of metrics along the two dimensions (scale type and family of metric), it is possible to address evaluation gaps and provide formal definitions, for example, of Ordinal Classification metrics, which have not been addressed yet.

5 Properties and scale types

We start by making explicit some implication relationships between the four properties VOI, VOM, EOI, and EOM at different scale types. These are derived from the fact that permissible transformation functions are subsumed across scale types (see, in Appendix A, Sect. A.2 and in particular Formulas (30) and (31)): the set of bijective functions includes the set of strictly increasing functions, which in turn includes the set of linear affine functions. Therefore, closeness for lower scale types (e.g., nominal) implies closeness for higher scale types (e.g., interval). The resulting relationships are listed in the following lemma and in the two subsequent corollaries, with the aims of: (i) providing the basis to prove some properties in the rest of the paper, and (ii) helping to better understand the meaning of the four properties VOI, VOM, EOI, EOM and their relationships with the four scale types \(\mathtt {N, O, I, R}\).

Lemma 2

(Four properties, four scale types) The following relationships hold among the four properties (VOI, VOM, EOI, EOM) and the four scale types (\(\mathtt {N, O, I, R}\)).

  (a)

    If a metric satisfies VOI for a certain scale type, then it satisfies VOI for higher scale types:

    $$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T} < \mathtt {T'} ~( \mathrm {VOI}_{\mathtt {T}}&\Longrightarrow \mathrm {VOI}_{\mathtt {T'}} ). \end{aligned}$$
  (b)

    If a metric satisfies EOI for a certain scale type, then it satisfies EOI for higher scale types:

    $$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T}< \mathtt {T'} ~( \mathrm {EOI}_{\mathtt {T}}&\Longrightarrow \mathrm {EOI}_{\mathtt {T'}} ). \end{aligned}$$
  (c)

    VOI for the nominal or ordinal scale type and VOM for higher scale types are incompatible:

    $$\begin{aligned} \forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T'} \in \{\mathtt {O, I, R} \}, \mathtt {T} < \mathtt {T'} \left( \lnot \left( \mathrm {VOI}_{\mathtt {T}}\wedge \mathrm {VOM}_{\mathtt {T'}}\right) \right) . \end{aligned}$$
  (d)

    VOI for the nominal or ordinal scale type and EOM for higher scale types are incompatible:

    $$\begin{aligned} \forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T'} \in \{\mathtt {O, I, R} \}, \mathtt {T}< \mathtt {T'} \left( \lnot \left( \mathrm {VOI}_{\mathtt {T}}\wedge \mathrm {EOM}_{\mathtt {T'}}\right) \right) . \end{aligned}$$
  (e)

    VOM and EOM are incompatible, whatever the scale type:

    $$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \} \left( \lnot ( \mathrm {VOM}_{\mathtt {T}} \wedge \mathrm {EOM}_{\mathtt {T'}}) \right) . \end{aligned}$$
  (f)

    \(\hbox {VOM}_\mathtt {I}\) and \(\hbox {VOM}_{\mathtt {R}}\) are equivalent:

    $$\begin{aligned} \mathrm {VOM}_\mathtt {I}&\Longleftrightarrow \mathrm {VOM}_{\mathtt {R}}. \end{aligned}$$
  (g)

    VOM and EOI are incompatible, whatever the scale type:

    $$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \} \left( \lnot ( \mathrm {VOM}_{\mathtt {T}} \wedge \mathrm {EOI}_{\mathtt {T'}}) \right) . \end{aligned}$$
  (h)

    EOM for a certain scale type and EOI for lower scale types are incompatible:

    $$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T'} < \mathtt {T} \left( \lnot ( \mathrm {EOM}_\mathtt {T} \wedge \mathrm {EOI}_{\mathtt {T'}}) \right) . \end{aligned}$$

From this lemma, we can infer the following corollaries. It is easy to see that items (c), (d), (e), (g), and (h) can be restated as implications.

Corollary 1

(Incompatibilities and implications) Items (c) and (d) of Lemma 2 can be restated as follows. If a metric satisfies VOI for the nominal or ordinal scale type, then it does not satisfy VOM or EOM for higher scale types; and vice versa, if a metric satisfies VOM or EOM for higher scale types, then it does not satisfy VOI for the nominal or ordinal scale type:

$$\begin{aligned}&\forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T} < \mathtt {T'} \left( \mathrm {VOI}_{\mathtt {T}}\Longrightarrow \lnot (\mathrm {VOM}_{\mathtt {T'}}) \right) \end{aligned}$$
(17)
$$\begin{aligned}&\forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T} < \mathtt {T'} \left( \mathrm {VOM}_{\mathtt {T'}} \Longrightarrow \lnot (\mathrm {VOI}_{\mathtt {T}} ) \right) \end{aligned}$$
(18)
$$\begin{aligned}&\forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T} < \mathtt {T'} \left( \mathrm {VOI}_{\mathtt {T}}\Longrightarrow \lnot (\mathrm {EOM}_{\mathtt {T'}}) \right) \end{aligned}$$
(19)
$$\begin{aligned}&\forall \mathtt {T} \in \{\mathtt {N, O} \}, \mathtt {T} < \mathtt {T'} \left( \mathrm {EOM}_{\mathtt {T'}} \Longrightarrow \lnot (\mathrm {VOI}_{\mathtt {T}} ) \right) . \end{aligned}$$
(20)

Item (e) of the lemma can be restated as follows. If a metric satisfies VOM for a certain scale type, then it does not satisfy EOM for any scale type, and vice-versa:

$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}&\left( \mathrm {VOM}_{\mathtt {T}}\Longrightarrow \lnot (\mathrm {EOM}_{\mathtt {T'}})\right) \end{aligned}$$
(21)
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}&\left( \mathrm {EOM}_{\mathtt {T'}}\Longrightarrow \lnot (\mathrm {VOM}_{\mathtt {T}})\right) . \end{aligned}$$
(22)

Item (g) of the lemma can be restated as follows. If a metric satisfies EOI for any scale type, then it does not satisfy VOM for any scale type, and vice-versa:

$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}&\left( \mathrm {EOI}_{\mathtt {T}} \Longrightarrow \lnot (\mathrm {VOM}_{\mathtt {T'}} ) \right) \end{aligned}$$
(23)
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}&\left( \mathrm {VOM}_{\mathtt {T'}} \Longrightarrow \lnot (\mathrm {EOI}_{\mathtt {T}} ) \right) . \end{aligned}$$
(24)

Item (h) of the lemma can be restated as follows. If a metric satisfies EOM for a certain scale type, then it does not satisfy EOI for lower scale types, and vice-versa:

$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T'}< \mathtt {T}&\left( \mathrm {EOM}_\mathtt {T} \Longrightarrow \lnot (\mathrm {EOI}_{\mathtt {T'}}) \right) \end{aligned}$$
(25)
$$\begin{aligned} \forall \mathtt {T, T'} \in \{\mathtt {N, O, I, R} \}, \mathtt {T'} < \mathtt {T}&\left( \mathrm {EOI}_\mathtt {T'} \Longrightarrow \lnot (\mathrm {EOM}_{\mathtt {T}}) \right) . \end{aligned}$$
(26)

Another result is that knowing that a metric fits into one metric class ensures that it does not fit into any other class, with only one exception. The unification between value-oriented metrics for the interval and ratio scale types is due to the fact that the concepts of closeness for those two scale types are equivalent (see Table 5).

Corollary 2

(Compatibility of value-oriented interval and ratio) Under our definition of metrics, there exists only one case in which a metric can be classified in more than one class (see Table 7): the value-oriented metrics for the interval and ratio scale types (classes 5 and 7 in the table).

6 Basic axioms in the literature

In this section we state some theorems that prove that the basic axioms proposed in the literature for classification, clustering, and ranking, i.e., GMON (Generalized Strict Monotonicity Axiom), GHC (Generalized Homogeneity/Completeness), and PRI (Priority Axiom) (see Sect. 2), can be derived in our framework. As anticipated in Sect. 4.5, we focus on the basic axioms that we have identified in Sect. 2 (i.e., the axioms marked with (*) and shown in italics in the tables in Sect. 2), leaving aside the task-dependent axioms. As noted in Sect. 2.4, there are no axiomatics, and no basic axioms, for quantitation, at least in the context of information access evaluation: it is therefore left out of this section and analyzed in Sect. 8.1.

6.1 Classification: GMON is equivalent to \(\hbox {VOM}_ \mathtt {N}\)

As discussed in Sect. 2.1, the common axiom that can be applied to any classification task is the Strict Monotonicity Axiom (MON). It states that if \(\sigma\) and \(\sigma ^{\prime }\) are two classifiers and \(\alpha\) is the ground truth, all of them over a set of documents \({\mathcal {D}}\), and if \(\sigma\) and \(\sigma ^{\prime }\) differ only on a single document \(d \in {\mathcal {D}}\), which is correctly classified by \(\sigma\) and wrongly by \(\sigma ^{\prime }\), then the metric value must be higher for \(\sigma\). More formally, if

$$\begin{aligned} \exists ! d \in {\mathcal {D}} \Big ( \forall d^{\prime } \in {\mathcal {D}} \setminus \{d\} \big ( \sigma (d^{\prime }) = \sigma ^{\prime }(d^{\prime })\big ) \wedge \alpha (d) = \sigma (d)\ne \sigma ^{\prime }(d)\Big ) \end{aligned}$$

then \({\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha )\).

This definition requires that \(\sigma\) and \(\sigma ^{\prime }\) return the same result for all documents different from d. However, this is not strictly necessary: we can slightly generalise MON by requiring that both systems are correct or wrong, with respect to the gold, on the same documents, except for d.

Axiom 1

(GMON, generalized strict monotonicity axiom) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments. If every document with an error in \(\sigma\) is also an error in \(\sigma ^{\prime }\), and there exists an error in \(\sigma ^{\prime }\) which is not an error in \(\sigma\), i.e.,

$$\begin{aligned}&\forall d \in {{\mathcal {D}}}\big (\alpha (d)=\sigma (d) \vee \alpha (d)\ne \sigma ^{\prime }(d)\big ) ~\wedge \\&\exists d \in {\mathcal {D}}\big (\alpha (d)=\sigma (d)\ne \sigma ^{\prime }(d)\big ) \end{aligned}$$

then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):

$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha )>{\mathcal {M}}(\sigma ^{\prime },\alpha ). \end{aligned}$$

When there are only two classes (only two possible values in \(\sigma\) and \(\alpha\)), GMON and MON are equivalent, given that if \(\sigma (d)\ne \alpha (d)\) and \(\sigma ^{\prime }(d)\ne \alpha (d)\) then \(\sigma (d)=\sigma ^{\prime }(d)\).
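The GMON premise can be checked mechanically; the following sketch (with made-up assignments) is a direct, illustrative transcription of the two conditions of Axiom 1:

```python
def gmon_premise(alpha, sigma, sigma_p):
    """True iff every error of sigma is also an error of sigma_p,
    and sigma_p has at least one error that sigma does not have."""
    every_error_shared = all(a == s or a != sp
                             for a, s, sp in zip(alpha, sigma, sigma_p))
    extra_error = any(a == s != sp
                      for a, s, sp in zip(alpha, sigma, sigma_p))
    return every_error_shared and extra_error

alpha   = ["cat", "dog", "dog", "cat"]
sigma   = ["cat", "dog", "cat", "cat"]   # one error (third document)
sigma_p = ["cat", "cat", "cat", "cat"]   # same error, plus one more
print(gmon_premise(alpha, sigma, sigma_p))  # True: GMON then requires
                                            # M(sigma) > M(sigma_p)
```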

We can now prove the following theorem.

Theorem 1

(\(\hbox {VOM}_\mathtt {N}\) and GMON) The VOM property for the nominal scale type (\(\hbox {VOM}_\mathtt {N}\)) and the Generalized Strict Monotonicity (GMON) axiom are equivalent.

6.2 Clustering: GHC is equivalent to \(\hbox {EOM}_\mathtt {N}\)

As mentioned in Sect. 2.2, the basic clustering axioms Homogeneity and Completeness can be generalized into a unique GHC axiom, which can be formalized as follows.

Axiom 2

(GHC, generalized homogeneity/completeness) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments. If (i) for each document pair, whenever \(\sigma ^{\prime }\) is correct then \(\sigma\) is also correct, i.e.,

$$\begin{aligned}&\forall d_i, d_j \in {\mathcal {D}} \Big ( \left( \sigma ^{\prime }(d_i)=\sigma ^{\prime }(d_j)\wedge \alpha (d_i)=\alpha (d_j)\Longrightarrow \sigma (d_i)=\sigma (d_j)\right) ~\wedge \nonumber \\&\quad \left( \sigma ^{\prime }(d_i) \ne \sigma ^{\prime }(d_j)\wedge \alpha (d_i) \ne \alpha (d_j) \Longrightarrow \sigma (d_i)\ne \sigma (d_j)\right) \Big ), \end{aligned}$$
(27)

and (ii) there exists at least one document pair \(d_1\), \(d_2\) such that \(\sigma\) adds to \(\sigma ^{\prime }\) a correct relation, i.e., \(\sigma\) is correct and \(\sigma ^{\prime }\) is not:

$$\begin{aligned} \exists d_1, d_2 \in {\mathcal {D}} \Big (&\big (\sigma ^{\prime }(d_1)=\sigma ^{\prime }(d_2)\wedge \alpha (d_1)\ne \alpha (d_2)\wedge \sigma (d_1)\ne \sigma (d_2)\big ) ~\vee \nonumber \\&\big (\sigma ^{\prime }(d_1)\ne \sigma ^{\prime }(d_2)\wedge \alpha (d_1)=\alpha (d_2)\wedge \sigma (d_1)=\sigma (d_2)\big ) \Big ) \end{aligned}$$
(28)

then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):

$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha ) > {\mathcal {M}}(\sigma ^{\prime },\alpha ). \end{aligned}$$
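Conditions (27) and (28) quantify over document pairs and are easy to transcribe; the following sketch (an illustration with made-up clusterings, where only value equalities matter) checks the GHC premise:

```python
from itertools import combinations

def ghc_premise(alpha, sigma, sigma_p):
    """True iff sigma preserves every pairwise relation that sigma_p
    gets right (condition 27) and adds at least one correct relation
    (condition 28)."""
    pairs = list(combinations(range(len(alpha)), 2))
    cond27 = all(
        (not (sigma_p[i] == sigma_p[j] and alpha[i] == alpha[j])
         or sigma[i] == sigma[j]) and
        (not (sigma_p[i] != sigma_p[j] and alpha[i] != alpha[j])
         or sigma[i] != sigma[j])
        for i, j in pairs)
    cond28 = any(
        (sigma_p[i] == sigma_p[j] and alpha[i] != alpha[j]
         and sigma[i] != sigma[j]) or
        (sigma_p[i] != sigma_p[j] and alpha[i] == alpha[j]
         and sigma[i] == sigma[j])
        for i, j in pairs)
    return cond27 and cond28

alpha   = [1, 1, 2, 2]      # gold partition
sigma   = [7, 7, 8, 8]      # same partition as the gold
sigma_p = [7, 7, 7, 8]      # wrongly merges the third document
print(ghc_premise(alpha, sigma, sigma_p))  # True: GHC then requires
                                           # M(sigma) > M(sigma_p)
```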

The following theorem states that the GHC and the EOM properties for the nominal scale type are equivalent.

Theorem 2

(\(\hbox {EOM}_\mathtt {N}\) and GHC) The EOM property for the nominal scale type (\(\hbox {EOM}_\mathtt {N}\)) and the Generalized Homogeneity and Completeness (GHC) axiom are equivalent.

6.3 Ranking: PRI is equivalent to \(\hbox {EOM}_\mathtt {O}\)

We have seen in Sect. 2.3 that the basic axiom Swapping appears in most axiomatics and that it can be generalized as the Priority axiom (PRI). PRI states that swapping two contiguous documents into the order indicated by the gold necessarily increases the score. We formalize it as follows.

Axiom 3

(PRI, priority axiom) Let \(\alpha\), \(\sigma\), and \(\sigma ^{\prime }\) be three assignments such that \(d_i\) and \(d_j\) have contiguous values at scale type \(\mathtt {O}\) in both \(\sigma\) and \(\sigma ^{\prime }\). If:

$$\begin{aligned} \exists i, j \Big (&\alpha (d_i)>\alpha (d_j)\wedge \sigma (d_i)>\sigma (d_j) \wedge \sigma ^{\prime }(d_i)<\sigma ^{\prime }(d_j) ~\wedge \nonumber \\&\forall k,l\ne i,j \big ( \sigma (d_k)>\sigma (d_l)\Leftrightarrow \sigma ^{\prime }(d_k)>\sigma ^{\prime }(d_l) \big ) \Big ) \end{aligned}$$
(29)

then the metric value must be higher for \(\sigma\) than for \(\sigma ^{\prime }\):

$$\begin{aligned} {\mathcal {M}}(\sigma ,\alpha ) > {\mathcal {M}}(\sigma ^{\prime },\alpha ). \end{aligned}$$

We can prove the following theorem.

Theorem 3

(\(\hbox {EOM}_\mathtt {O}\) and PRI) The EOM property for the ordinal scale type (\(\hbox {EOM}_\mathtt {O}\)) and the Priority (PRI) axiom are equivalent.

In other words, our definition of metrics captures the basic axiom for metrics in ranking tasks. We now turn to analyse the implications for specific tasks and metrics.

7 Metrics analysis

In this section, we analyse existing metrics in terms of our theoretical framework. We have the twofold aim of (i) showing that existing abstract tasks and the corresponding metrics are explained by our framework, and (ii) showing that the theoretical limitations of metrics that have been identified in the literature (i.e., metrics that do not satisfy basic axioms) are captured in our framework as well. As noted in the previous section, we postpone the quantitation case to Sect. 8.1.

Table 8 summarises the analysis carried out in this section. The columns represent the metrics categorised by abstract tasks; the rows represent the properties VOI, VOM, EOI, and EOM, at different scale types. Circles indicate that a property is satisfied; black circles emphasize that both properties (from the corresponding definition of evaluation metric) are satisfied. As the table shows, only the metrics in the corresponding category satisfy our definition of metric. In addition, for each abstract task category, there exist metrics that are not able to satisfy both properties. This does not mean that these metrics are absolutely useless: for instance, Purity and Inverse Purity in clustering, or P@N in IR, have the advantage of being easy to interpret. However, the evaluation results have to be analysed carefully to prevent misinterpretations of systems’ quality. We will see in this section that these cases correspond to theoretical metric drawbacks previously reported in the literature, with practical implications.

Table 8 Metric analysis for classification, clustering, and ranking abstract tasks

7.1 Classification: value-oriented, nominal

We define classification metrics as follows.

Definition 9

(Classification metric) A classification metric is a value-oriented metric for the nominal scale type.

According to our aim (i) above, we want to show that metrics satisfying VOI and VOM for the nominal scale type are those used for classification problems in the literature. Most of the metrics used in classification tasks are combinations of the contingency matrix components, i.e., the counts of true and false, positive and negative samples. Formally, the contingency matrix can be defined as follows.

Definition 10

(Contingency matrix) Given a system output \(\sigma\) and a gold \(\alpha\), both of them functions over a finite set of values \({\mathcal {V}}\), the contingency matrix \(C:{\mathcal {V}}\times {\mathcal {V}}\rightarrow {\mathbb {N}}\) is:

$$\begin{aligned} C_{\sigma ,\alpha }(x,y)={{\,\mathrm{card}\,}}\big (\{d\in {\mathcal {D}}~|~\sigma (d)=x \wedge \alpha (d)=y\}\big ). \end{aligned}$$

We can easily prove that \(C(x,y)\) is invariant across the permissible transformation functions for the nominal scale type \({\mathcal {F}}_\mathtt {N}\) (i.e., the bijective functions, see Sect. A.2 in Appendix A) applied to \(\sigma\) and \(\alpha\). In other words, changing the names of the categories without merging them (as bijective functions do) in both the gold and the system output does not affect the contingency matrix. That is, for any bijective function \(f_b\),

$$\begin{aligned} C_{\sigma ,\alpha }(x,y)=C_{f_b(\sigma ),f_b(\alpha )}(f_b(x),f_b(y)). \end{aligned}$$
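A minimal sketch of Definition 10 and of this invariance (the data are made up):

```python
from collections import Counter

def contingency(sigma, alpha):
    """C(x, y) = number of documents with system value x and gold value y."""
    return Counter(zip(sigma, alpha))

sigma = ["spam", "ham", "spam", "ham", "spam"]
alpha = ["spam", "ham", "ham", "ham", "spam"]

f = {"spam": "trash", "ham": "keep"}   # a bijective renaming of categories
C  = contingency(sigma, alpha)
Cf = contingency([f[s] for s in sigma], [f[a] for a in alpha])

# Invariance: the renamed matrix is the original one with renamed coordinates.
assert all(Cf[(f[x], f[y])] == C[(x, y)] for (x, y) in C)
```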

Therefore, we can state the following theorem.

Theorem 4

(\(\hbox {VOI}_{\mathtt {N}}\) and contingency matrix) Any function over the elements in the contingency matrix satisfies the VOI property for the nominal scale type (\(\hbox {VOI}_{\mathtt {N}}\)).

Table 8 includes some metrics used in classification tasks, such as Accuracy, Macro-Average Accuracy, F-measure, Odds ratio, and Lam%. All of them are computed from the contingency matrix; therefore, according to the previous theorem, they satisfy \(\hbox {VOI}_{\mathtt {N}}\). According to Lemma 2(a) they also satisfy VOI for all the higher scale types.

We can prove that some common metrics applied in classification satisfy \(\hbox {VOM}_{\mathtt {N}}\) (and, given Theorem 1, GMON).

Theorem 5

(\(\hbox {VOM}_{\mathtt {N}}\) and classification metrics) The metrics Accuracy, Macro Average Accuracy and Phi Correlation satisfy the VOM property for the nominal scale type (\(\hbox {VOM}_{\mathtt {N}}\)).

Given that these metrics also satisfy \(\hbox {VOI}_{\mathtt {N}}\) (according to Theorem 4), we can state that they fit into our definition of value-oriented metrics for the nominal scale type, and therefore, they are classification metrics.

We now turn to our aim (ii) above. The main theoretical drawbacks of classification metrics are related to MON. For instance, according to Sebastiani (2015), the F-measure (computed as the harmonic mean of precision and recall) does not satisfy MON when there is a zero value in one component of the contingency matrix. According to Qi et al. (2010), the classification metric Lam% has the same drawback when zero values appear in the contingency matrix, and something similar happens with the Odds ratio. This problem can be solved by considering the contingency table as a probabilistic distribution and applying smoothing techniques (Amigó et al. 2018), but this is not the focus of this paper. Given that these metrics do not satisfy MON, they do not satisfy GMON either, and therefore they do not satisfy \(\hbox {VOM}_{\mathtt {N}}\). Mutual Information (MI) is another metric used in classification. However, we can assert that it does not satisfy GMON (and \(\hbox {VOM}_\mathtt {N}\)), given that according to MI an output achieves the highest score even if the label names are permuted. We will see shortly that MI is a clustering metric.

The second and third columns of Table 8 summarize the properties satisfied by these metrics according to the implications derived from Lemma 2. In general, the main result of this analysis is that the metrics used in classification tasks fit into our definition, while the main limitations of metrics such as F-measure, Lam%, or Odds ratio identified in the literature are also captured by our model.

7.2 Clustering: equivalence-oriented, nominal

We define clustering metrics as follows.

Definition 11

(Clustering metric) A clustering metric is an equivalence-oriented metric for the nominal scale type.

We aim to show that this definition actually captures the existing clustering metrics and their desirable basic constraints. We start by noting that existing clustering metrics such as Purity and Inverse Purity, BCubed Precision and Recall, Entropy and Class Entropy, F-measure, or clustering metrics based on counting pairs, are all functions over a set partition. A partition is the standard set concept: a decomposition of a set into subsets such that their pairwise intersections are empty and their union is the original set. A function over a partition is any function that takes the original set and its partition as input parameters. We can now state as a theorem that all the above metrics satisfy \(\hbox {EOI}_\mathtt {N}\).

Theorem 6

(\(\hbox {EOI}_{\mathtt {N}}\) and functions over a partition) Any function over a set partition satisfies the EOI property for the nominal scale type (\(\hbox {EOI}_{\mathtt {N}}\)).

Checking whether each metric proposed in the literature satisfies \(\hbox {EOM}_{\mathtt {N}}\) (or GHC, which is equivalent according to Theorem 2) is too complex to be included in this paper. For this reason, we will analyse the existing metrics by using the categorisation proposed by Amigó et al. (2009), and we will check to what extent each category can produce metrics that satisfy EOM and EOI at the nominal scale type.

A first category is that of counting-pairs based metrics [e.g., Rand Statistic, Jaccard Coefficient, or Fowlkes and Mallows (Meila 2003; Halkidi et al. 2001)], which count how many pairs correspond to the gold in terms of same/different cluster: correct relationships between pairs of items increase the score. This principle directly matches the GHC conditions, and GHC is equivalent to \(\hbox {EOM}_{\mathtt {N}}\). The metric BCubed (Amigó et al. 2009) also increases with the number of document pairs that are consistent with the gold, and it also satisfies GHC. Therefore, according to Theorem 2, we can state that metrics based on counting pairs fit into our definition of clustering metric.

A second category is that of entropy based metrics. Some examples are Entropy and Class Entropy (Wu et al. 2003), Variation of Information (Meila 2003), Mutual Information (Xu et al. 2003), and V-measure (Rosenberg and Hirschberg 2007). Proving that all those metrics satisfy EOM and EOI requires too much analysis for this paper: we focus on Entropy and Class Entropy only. Given a set of documents \({\mathcal {D}}\), a ground truth clustering \(\alpha\) and a clustering \(\sigma\), the average Entropy of clusters is computed as (Wu et al. 2003)

$$\begin{aligned} \mathrm {E}(\sigma ,\alpha )&= -\smash [b]{\sum _{c\in {\mathcal {V}}(\sigma )}} \Big ( P(\sigma (d)=c) \cdot \\&\quad \smash [b]{\sum _{l\in {\mathcal {V}}(\alpha )}} \Big ( P(\alpha (d)=l~|~\sigma (d)=c )\cdot \log (P(\alpha (d)=l ~|~\sigma (d)=c))\Big ) \Big ), \end{aligned}$$

where the probabilities \(P\) are computed, as usual, by frequency counts over the documents \(d \in {\mathcal {D}}\); \({\mathcal {V}}(\alpha )\) is the set of different values generated by the gold assignment \(\alpha\), and \({\mathcal {V}}(\sigma )\) is the set of values generated by the system assignment \(\sigma\). The Class Entropy is defined as

$$\begin{aligned} \mathrm {CE}(\sigma ,\alpha )&=-\smash [b]{\sum _{l \in {\mathcal {V}}(\alpha )}} \Big ( P(\alpha (d)=l)\cdot \\&\quad \smash [b]{\sum _{c \in {\mathcal {V}}(\sigma )}} \Big ( P(\sigma (d)=c~|~\alpha (d)=l )\cdot \log (P(\sigma (d)=c ~|~\alpha (d)=l)) \Big ) \Big ). \end{aligned}$$
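For concreteness, the following sketch is a direct (and purely illustrative) transcription of the two formulas, with probabilities estimated by frequency counts; note that lower values are better:

```python
from collections import Counter
from math import log

def entropy(sigma, alpha):
    """Average cluster Entropy E(sigma, alpha); lower is better."""
    n = len(sigma)
    e = 0.0
    for c, n_c in Counter(sigma).items():
        joint = Counter(a for s, a in zip(sigma, alpha) if s == c)
        e -= (n_c / n) * sum((n_lc / n_c) * log(n_lc / n_c)
                             for n_lc in joint.values())
    return e

def class_entropy(sigma, alpha):
    """Class Entropy CE(sigma, alpha): swap the roles of system and gold."""
    return entropy(alpha, sigma)

alpha = [1, 1, 1, 2, 2, 3]          # gold clustering
sigma = [7, 7, 8, 8, 9, 9]          # hypothetical system clustering
print(entropy(sigma, alpha), class_entropy(sigma, alpha))
```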

Notice that the evaluation score is inversely correlated with the entropy values. Then we can prove the following theorem.

Theorem 7

(\(\hbox {EOM}_{\mathtt {N}}\) and entropy metrics) Entropy and Class Entropy satisfy the EOM property for the nominal scale type (\(\hbox {EOM}_{\mathtt {N}}\)).

The conclusion is that BCubed, metrics based on counting pairs, and entropy based metrics are able to satisfy \(\hbox {EOI}_\mathtt {N}\) and \(\hbox {EOM}_\mathtt {N}\); therefore, they fit into our definition of clustering metric. However, not all metrics used in clustering tasks fit into our definition. In particular, metrics based on set matching such as F-measure [notice that F-measure has a different meaning in the context of clustering (Amigó et al. 2009)] or Purity and Inverse Purity do not satisfy Completeness (Amigó et al. 2009). Therefore, they do not satisfy GHC and, according to Theorem 2, they do not satisfy \(\hbox {EOM}_\mathtt {N}\) either. The reason is that they assume a certain correspondence between system output and gold clusters, and this produces a lack of sensitivity in some cases.

The fourth and fifth columns in Table 8 illustrate the properties satisfied by these metrics according to the implications derived from Lemma 2. In summary, we can say that most metrics used in clustering fit into our definition, and that our framework captures the limitations described in the literature.

As we mentioned before, there exist other axioms that are not captured by our definition, but they are task-dependent: for instance, the range of values (Meila 2003), the ability to join single clusters into a rag bag cluster (Amigó et al. 2009), or the robustness to the overweighting of big clusters due to the combinatorial explosion of document pairs.

7.3 Ranking: equivalence-oriented, ordinal

We define ranking metrics as follows.

Definition 12

(Ranking metric) A ranking metric is an equivalence-oriented metric for the ordinal scale type.

Let us see that ranking metrics satisfy \(\hbox {EOI}_\mathtt {O}\) and \(\hbox {EOM}_\mathtt {O}\). First consider the obvious fact that ranking metrics compare the system output ranking against the gold. By definition, a ranking is invariant across strictly increasing functions (the permissible transformation functions for the ordinal scale type, \({\mathcal {F}}_\mathtt {O}\)). Therefore, we can state the following theorem.

Theorem 8

(\(\hbox {EOI}_{\mathtt {O}}\) and ranking metrics) Metrics that compare rankings with a gold standard satisfy the EOI property for the ordinal scale type (\(\hbox {EOI}_{\mathtt {O}}\)).

Concerning \(\hbox {EOM}_\mathtt {O}\), we have seen in Theorem 3 that it is equivalent to the Priority axiom (PRI). According to Amigó et al. (2013), many metrics applied in ranking problems actually satisfy PRI. Therefore, we can state the following theorem.

Theorem 9

(\(\hbox {EOM}_\mathtt {O}\) and ranking metrics) The metrics MAP, DCG, nDCG, RBP, ERR, and ordinal correlation coefficients such as Kendall or Spearman satisfy the EOM property for the ordinal scale type (\(\hbox {EOM}_\mathtt {O}\)).

Therefore, many of the metrics used in ranking problems actually fit our definition. This is the case of metrics such as MAP, DCG, nDCG, RBP, ERR, and also ordinal correlation coefficients such as Kendall or Spearman. However, these last two coefficients are normally not used in IR, since the collection contains a huge number of documents that will never be explored by the user. Indeed, IR evaluation metrics, besides satisfying the Priority axiom, also give more weight to the top of the ranking returned by the system. This is captured for example by Moffat’s properties Convergence and Top-weightedness (2013), or by Amigó et al.’s Deepness, Closeness Threshold, and Deepness Threshold (2013) (see Table 3). These properties are not captured in our framework and have to be added explicitly if needed; this is usually the case, although, as we already discussed in Sect. 2.3, one might imagine a ranking task where the top-weightedness property is undesirable: for instance, if we need to rank a set of documents, and we know that the user will explore all the documents anyway.

However, not every metric used in ranking satisfies the Priority axiom (Amigó et al. 2013). This is the case of metrics such as Precision at N, Recall at N, or Mean Reciprocal Rank: P@N and R@N do not consider the order of documents before position N, and MRR does not consider the order of documents after the first relevant one. Therefore, according to Theorem 3 we can infer that they do not satisfy \(\hbox {EOM}_\mathtt {O}\): they do not fit into our definition of ranking metrics. Table 8 (last two columns) illustrates the ranking metrics.
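To see the problem concretely, the following sketch (with hypothetical data) builds two rankings that differ only by swapping two contiguous documents into the order indicated by a binary gold; PRI requires the first ranking to score higher, but P@3 ties them:

```python
def precision_at(n, ranked, relevant):
    """Fraction of relevant documents among the first n retrieved."""
    return sum(d in relevant for d in ranked[:n]) / n

relevant = {"d1", "d2"}                 # binary gold
sigma   = ["d1", "d2", "d3", "d4"]      # gold-consistent order at the top
sigma_p = ["d1", "d3", "d2", "d4"]      # d2 and d3 swapped against the gold

# PRI requires M(sigma) > M(sigma_p), but P@3 cannot tell them apart.
print(precision_at(3, sigma, relevant),
      precision_at(3, sigma_p, relevant))   # 0.667 0.667
```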

In summary, again, most of the metrics used in ranking problems fit into our definition, with some exceptions whose limitations have been discussed in the literature.

8 Other tasks and metrics

The previous two sections show how the classical abstract tasks (classification, clustering, and ranking) and their metrics can be modeled in our framework. In this section we aim to demonstrate the generality of the framework, showing how it adapts to other information access tasks as well, by (i) analyzing the metrics for quantitation, and (ii) showing that the framework leads to Ordinal Classification, which has not yet been studied from an axiomatic perspective.

8.1 Quantitation: value- and equivalence-oriented, interval and ratio

The proposed framework gives us the opportunity of modeling tasks at the higher scale types \(\mathtt {I}\) and \(\mathtt {R}\). We now analyze the four quantitation variants shown in Table 7.

8.1.1 Quantitation-1: Value-oriented, interval

Let us analyse the effect of applying the metric definition at the interval scale type. A value-oriented metric for the interval scale type must be invariant under linear affine transformations, and it must increase as the assignment gets value-closer to the gold standard. Let us consider the widely used Mean Absolute Error (MAE), computed as the average absolute difference between the \(\sigma\) and \(\alpha\) values:

$$\begin{aligned} {{\,\mathrm{\text {MAE}}\,}}(\sigma ,\alpha )={{\,\mathrm{\text {Avg}}\,}}_{d\in {{\mathcal {D}}}}|\alpha (d)-\sigma (d)|. \end{aligned}$$

Notice that the definition of this metric directly matches the definition of closeness for these scale types. This metric satisfies VOM for both the interval and ratio scale types. However, as it is, it does not fit into the definition of value-oriented metric, since \(\mathrm {VOI}_{\mathtt {I}}\) (as well as \(\mathrm {VOI}_{\mathtt {R}}\)) does not hold. For instance, transforming both assignments by multiplying them by a constant factor (a permissible transformation function) affects the average difference. However, the average error always has to be expressed in terms of a unit: a MAE of 2 when measuring temperature makes no sense; one should say a MAE of 2 degrees centigrade. That is, we need to incorporate a unit \(|\alpha (d_0)-\alpha (d^{\prime }_0)|\) which depends on empirical observations over a fixed pair of objects (e.g., a degree centigrade is a hundredth of the temperature difference between freezing and boiling water). Then, applying a transformation also implies transforming the unit, and in this way the average error is invariant for the interval scale type. We can define MAE with a reference difference as:

$$\begin{aligned} {{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RD}}(\sigma ,\alpha )={{\,\mathrm{\text {Avg}}\,}}_{d\in {{\mathcal {D}}}}\left( \frac{|\alpha (d)-\sigma (d)|}{|\alpha (d_0)-\alpha (d^{\prime }_0)|}\right) . \end{aligned}$$

After this definition, we can state the following theorem.

Theorem 10

(\(\hbox {VOI}_\mathtt {I}\), \(\hbox {VOM}_\mathtt {I}\) and \({{\,\mathrm{\text {MAE}}\,}}\)) The Mean Absolute Error with a reference difference is a value-oriented metric for the interval scale type, i.e., \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RD}}\) satisfies \(\hbox {VOI}_\mathtt {I}\) and \(\hbox {VOM}_\mathtt {I}\).

8.1.2 Quantitation-2: Value-oriented, ratio

In the ratio scale type, however, we do not need a reference difference: a single reference object is enough (a meter, for instance, has been defined in terms of a prototype meter bar). We can define the Mean Absolute Error with a reference unit as:

$$\begin{aligned} {{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RU}}(\sigma ,\alpha ) ={{\,\mathrm{\text {Avg}}\,}}_{d\in {{\mathcal {D}}}}\left( \frac{|\alpha (d)-\sigma (d)|}{|\alpha (d_0)|}\right) . \end{aligned}$$

This can be applied to ratio scaled dimensions such as length or speed. Now we can prove the following theorem.

Theorem 11

(\(\hbox {VOI}_\mathtt {R}\), \(\hbox {VOM}_\mathtt {R}\) and \({{\,\mathrm{\text {MAE}}\,}}\)) The Mean Absolute Error with a reference unit is a value-oriented metric for the ratio scale type, i.e., \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RU}}\) satisfies \(\hbox {VOI}_\mathtt {R}\) and \(\hbox {VOM}_\mathtt {R}\).
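Both theorems are easy to check numerically. The following sketch (with illustrative values, and taking \(d_0\) and \(d^{\prime }_0\) to be the first two documents) verifies the invariance half of each: \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RD}}\) is unaffected by a linear affine transformation of both assignments, \({{\,\mathrm{\text {MAE}}\,}}_{\mathrm {RU}}\) by a scaling, while plain MAE is not:

```python
def mae(sigma, alpha):
    return sum(abs(a - s) for a, s in zip(alpha, sigma)) / len(alpha)

def mae_rd(sigma, alpha):
    """MAE with a reference difference: invariant under affine maps (I)."""
    return mae(sigma, alpha) / abs(alpha[0] - alpha[1])

def mae_ru(sigma, alpha):
    """MAE with a reference unit: invariant under scalings (R)."""
    return mae(sigma, alpha) / abs(alpha[0])

alpha = [10.0, 20.0, 30.0]
sigma = [12.0, 19.0, 33.0]

affine = lambda x: 1.8 * x + 32       # permissible for I (e.g., C to F)
scale  = lambda x: 100.0 * x          # permissible for R (e.g., m to cm)

assert abs(mae_rd(list(map(affine, sigma)), list(map(affine, alpha)))
           - mae_rd(sigma, alpha)) < 1e-9        # VOI_I holds for MAE_RD
assert abs(mae_ru(list(map(scale, sigma)), list(map(scale, alpha)))
           - mae_ru(sigma, alpha)) < 1e-9        # VOI_R holds for MAE_RU

# Plain MAE changes under the transformation: it is not VOI_I on its own.
print(mae(sigma, alpha),
      mae(list(map(affine, sigma)), list(map(affine, alpha))))  # 2.0 3.6
```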

8.1.3 Quantitation-3: Equivalence-oriented, interval

Let us consider the behaviour of an equivalence-oriented metric for the interval scale type. First, it should be invariant under linear affine transformations of the system output. In addition, the metric value must increase if, for every linear affine transformation applied to one system output, we can find a transformation of the other system output that makes it value-closer to the gold. This is the case of the traditional Pearson correlation coefficient, which is defined as:

$$\begin{aligned} {{\,\mathrm{CORR}\,}}(\sigma ,\alpha )= \frac{\sum _i (\sigma (i)-{{\,\mathrm{\text {Avg}}\,}}{(\sigma )})(\alpha (i)-{{\,\mathrm{\text {Avg}}\,}}{(\alpha )})}{\sqrt{\sum _i (\sigma (i)-{{\,\mathrm{\text {Avg}}\,}}{(\sigma )})^2}\sqrt{\sum _i (\alpha (i)-{{\,\mathrm{\text {Avg}}\,}}{(\alpha )})^2}}. \end{aligned}$$

We can now prove the following theorem.

Theorem 12

(\(\hbox {EOI}_\mathtt {I}\), \(\hbox {EOM}_\mathtt {I}\) and Pearson) The Pearson correlation coefficient is an equivalence-oriented metric for the interval scale type, i.e., it satisfies \(\hbox {EOI}_\mathtt {I}\) and \(\hbox {EOM}_\mathtt {I}\).
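The invariance half of the theorem can be verified empirically; in the following illustrative sketch (with arbitrary values), transforming the system output alone by an increasing linear affine function leaves the coefficient unchanged:

```python
import numpy as np

def pearson(sigma, alpha):
    s, a = np.asarray(sigma, float), np.asarray(alpha, float)
    s, a = s - s.mean(), a - a.mean()
    return float((s @ a) / (np.sqrt(s @ s) * np.sqrt(a @ a)))

alpha = [3.0, 1.0, 4.0, 1.0, 5.0]
sigma = [2.0, 1.5, 3.0, 1.0, 4.0]

# EOI_I: an affine transformation of sigma ONLY does not change the score.
assert abs(pearson([2.5 * x - 7 for x in sigma], alpha)
           - pearson(sigma, alpha)) < 1e-9
print(pearson(sigma, alpha))
```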

8.1.4 Quantitation-4: Equivalence-oriented, ratio

Finally, we can also find equivalence-oriented metrics at the ratio scale type. The most popular metric in this category is probably the cosine distance, defined as (where \(\overrightarrow{\sigma }=\langle \sigma (i_1),\ldots ,\sigma (i_n)\rangle\) and \(\overrightarrow{\alpha }=\langle \alpha (i_1),\ldots ,\alpha (i_n)\rangle\)):

$$\begin{aligned} {{\,\mathrm{\text {COS}}\,}}(\sigma ,\alpha )=\frac{\overrightarrow{\sigma }\cdot \overrightarrow{\alpha }}{\Vert \overrightarrow{\sigma } \Vert \cdot \Vert \overrightarrow{\alpha } \Vert }. \end{aligned}$$

The strength of this similarity criterion is that it is not affected by proportionality transformations of the assignments. According to the state of the art, the cosine distance is a good estimator of document similarity (in this case each assigned value is the frequency of a word in the document). We can prove the following theorem.

Theorem 13

(\(\hbox {EOI}_\mathtt {R}\), \(\hbox {EOM}_\mathtt {R}\) and cosine distance) Whenever assignment values are positive, the cosine distance is an equivalence-oriented metric for the ratio scale type, i.e., it satisfies \(\hbox {EOI}_\mathtt {R}\) and \(\hbox {EOM}_\mathtt {R}\).
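Again, the invariance half can be checked numerically; in this illustrative sketch (with positive, made-up values), scaling the system output alone by a positive constant leaves the score unchanged:

```python
import numpy as np

def cosine(sigma, alpha):
    s, a = np.asarray(sigma, float), np.asarray(alpha, float)
    return float((s @ a) / (np.linalg.norm(s) * np.linalg.norm(a)))

alpha = [1.0, 1.0, 2.0, 3.0]     # e.g., gold term frequencies
sigma = [2.0, 1.0, 2.0, 4.0]     # positive system values

# EOI_R: scaling sigma ONLY by a positive constant does not change the score.
assert abs(cosine([42.0 * x for x in sigma], alpha)
           - cosine(sigma, alpha)) < 1e-12
print(cosine(sigma, alpha))
```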

Table 9 shows the properties satisfied by these metrics according to the above theorems and the implication relationships between properties stated in Lemma 2 (the last column of the table is discussed in the following).

Table 9 Metric analysis for the high scale types (\(\mathtt {I}\) and \(\mathtt {R}\)) and for value-oriented metrics at the ordinal scale type

8.2 Ordinal classification: Value-oriented, ordinal

In Sect. 2.3 we highlighted that the Ordinal Classification task has not yet been analysed in depth, although several evaluation campaigns match this task and several authors have analysed the most popular metrics and have made some proposals (Gaudette and Japkowicz 2009; Baccianella et al. 2009; Cardoso and Sousa 2011). Our framework leads to this problem when considering value-oriented metrics at the ordinal scale type, and provides two properties to be satisfied by metrics: \(\hbox {VOI}_{\mathtt {O}}\) and \(\hbox {VOM}_{\mathtt {O}}\). The first one states that the metric must be invariant under strictly increasing functions (the permissible transformation functions for the ordinal scale type) applied to both the system output and the gold: if we keep the relative order of gold and system values, the metric must return the same result. \(\hbox {VOM}_{\mathtt {O}}\) states that moving a value closer to the correct one must increase the system score.

Let us analyse the most popular metrics used in this task. On the one hand, the popular Accuracy metric satisfies invariance at the nominal scale type (\(\hbox {VOI}_{\mathtt {N}}\)) and, therefore, it is also invariant at the ordinal scale type, but it does not satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)): it does not capture closeness to the gold value. An attempt in the literature to solve this gap is Accuracy with n (Gaudette and Japkowicz 2009), which relaxes the range of values for a response to be accepted as matching. However, this solution does not solve the monotonicity problem for larger ordinal differences. Other authors proposed to use correlation coefficients, such as Pearson, Spearman, or Kendall; in particular, non-parametric ones such as Kendall and Spearman are invariant at the ordinal scale type, but they do not satisfy monotonicity, given that the maximum value of 1 can be achieved without returning the correct values. The Normalized Distance Performance Measure (NDPM) (Yao 1995) has the same behavior.

On the other hand, the Mean Absolute Error (MAE) and the Mean Square Error (MSE) have also been applied to this problem. They satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)) but at the cost of invariance, given that they take into account the interval distance between system and gold assigned values.

Let us focus on a specific example and consider two documents \(d_1\) and \(d_2\), a gold \(\alpha\), and four system outputs \(\sigma _1, \sigma _2, \sigma _3, \sigma _4\) with the values shown in Table 10. Notice that \(\sigma _1\) exactly matches the gold; \(\sigma _2\) does not hit the target values, but it keeps the correct ordering as well as the correct distance between the categories (they are adjacent in this case); \(\sigma _3\) keeps the correct ordering but its second value moves further away from the correct result. Finally, \(\sigma _4\) reflects neither the correct order of the values nor the correct values. Then, a metric should satisfy:

$$\begin{aligned} {\mathcal {M}}(\sigma _1,\alpha )>{\mathcal {M}}(\sigma _2,\alpha )>{\mathcal {M}}(\sigma _3,\alpha )>{\mathcal {M}}(\sigma _4,\alpha ). \end{aligned}$$

Accuracy is not able to discriminate among \(\sigma _2\), \(\sigma _3\) and \(\sigma _4\), given that all these system outputs fail on both target values. Therefore, value-oriented metrics for the nominal scale type are not adequate, since being closer to the real target should increase the score. Non-parametric correlation coefficients (Kendall, Spearman, etc.) do not discriminate among \(\sigma _1\), \(\sigma _2\) and \(\sigma _3\), given that all of them sort the documents in the correct way: it is not a ranking problem either, so equivalence-oriented metrics for the ordinal scale type are not appropriate. The linear correlation coefficient (Pearson), an equivalence-oriented metric at the interval scale type, shows the same non-discriminating effect as Spearman. In addition, it assumes that there is the same interval between each pair of adjacent categories. However, while we know that there are more categories between “Positive” and “Negative” than between “Positive” and “Neutral”, we cannot assert that the distance is double, as assumed by the Pearson coefficient. MAE and MSE would sort the systems in the correct manner, given that they satisfy monotonicity (\(\hbox {VOM}_{\mathtt {O}}\)). The limitation is that MAE and MSE are not invariant: they require assuming a fixed value for each category, and thus predefined intervals between categories.

Table 10 The example described in the text
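The following sketch makes these failure modes concrete. The values are our own, hypothetical reading of Table 10, on a five-level polarity scale (1 = Very Negative, ..., 5 = Very Positive), chosen only to be consistent with the description above; the paper's actual table values may differ:

```python
def accuracy(sigma, alpha):
    return sum(s == a for s, a in zip(sigma, alpha)) / len(alpha)

def mae(sigma, alpha):
    return sum(abs(s - a) for s, a in zip(sigma, alpha)) / len(alpha)

def same_order(sigma, alpha):
    """With two documents, ordinal coefficients only see the pair's order."""
    return (sigma[0] > sigma[1]) == (alpha[0] > alpha[1])

# Hypothetical Table 10 values (our assumption): 1..5 ordinal polarity scale.
alpha = (4, 3)                   # gold: Positive, Neutral (adjacent)
outputs = {
    "s1": (4, 3),                # exact match
    "s2": (5, 4),                # right order and right distance, shifted
    "s3": (5, 1),                # right order, second value farther off
    "s4": (1, 4),                # wrong order and wrong values
}
for name, s in outputs.items():
    print(name, accuracy(s, alpha), same_order(s, alpha), mae(s, alpha))
# Accuracy ties s2, s3, s4 (all 0); order agreement ties s1, s2, s3;
# only MAE recovers the desired ordering s1 > s2 > s3 > s4 (lower is better).
```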

We claim that in this situation one must use value-oriented metrics for the ordinal scale type. The goal consists of reducing the distance between the \(\sigma\) and \(\alpha\) values but, at the same time, the distance can be defined only in ordinal terms: the farther a prediction is from the target in ordinal terms, the more the system is penalized. The question is how to satisfy monotonicity and invariance simultaneously. We leave this open issue as future work.

9 Conclusions and future work

9.1 Summary

In this paper, we have defined a theoretical framework that explains the nature of evaluation metrics by grounding on measurement theory. Besides exploiting traditional measurement theory, we have also introduced the concepts of value-oriented and equivalence-oriented closeness, for all scale types (nominal, ordinal, interval, ratio).

The theoretical results derived from the framework are:

  • There is a clear correspondence between abstract tasks, metric kinds, and scale types. That is, classification, clustering, ranking, value prediction, and linear correlation oriented tasks can be interpreted as assignments for nominal, ordinal, interval, and ratio scale types, in which the closeness to the gold is evaluated at value or equivalence level.

  • The definitions of value- and equivalence-oriented evaluation metrics match the basic axioms stated in the literature for particular abstract tasks (strict monotonicity for classification, homogeneity and heterogeneity for clustering, and swapping for ranking): we only need to instantiate them over the different scale types to infer these axioms.

  • The proposed framework gives a single and unified explanation for most theoretical criticisms of classification, ranking, and clustering metrics found in the literature.

  • The proposed framework explains the need for interval and ratio units (e.g., meters, degrees) when assignments are compared at the interval and ratio scale types.

  • The proposed framework explains the popularity of the Pearson coefficient and the cosine distance when estimating the closeness of assignments at the interval and ratio scale types.

Tables 8 and 9 summarize the aggregated analysis for all abstract tasks and metrics. The metrics used in the different tasks match the corresponding kind of metric according to our definitions, and the discarded metrics correspond to limitations already identified in the literature. In addition, by filling the gap of value-oriented metrics for the ordinal scale type, we understand how to evaluate tasks such as semantic textual similarity, recommendation, or polarity detection.

9.2 Practical consequences

We have already mentioned that our framework is not only theoretical. Let us summarize the practical contributions of this paper, mainly addressed to the communities of task and metric designers. First, the framework gives a tool for selecting and checking the suitability of metrics: one only needs to know (i) on what scale the system output is defined, and (ii) whether the goal consists of predicting values (e.g., classification, mean error) or relationships (e.g., clustering, ranking, or linear correlation). Second, the framework helps users to distinguish between necessary properties and properties that depend on the particular characterization of the task. The constraints found in the literature that match our basic definitions of evaluation metric are strictly necessary for the corresponding abstract task; the other constraints depend on the particular task in which the evaluation takes place. For instance, the priority constraint is a desirable property for any ranking metric for the abstract task of ranking, while top-heaviness is task-dependent, although necessary for IR. Third, the framework provides a tool for defining metrics in situations in which, nowadays, the lack of suitable metrics forces the use of several metrics at once. The clearest example is the simultaneous use of Pearson, Spearman, and error rate metrics when evaluating value prediction on an ordinal scale (e.g., sentiment polarity prediction).
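As a toy illustration of the first point, the selection criterion can be read as a two-key lookup. The mapping below is our own simplification, assumed for illustration only; it names a few representative metrics rather than reproducing Tables 8 and 9.

```python
# A toy operationalization of the metric-selection criterion:
# (scale type of the output, value- vs. equivalence-oriented goal)
# -> candidate metric families. The cell contents are a simplification
# assumed for illustration, not the paper's full Tables 8 and 9.
METRIC_FAMILIES = {
    ("nominal",  "value"):       ["accuracy"],
    ("nominal",  "equivalence"): ["clustering metrics"],
    ("ordinal",  "value"):       [],  # the open gap discussed above
    ("ordinal",  "equivalence"): ["Kendall tau", "Spearman rho"],
    ("interval", "value"):       ["MAE", "MSE"],
    ("interval", "equivalence"): ["Pearson r"],
    ("ratio",    "equivalence"): ["cosine similarity"],
}

def suggest_metrics(scale, goal):
    """Return candidate metric families for a task characterization."""
    return METRIC_FAMILIES.get((scale, goal), [])

# Sentiment polarity prediction: ordinal output, value-oriented goal.
print(suggest_metrics("ordinal", "value"))   # [] -> no suitable metric yet
```

The empty cell for the ordinal/value-oriented combination is exactly the gap that currently forces the simultaneous use of correlation and error rate metrics.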

9.3 Limits of this study and future developments

Our framework opens the door to evaluation metrics in empty theoretical spaces, such as value-oriented metrics for the ordinal scale type (which have not been proposed yet). However, it does not cover every scenario nor every metric property, because there exist axioms that depend on the particular task. One example is top-heaviness in the case of ranking. Another is the ability of clustering metrics to avoid the combinatorial effect of element pairs in big clusters (pair-counting metrics fail on this). In classification, metrics can be grouped into classes depending on how random or non-informative outputs are evaluated: this also depends on the particular task. However, this is a common situation in science: the first example that comes to mind is the exclusion of Euclid’s fifth postulate to define different kinds of geometry.

Moreover, the framework is based on the assumption that system outputs and golds are assignments (of numerical values). This idea does not match translation or summarization systems, which generate text. Measurement theory also has a gap regarding the generation of structures: for instance, the inference of dependency trees from a text, or the construction of a hierarchical clustering, does not fit the idea of assignment. However, these tasks are often abstracted into the classification or clustering abstract tasks.

In the literature one can also find composite metrics. For instance, diversity metrics in fact consider two golds simultaneously: the ordinal relevance of documents and their redundancy, which is modeled as an assignment at the nominal scale type (information nuggets). These kinds of scenarios are not covered by the framework at the moment.

Finally, we have been focusing on the four classical scale types, but measurement theory proposes other scale types as well. Thus, adding more tasks to our framework will be straightforward as long as they can be associated with a scale type. In this respect, we plan to further discuss the role of document filtering and its relations with other tasks.