
1 Introduction

In machine learning, precision and recall are common measures for assessing the performance of results. They are used in particular in supervised learning [18], information retrieval [16], clustering [13] and, more recently, biclustering [12]. In supervised learning, classifier performance is assessed by comparing the predicted classes to the actual classes of a test set; these comparisons can be measured using the precision and recall of the positive class. The precision-recall pair is generally used in problems with very unbalanced classes, where the sensitivity-specificity pair is not relevant. In information retrieval, the performance of a search algorithm is assessed by analysing the similarity between the set of target documents and the set returned by the algorithm; this similarity is generally based on the precision and recall values. In clustering or biclustering, the algorithms identify clusters or biclusters in the data matrix, which are then compared to reference clusters or biclusters. It is very common to combine precision and recall into a single performance index such as the F-measure or the Jaccard index; see for instance [1].

By default, performance indices give the same weight to the precision and recall measures. However, in many contexts one of these two measures is more important than the other. For example, in genomics, clustering algorithms are used to identify clusters of genes with similar expression profiles [6]. These clusters are compared to a gene clustering constructed from genomics databases in order to evaluate their biological relevance. The objective of these analyses is to identify as much biological information as possible in the clusters of genes; in this context recall is more important than precision, and it is therefore more convenient to use an index that favors recall. In some performance indices, a parameter has been introduced to control the precision-recall tradeoff, for example the parameter \(\beta \) in the F-measure.

In this paper, we analyze the performance indices as a function of the precision-recall tradeoff and point out their characteristics. For the analysis and visualization of performances, we also propose a new tool called the tradeoff space, which has many advantages over the classic precision-recall space.

The paper is organized as follows. In Sect. 2, we present the performance indices and their variants which are sensitive to the precision-recall tradeoff. In Sect. 3, we give the properties of the precision-recall space and analyze the performance indices in this space. In Sect. 4, we introduce the tradeoff space and show how to represent performances with tradeoff curves. Section 5 is devoted to applications in unsupervised and supervised contexts, where we point out the advantages of the tradeoff space for model selection and the comparison of algorithms. Finally, we present our conclusions and give some recommendations on the choice of the performance index.

2 Performance Indices Based on the Precision and Recall Measures

In this section, the definitions are given in the context of unsupervised learning; however, all these methods can also be used in the context of supervised learning. In Sect. 5, all the indices and methods are applied to both contexts.

2.1 Definitions

Let D be a dataset containing N elements. Let \(T\subset D\) be a target cluster that we want to find and let X be the cluster returned by an algorithm, referred to as \(\mathbb {A}\), whose objective is to find the target cluster. The goodness of X is estimated using a performance index I(T, X) measuring the similarity between T and X. Some performance indices rely on the two basic measures of precision and recall, given by

$$\left\{ \begin{array}{rcl} pre = precision(T,X) &=& \frac{|T \cap X|}{|X|}, \\ rec = recall(T,X) &=& \frac{|T \cap X|}{|T|} \end{array} \right. $$

where \(|.|\) denotes the cardinality. The main performance indices are combinations of the precision and recall measures. These indices give the same importance to precision and recall; however, we can define weighted versions that favor either precision or recall. We introduce in each index a parameter \(\lambda \in [0,1]\) that controls the tradeoff: \(\lambda \) gives the importance of recall and \(1-\lambda \) the importance of precision. The weighted indices have to respect the following conditions. For \(\lambda =0\) (resp. \(\lambda =1\)) only precision (resp. recall) matters. The index must return 0 when the intersection \(T\cap X\) is empty and 1 when \(T=X\). For \(\lambda =0.5\), the same importance is given to precision and recall. Formally, the conditions are as follows:

$$\left\{ \begin{array}{rcl} I_{weighted}(T,X,\lambda ) &\in & [0,1]\\ I_{weighted}(T,X,0) &=& pre \\ I_{weighted}(T,X,1) &=& rec \\ I_{weighted}(T,X,0.5) &=& I_{non\text{-}weighted}(T,X) \\ I_{weighted}(T,X,\lambda ) = 0 &\Rightarrow & |T\cap X|=0 \\ I_{weighted}(T,T,\lambda ) &=& 1. \end{array} \right. $$
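As a concrete illustration, here is a minimal Python sketch of the two basic measures, with clusters represented as sets (the function names are ours, chosen for illustration):

```python
# Precision and recall of a returned cluster X against a target cluster T,
# both represented as Python sets; the cardinality |.| becomes len().

def precision(T: set, X: set) -> float:
    """|T n X| / |X|: fraction of the returned elements that are correct."""
    return len(T & X) / len(X)

def recall(T: set, X: set) -> float:
    """|T n X| / |T|: fraction of the target elements that are returned."""
    return len(T & X) / len(T)

T = {1, 2, 3, 4}
X = {3, 4, 5}
print(precision(T, X), recall(T, X))  # 0.666..., 0.5
```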

In this paper, we study the four most popular indices: Kulczynski, F-measure, Folke and Jaccard. However, our work can easily be extended to other indices.

2.2 Kulczynski Index

The Kulczynski index is the arithmetic mean between precision and recall.

$$ I_{Kul}(T,X) = \frac{1}{2} (pre+rec). $$

The weighted version introduces a parameter \(\rho \in [0,+\infty [\) that controls the precision-recall tradeoff. The importance of precision increases with the value of \(\rho \); the pivotal point is at \(\rho =1\). In order to respect the conditions on the weighted indices, we rewrite this index by setting \(\lambda =\frac{1}{\rho +1}\):

$$\left\{ \begin{array}{rcl} I_{Kul}(T,X,\rho ) &=& \frac{1}{\rho +1} (\rho \cdot pre+rec) \\ I_{Kul}(T,X,\lambda ) &=& \lambda \, rec + (1-\lambda )\, pre. \end{array} \right. $$

2.3 F-Measure

The F1-measure, also called the Dice index, is the ratio between twice the intersection and the sum of the sizes of cluster X and target cluster T. It is the harmonic mean of precision and recall.

$$ I_{Fmes}(T,X) = \frac{2|T\cap X|}{|T|+|X|} = \frac{2}{\frac{1}{rec}+\frac{1}{pre}} = \frac{2\, pre \cdot rec}{pre+rec}. $$

The F-measure is a weighted version of the F1-measure. The parameter \(\beta \in [0,+\infty [\) controls the precision-recall tradeoff. The importance of recall increases with the value of \(\beta \); the pivotal point is at \(\beta =1\). In order to respect the conditions on the weighted indices, we rewrite this index by setting \(\lambda =\frac{\beta ^2}{1+\beta ^2}\) and we obtain

$$ I_{Fmes}(T,X,\beta ) = \frac{1+\beta ^2}{\frac{\beta ^2}{rec}+\frac{1}{pre}} = (1+\beta ^2)\frac{pre \cdot rec}{\beta ^2\, pre+rec}, $$
$$ I_{Fmes}(T,X,\lambda )= \frac{1}{\frac{\lambda }{rec}+\frac{1-\lambda }{pre}} = \frac{pre \cdot rec}{\lambda \, pre+ (1-\lambda )\, rec}. $$

2.4 Folke Index

The Folke index is the geometric mean of precision and recall.

$$ I_{Fk}(T,X) = \frac{|T \cap X|}{\sqrt{|T|\,|X|}} = \sqrt{pre \cdot rec}. $$

To obtain the weighted version of the Folke index, we introduce the parameter \(\lambda \) such that:

$$ I_{Fk}(T,X,\lambda ) = \frac{|T \cap X|}{|X|^{1-\lambda }\, |T|^{\lambda }} = rec^{\lambda }\, pre^{1-\lambda }. $$

2.5 Jaccard Index

The Jaccard index is the ratio between the intersection and the union of cluster X and target cluster T.

$$ I_{Jac}(T,X)= \frac{|T\cap X|}{|T|+|X|-|T\cap X|} = \frac{pre \cdot rec}{pre+rec-pre \cdot rec}. $$

It is not easy to define a weighted version of the Jaccard index because of the term \(pre \cdot rec\) in the denominator. In order to respect the conditions on the weighted indices, we introduce the two weight functions \(w(\lambda )=\min \{2\lambda ,1\}\) and \(v(\lambda )=1-|1-2\lambda |\), where \(\lambda \) controls the precision-recall tradeoff. The weighted Jaccard index is defined by:

$$ I_{Jac}(T,X,\lambda ) = \frac{pre \cdot rec}{w(\lambda )\, pre+w(1-\lambda )\, rec-v(\lambda )\, pre \cdot rec}. $$
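The four weighted indices can be written directly as functions of \((pre, rec, \lambda )\). The following sketch is ours, not code from the paper; it implements the formulas of Sects. 2.2-2.5 and numerically checks the boundary conditions of Sect. 2.1:

```python
def kulczynski(pre, rec, lam):
    # arithmetic mean, Sect. 2.2
    return lam * rec + (1 - lam) * pre

def f_measure(pre, rec, lam):
    # harmonic mean, Sect. 2.3
    return pre * rec / (lam * pre + (1 - lam) * rec)

def folke(pre, rec, lam):
    # geometric mean, Sect. 2.4
    return rec ** lam * pre ** (1 - lam)

def jaccard(pre, rec, lam):
    # Sect. 2.5, with the weight functions w and v
    w = lambda t: min(2 * t, 1.0)
    v = 1 - abs(1 - 2 * lam)
    return pre * rec / (w(lam) * pre + w(1 - lam) * rec - v * pre * rec)

for index in (kulczynski, f_measure, folke, jaccard):
    pre, rec = 0.85, 0.5
    assert abs(index(pre, rec, 0.0) - pre) < 1e-12   # lambda = 0: precision
    assert abs(index(pre, rec, 1.0) - rec) < 1e-12   # lambda = 1: recall
    assert abs(index(1.0, 1.0, 0.3) - 1.0) < 1e-12   # perfect cluster
```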

3 The Precision-Recall Space

A common tool for analyzing and visualizing performances is the precision-recall space: a 2D space which represents precision on the y-axis and recall on the x-axis. The performance of a cluster is represented by a point in this space (Fig. 1) [3].

Fig. 1. The precision-recall space (left). The precision-recall curve (right).

The precision-recall space is close to the ROC space, which is defined by the false and true positive rates; some relationships between the precision-recall and ROC spaces have been identified [7]. A point in the precision-recall space represents the performance of all clusters with the same size \(|X|=|T|\frac{rec}{pre}\) and the same intersection \(|T \cap X|=|T|\,rec\). The \(\bullet \) symbol at (1, 1), which maximizes both precision and recall, represents the perfect cluster, i.e. the cluster equal to the target cluster (\(T=X\)). The \(\blacksquare \) symbol at \((1,\frac{|T|}{|D|})\) represents the case where the returned cluster is the whole data matrix, \(X=D\). The horizontal bold line corresponds to the expected performances of a random cluster, i.e. a cluster whose elements are randomly selected. The expected precision of a random cluster, \(\mathbb {E}[pre]=\frac{|T|}{|D|}\), depends only on the size of the target cluster |T| and is therefore constant, whereas the expected recall depends on the size of the cluster |X|: \(\mathbb {E}[rec]=\frac{|X|}{|D|}\). The \(\blacktriangle \) symbol at \((\frac{1}{|T|},1)\) represents the clusters with a unique element belonging to the target cluster, and the \(\blacktriangle \) symbol at (0, 0) represents the clusters whose intersection with the target cluster is empty. The gray area represents performances that cannot be reached by any cluster: since \(pre \ge \frac{|T|}{|D|}rec\) (because \(|X|\le |D|\)), the clusters whose performances lie on the \(pre = \frac{|T|}{|D|}rec\) line are those with the minimal possible intersection \(|T \cap X|\) for a given size |X|.

In clustering and biclustering, the algorithms may have a parameter controlling the size |X| of the returned clusters. By varying this parameter, an algorithm produces different clusters with different values of precision and recall; the performance of an algorithm is therefore represented by a set of points that can be approximated by a curve. Figure 1 (right) gives an example of such a precision-recall curve, from which some information can be drawn. If a point A dominates another point B, i.e. \(pre(T,A)>pre(T,B)\) and \(rec(T,A)>rec(T,B)\), then the performances of cluster A are better than those of cluster B, whatever the performance index used to compare the two clusters. In Fig. 1 (right) the black points are the dominant points; they represent the performances of the best clusters. However, there is no domination relation between these points, so we cannot compare them in the precision-recall space: a performance index is needed.

Table 1. The values of the four performance indices on several examples with different values of precision, recall and \(\lambda \).
Fig. 2. The isolines of the four performance indices in the precision-recall space.

The behavior of the performance indices can be visualized by plotting their isolines in the precision-recall space. An isoline is a set of points in the precision-recall space having the same value of the performance index [9, 12]. Figure 2 shows the isolines of the Kulczynski, F-measure, Folke and Jaccard indices. The bold lines represent the isolines for \(\lambda =0.5\), while the dotted and full lines represent the isolines for \(\lambda =0.2\) and \(\lambda =0.8\) respectively. For the four indices, we observe a symmetry of the isolines about the line \(pre=rec\); this means that precision and recall have the same importance. Nevertheless, the indices do not penalize the difference between precision and recall \((pre-rec)\) in the same way: the Kulczynski index does not take this difference into account, whereas the other indices penalize it, the Folke index less severely than the F-measure and Jaccard indices. Note that the F-measure and Jaccard indices are equivalent because they are compatible, i.e. \(I_{Fmes}(T,X_1)\ge I_{Fmes}(T,X_2) \Leftrightarrow I_{Jac}(T,X_1)\ge I_{Jac}(T,X_2).\)
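Such isolines are easy to draw numerically. The sketch below (our code, with numpy and matplotlib assumed available) plots isolines of the weighted F-measure of Sect. 2.3 for three tradeoff values; the styling only loosely mimics Fig. 2:

```python
import numpy as np
import matplotlib.pyplot as plt

def f_measure(pre, rec, lam):
    # weighted F-measure of Sect. 2.3: pre*rec / (lam*pre + (1-lam)*rec)
    return pre * rec / (lam * pre + (1 - lam) * rec)

# grid over the precision-recall space (0 excluded to avoid 0/0)
rec, pre = np.meshgrid(np.linspace(0.01, 1, 200), np.linspace(0.01, 1, 200))
for lam, style in [(0.2, "dotted"), (0.5, "solid"), (0.8, "dashed")]:
    plt.contour(rec, pre, f_measure(pre, rec, lam),
                levels=[0.2, 0.4, 0.6, 0.8], linestyles=style)
plt.xlabel("recall")
plt.ylabel("precision")
plt.show()
```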

Modifying the value of \(\lambda \) changes the shape of the isolines and gives more importance to precision or recall. Note that for \(pre=rec\) the Kulczynski, F-measure and Folke indices return the same value whatever \(\lambda \) (the bold, dotted and full lines cross on the line \(pre=rec\)). The Jaccard index is different: it returns lower values when \(\lambda \) is close to 0.5. Table 1 gives some examples illustrating the consequences of these characteristics. We see that the value and rank recorded for each point depend on the index. For \(\lambda =0.5\), all indices consider \(C_3\) as the best cluster. For \(\lambda =0.2\), the indices no longer agree, and the best cluster depends on the performance index.

As we have seen, the choice of the performance index has a high impact on the analysis of the results, especially when the precision-recall tradeoff is far from 0.5. It is a crucial choice that must depend on the context; we discuss this point in the next section.

4 The Tradeoff Space

Fig. 3. The tradeoff spaces for the four performance indices.

Fig. 4. Representation of the precision-recall curve of Fig. 1 in the tradeoff spaces for the four performance indices. The gray curves are the tradeoff curves, the bold curves are the optimal tradeoff curves.

4.1 Definitions

We propose a new tool, called the tradeoff space, to visualize the performance of algorithms as a function of the precision-recall tradeoff. The x-axis and y-axis represent respectively \(\lambda \) and the performance index. This method is inspired by the cost curves used in supervised classification [8]. The performance of a result is represented in this space by a curve \(I(T,X,\lambda )\); this curve depends only on \(\lambda \) because X and T are fixed. There is a bijection between the points of the precision-recall space and the curves of the tradeoff space. Figure 3 gives an example of these curves for a result whose performances are \(pre=0.85\) and \(rec=0.5\); the bold curve represents the performance index. The extremities of a curve give the precision and recall of the cluster: \(I(T,X,0)=pre\) and \(I(T,X,1)=rec\).

The full line shows the performances of the maximal cluster, i.e. the cluster containing all the elements, which corresponds to the point \((1,\frac{|T|}{|D|})\) in the precision-recall space. This curve defines the domain of application of the performance index for a given dataset, illustrated by the white area in Fig. 3: a point in the gray area means that the corresponding cluster performs worse than the maximal cluster and may be considered irrelevant. The application domain of the Kulczynski index is much smaller than that of the other indices because this index does not penalize the difference between precision and recall. The other extreme case is the empty cluster, containing no element; however, the precision of the empty cluster is not defined because its size is null (\(|X|=0\)). We therefore take as minimal cluster the cluster containing a unique element belonging to the target cluster, whose performances are \(pre=1\) and \(rec=1/|T|\). The dotted line represents the performance of the minimal cluster; the clusters below this line are worse than this trivial cluster. Note that this line is relevant only for the Kulczynski index: for the other indices, it falls sharply as soon as the tradeoff value moves away from zero. The perfect cluster is represented by the line \(I(T,X,\lambda )=1\), while the clusters with an empty intersection with the target cluster are represented by the line \(I(T,X,\lambda )=0\). All curves representing the expected performances of a random cluster pass through the point \((0,\frac{|T|}{|D|})\).
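The curves of Fig. 3 are straightforward to reproduce. The sketch below (our code) plots the tradeoff curves of the Kulczynski and F-measure indices for a result with \(pre=0.85\) and \(rec=0.5\):

```python
import numpy as np
import matplotlib.pyplot as plt

pre, rec = 0.85, 0.5                               # the fixed result of Fig. 3
lam = np.linspace(0.0, 1.0, 101)
kul = lam * rec + (1 - lam) * pre                  # I_Kul(T, X, lambda)
fmes = pre * rec / (lam * pre + (1 - lam) * rec)   # I_Fmes(T, X, lambda)
plt.plot(lam, kul, label="Kulczynski")
plt.plot(lam, fmes, label="F-measure")
plt.xlabel(r"$\lambda$")
plt.ylabel("index value")
plt.legend()
plt.show()
```

At \(\lambda =0\) both curves start at the precision (0.85) and at \(\lambda =1\) both end at the recall (0.5), as required by the conditions of Sect. 2.1.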

4.2 The Optimal Tradeoff Curve

The performance of an algorithm can be represented by a curve in the precision-recall space, as illustrated in Fig. 1 (right). Each point of this curve corresponds to a curve in the tradeoff space; the precision-recall curve is therefore represented by a set of curves in the tradeoff space. Figure 4 shows the representation of the precision-recall curve of Fig. 1 in the tradeoff space for the four performance indices. We call the upper envelope of this set of curves the optimal tradeoff curve (in bold in Fig. 4). The optimal tradeoff curve of algorithm \(\mathbb {A}\), noted \(I^*(T,\mathbb {A},\lambda )\), is a piecewise curve obtained by keeping the best tradeoff curve for each value of \(\lambda \); it represents the best performance of the algorithm for any value of the tradeoff \(\lambda \). Note that the curves forming the upper envelope correspond to dominant points of the precision-recall curve; the curves of the dominated points are always below the optimal tradeoff curve. For the Kulczynski index, the curves forming the upper envelope correspond to the points of the convex hull of the precision-recall curve.
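On a grid of \(\lambda \) values, the upper envelope is simply a pointwise maximum. A minimal sketch (our code, with made-up (pre, rec) points):

```python
import numpy as np

def kulczynski(pre, rec, lam):
    return lam * rec + (1 - lam) * pre

results = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.9)]   # illustrative (pre, rec) pairs
lam = np.linspace(0.0, 1.0, 101)
curves = np.array([kulczynski(p, r, lam) for p, r in results])
envelope = curves.max(axis=0)    # I*(T, A, lambda): the optimal tradeoff curve
best = curves.argmax(axis=0)     # index of the best result at each lambda
```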

5 Application of the Tradeoff Curves

We show in this section that the tradeoff curves are a better visualization tool and easier to interpret than the precision-recall curves. We focus on two applications of the tradeoff curves: model selection and the comparison of algorithms. To this end, we study these problems in the context of biclustering and of supervised binary classification.

Biclustering, also referred to as co-clustering, is a major tool of data science in many domains, and many algorithms have emerged in recent years [4, 10, 11, 14, 15, 17]. A bicluster being a subset of rows which exhibit a similar behavior across a subset of columns, all these algorithms aim to obtain coherent biclusters, and it is crucial to have a reliable procedure for their validation. In this paper we rely on an artificial biclustering problem in which a bicluster has been introduced into a random data matrix. The random data matrix follows a uniform distribution and the introduced bicluster, following the additive model defined by [15], is the target bicluster. The objective of the biclustering algorithms is to find the target bicluster. The points belonging to both the target bicluster and the bicluster returned by the algorithm are the true positives; the points of the target bicluster which are not returned by the algorithm are the false negatives; and the points returned by the algorithm but not in the target bicluster are the false positives.

For supervised classification problems, the measures based on precision and recall are generally preferred when the classes are unbalanced. The precision-recall curves and the tradeoff space can also be used in this setting: the target cluster T is the positive class and the cluster returned by an algorithm is its set of positive predictions. The classifier has a decision threshold which controls the number of positive predictions, and each threshold value yields a curve in the tradeoff space. For our experiments we use unbalanced real data from the UCI data repository and artificial data generated from Gaussian distributions.

5.1 Model Selection

To deal with the biclustering problem, we use the well-known CC (Cheng & Church) algorithm [5] to find the bicluster. The similarity between the bicluster returned by the CC algorithm and the target bicluster is computed with the four performance indices. The CC algorithm has a hyper-parameter controlling the size of the returned bicluster; its performance can therefore be represented by a precision-recall curve (Fig. 5, 1st graphic), where each bicluster returned by the CC algorithm is identified by its size. From this curve, it is not easy to determine the best bicluster because there are several dominant points; even if we plot the isolines on the graph, the comparison of the different biclusters is not intuitive. Graphics 2-5 of Fig. 5 represent the optimal tradeoff curves for the Kulczynski, F-measure, Folke and Jaccard indices. From a tradeoff curve, we can immediately identify the best bicluster for a given tradeoff value: the range of \(\lambda \) decomposes into a set of intervals, represented in Fig. 5 by the vertical dotted lines, and for each interval the best bicluster is identified. In our example, there are seven intervals for the F-measure, so only the seven corresponding biclusters are relevant. For the last interval (\(\lambda >0.74\)) the best bicluster is the maximal bicluster, and the optimal tradeoff curve coincides with the curve of the maximal bicluster. The same analysis can be done with the tradeoff curves of the Kulczynski, Folke and Jaccard indices, which contain respectively eight, seven and six intervals and relevant biclusters. Note that the identification of the best bicluster depends on the chosen performance index. It is not possible to identify the best biclusters in the precision-recall space because they correspond neither to the set of dominant points nor to the convex hull of the precision-recall curve (except for the Kulczynski index). Furthermore, it is also easy to apply constraints on precision and recall in the tradeoff space, since these values can be read at the extremities of the tradeoff curves. If a minimal precision \(pre_{min}\) is required, we simply select the curves that start above the minimal precision, i.e. \(I(T,X,0)>pre_{min}\); in the same way, with a required minimal recall \(rec_{min}\), we keep only the curves that finish above the minimal recall, i.e. \(I(T,X,1)>rec_{min}\).
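Because a tradeoff curve starts at the cluster's precision and ends at its recall, these constraints reduce to simple endpoint tests. A tiny sketch (our code, with hypothetical values):

```python
# Each candidate model is summarized by its (pre, rec) pair; its tradeoff
# curve starts at pre (lambda = 0) and ends at rec (lambda = 1).
candidates = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.9)]   # illustrative values
pre_min, rec_min = 0.5, 0.4
kept = [(p, r) for p, r in candidates if p > pre_min and r > rec_min]
print(kept)   # only (0.7, 0.6) satisfies both constraints
```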

Fig. 5. Identification of the best biclusters in the precision-recall space (1st graphic) and in the tradeoff space (graphics 2-5).

Fig. 6. Identification of the best classification model in the precision-recall space (1st graphic) and in the tradeoff space (graphics 2-5).

To deal with the supervised classification problem, we use linear discriminant analysis (LDA) to find the positive class. For each test example, the classifier estimates a probability of belonging to the positive class; this probability is then compared to a decision threshold t in order to assign the positive or negative class to the example. By default this threshold is 0.5, but it can be changed to favor the positive or negative class in the context of unbalanced classes. The decision threshold is a hyper-parameter to optimize in order to maximize the performance index. The first graphic of Fig. 6 shows the precision-recall curve of the LDA classifier, where each point represents a different value of the decision threshold. As with the biclustering problem, it is difficult to identify the best thresholds from this curve for a given performance index. The tradeoff curves, represented in graphics 2-5 of Fig. 6, are much more useful for finding the models maximizing the F-measure or the Jaccard index whatever the precision-recall tradeoff. To each interval corresponds a value of the decision threshold which yields the best model. The models assigning all examples to the positive (\(t=0\)) or negative (\(t=1\)) class are at the extremities.
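A sketch of this threshold sweep on synthetic unbalanced data (our code; scikit-learn is assumed available, and the Gaussian data only mimics the experimental setup described above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(900, 2))   # majority (negative) class
X_pos = rng.normal(1.5, 1.0, size=(100, 2))   # minority (positive) class
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 900 + [1] * 100)

clf = LinearDiscriminantAnalysis().fit(X, y)
scores = clf.predict_proba(X)[:, 1]           # estimated P(positive | x)

# one (precision, recall) point per threshold, hence one tradeoff curve each
precision, recall, thresholds = precision_recall_curve(y, scores)
```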

5.2 Comparison of Algorithms

Fig. 7. Identification of the best biclustering algorithm in the precision-recall space (1st graphic) and the tradeoff space (graphics 2-5). The performances of the CC algorithm are in black and ISA in gray.

Fig. 8. Identification of the best classification algorithm in the precision-recall space (1st graphic) and the tradeoff space (graphics 2-5). The performances of the LDA algorithm are in black and SVM in gray.

Here we consider the precision-recall curves and tradeoff curves as visualization tools to compare algorithms, keeping the same illustrative problems as in the previous section. For the biclustering problem, a second algorithm, ISA [2], is used to find the target bicluster; the objective is to compare the performances of the two algorithms and identify the best one. The first graphic of Fig. 7 shows the performances of the CC algorithm (in black) and ISA (in gray) in the precision-recall space. In this space, the two curves cross each other several times: neither algorithm is strictly better than the other, and it is hard to identify the conditions in which CC is better than ISA and vice versa. Graphics 2-5 of Fig. 7 represent the tradeoff curves of the algorithms for the Kulczynski, F-measure, Folke and Jaccard indices. In the tradeoff space, we immediately see which algorithm is the best whatever the tradeoff value. According to the F-measure, CC is the best for \(\lambda <0.28\), ISA is the best for \(0.28<\lambda <0.83\), and for \(\lambda >0.83\) both algorithms return the maximal bicluster and have the same performances. According to the Folke index, CC is the best for \(\lambda <0.30\), ISA for \(0.30<\lambda <0.69\), and for \(\lambda >0.69\) both algorithms return the maximal bicluster. According to the Jaccard index, CC is the best for \(\lambda <0.2\), ISA for \(0.2<\lambda <0.87\), and for \(\lambda >0.87\) both algorithms return the maximal bicluster. In the Kulczynski graphic, we add the line representing the extreme case where the algorithm returns the minimal bicluster, i.e. a bicluster containing only one true positive. For \(\lambda <0.33\), CC is better than ISA but both algorithms are worse than the minimal bicluster; ISA is the best for \(0.33<\lambda <0.53\), and for \(\lambda >0.53\) both algorithms return the maximal bicluster. The interval [0.33, 0.53] is the tradeoff range in which the algorithms are useful: outside this interval, trivial solutions are better than the algorithms' solutions. We can therefore conclude that ISA is strictly better than CC for the Kulczynski index. Defining the conditions where CC is better than ISA from the precision-recall curve is much more difficult. In the precision-recall space, the two curves cross each other three times, which suggests that the interval of \(\lambda \) where CC is better than ISA is not contiguous; actually, the tradeoff curves show that the best algorithm changes only once. Likewise, in the precision-recall space CC has a better precision than ISA 14 times out of 20, which might suggest that CC is better than ISA for a large range of \(\lambda \) values; the tradeoff space shows that the opposite is true.
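The \(\lambda \) values at which the best algorithm changes can be located numerically by comparing the two optimal tradeoff curves on a grid. A sketch (our code, with made-up results for the two algorithms):

```python
import numpy as np

def kulczynski(pre, rec, lam):
    return lam * rec + (1 - lam) * pre

lam = np.linspace(0.0, 1.0, 1001)
cc_results = [(0.9, 0.2), (0.6, 0.6)]    # illustrative (pre, rec) pairs
isa_results = [(0.8, 0.4), (0.3, 0.9)]   # illustrative (pre, rec) pairs
env_cc = np.max([kulczynski(p, r, lam) for p, r in cc_results], axis=0)
env_isa = np.max([kulczynski(p, r, lam) for p, r in isa_results], axis=0)

# lambda values where the sign of the difference changes, i.e. where the
# better algorithm switches
switches = lam[np.nonzero(np.diff(np.sign(env_cc - env_isa)))]
print(switches)
```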

For the supervised classification problem, we compare LDA with a linear support vector machine (SVM). Figure 8 shows the performances of LDA (in black) and SVM (in gray) in the precision-recall space (1st graphic) and in the tradeoff space (graphics 2-5). As observed in the biclustering situation, we can easily see which algorithm is the best for any tradeoff value and define the range of tradeoffs for which the algorithms are better than a trivial solution. According to the Kulczynski index, LDA is the best for \(0.23<\lambda <0.44\) and SVM for \(0.44<\lambda <0.81\). According to the F-measure, LDA is the best for \(\lambda <0.23\) and SVM for \(\lambda >0.23\). According to the Folke index, LDA is the best for \(\lambda <0.50\) and SVM for \(0.50<\lambda <0.89\). According to the Jaccard index, LDA is the best for \(\lambda <0.29\) and for \(0.79<\lambda <0.96\), and SVM for \(0.29<\lambda <0.79\).

6 Conclusion

In this paper, we have presented new methods to deal with the precision-recall tradeoff for different performance indices. The analysis of these indices in the precision-recall space reveals several properties depending on the difference between the precision and recall measures and on the tradeoff \(\lambda \). These characteristics should guide the choice of the performance index, in order to select the one most suitable to the dataset and the context.

The tradeoff space is a new tool for visualizing performance as a function of the tradeoff. In this space, model selection and the comparison of algorithms are much easier and more intuitive than in the precision-recall space. We have also proposed new performance indices weighted by a probability density function representing the knowledge about the precision-recall tradeoff. This work focuses on four indices (Kulczynski, F-measure, Folke and Jaccard), but it can easily be extended to any other index that relies on the precision and recall measures.

To conclude this paper, we give some recommendations to the user who wants a performance index adapted to a given problem. First, choose the type of index (Jaccard, Folke, F-measure, ...) according to its characteristics; the visualization of its isolines can be helpful. Secondly, define the precision-recall tradeoff of the problem. If the exact value of the tradeoff is known, use the corresponding index of Sect. 2 with \(\lambda \) set to this value. If there is no knowledge about the precision-recall tradeoff, draw the optimal tradeoff curve and compute its AUC to obtain a numerical value. If only vague information about the precision-recall tradeoff is available, draw the optimal tradeoff curve on the range of possible values of \(\lambda \). These recommendations should produce evaluation procedures that are more suitable to the problem and should therefore improve the robustness and accuracy of experimental studies.
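As an illustration of the AUC recommendation, the optimal tradeoff curve can be summarized by a single number via numerical integration; a sketch under the same made-up results as before (our code):

```python
import numpy as np

def kulczynski(pre, rec, lam):
    return lam * rec + (1 - lam) * pre

lam = np.linspace(0.0, 1.0, 101)
results = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.9)]   # illustrative (pre, rec) pairs
envelope = np.max([kulczynski(p, r, lam) for p, r in results], axis=0)
auc = np.trapz(envelope, lam)   # uniform weighting over all tradeoffs
print(auc)
```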