Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as \(\ell _p\)-norm with \(p>0\)), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and \(m_p\)-dissimilarity (\(p>0\)), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise \(m_p\)-dissimilarity where \(p\ge 0\) by introducing \(m_0\)-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in the content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of \(m_p\)-dissimilarity is a more effective alternative to other data-dependent and commonly-used distance-based similarity measures as its task-specific performance is more consistent across a wide range of datasets.

Similarity is the inverse of dissimilarity. We use dissimilarity in this paper to be consistent with distance measures.
Author(s) defined it as a similarity measure, but we define it as a dissimilarity measure to be consistent with other measures.
Because \(A=\frac{1}{M}\), both numerator and denominator approach 0 when \(p\rightarrow 0\).
Fernando and Webb (2017) have shown that it is a better alternative than any other setting of p. Hereafter, to simplify notation, we refer to \(d_{rank}(\mathbf{x}, \mathbf{y}, 1)\) as \(d_{rank}(\mathbf{x}, \mathbf{y})\).
We also examined whether the geometric mean of rank differences produced better results than the arithmetic mean (\(d_{rank}\)); but we observed that it produced worse results than \(d_{rank}\) in all six datasets.
The BoW text datasets were not used because there is no issue of scales and units of measurement as feature values are frequency counts.
We did not use the other datasets from Sect. 6.3.2 because we do not have the actual text of their documents, only the BoW vectors.
\(d_{cos}\) here is equivalent to \(d_{cosIdf}\) in Sect. 6.3.2 because IDF weights were applied as a part of BoC vector representation.
The authors would like to thank A/Prof Peter Vamplew for interesting discussions and useful feedback on the first draft of the manuscript. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions to improve the manuscript.
Appendix A: Empirical evaluation in documents represented as bag-of-concepts
Here we report the performances of BoW versions of data-dependent measures \(d_{rank}, m_1, d_{lin}\) and \(m_0\) (discussed in Sect. 4) against cosine distance (\(d_{cos}\)) and dissimilarity based on simple dot product (\(d_{dot}\)) in text datasets where documents are represented as bag-of-concepts (BoC) vectors (Kim et al. 2017).
In the BoC representation, concepts are defined by using word2vec (Mikolov et al. 2013a), a word embedding technique based on deep neural networks (LeCun et al. 2015; Goodfellow et al. 2016), and then clustering similar words. The BoC representation has been shown to produce better results than the BoW and doc2vec (a vector representation of documents based on word embedding) representations in the document classification task (Kim et al. 2017). We conducted experiments in the 5NN document classification task using three text datasets, NG20, R52 and R8 (the same datasets used by Kim et al. (2017)), where documents were represented by 500-dimensional BoC vectors. We used the Python implementation of the BoC representation provided by the authors (Kim et al. 2017).
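The 5NN classification set-up described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names `cosine_distance` and `knn_predict`, and the tiny two-dimensional toy vectors, are assumptions made purely for illustration. The point is that the classifier only needs a pairwise dissimilarity function, so any of the measures compared in this appendix could be plugged in.

```python
from collections import Counter

import numpy as np


def cosine_distance(x, y):
    """Cosine distance: 1 minus the cosine of the angle between x and y."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def knn_predict(query, train_X, train_y, dissim, k=5):
    """Predict the label of `query` by majority vote among its k nearest
    training instances under the dissimilarity function `dissim`."""
    d = np.array([dissim(query, x) for x in train_X])
    nearest = np.argsort(d, kind="stable")[:k]  # indices of the k smallest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two groups of vectors pointing in different directions.
train_X = np.array([
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.0],   # class "a"
    [0.1, 1.0], [0.2, 1.0], [0.0, 0.9],   # class "b"
])
train_y = ["a", "a", "a", "b", "b", "b"]

pred_a = knn_predict(np.array([1.0, 0.05]), train_X, train_y, cosine_distance)
pred_b = knn_predict(np.array([0.05, 1.0]), train_X, train_y, cosine_distance)
```

With k = 5 and three training instances per class, each query's vote is decided by the three instances of the closer class.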
The average 5NN classification error and standard error over a 10-fold cross-validation of \(d_{cos}\), \(d_{dot}\) and the BoW variants of \(d_{rank}, m_1, d_{lin}\) and \(m_0\) are provided in Table 14. The results show that \(m_0\) produced the best result, or a result equivalent to the best, in all three datasets, followed by \(d_{rank}\) in two datasets, and \(d_{lin}\) and \(d_{cos}\) in only one dataset each. This is consistent with the results in Sect. 6.3.2, where \(m_0\) produced better results overall.
The dissimilarity measure based on the simple dot product (\(d_{dot}\)) produced significantly worse results than the other contenders in all datasets. It is interesting to note that the self-similarity of instances under \(d_{dot}\), like that under \(m_p\), is not constant. However, the self-similarity of \(d_{dot}\) is not data-dependent, as it depends solely on the feature values of the instance itself. Furthermore, it does not consider self-similarity in each dimension separately.
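The observation about self-similarity can be checked with a small sketch. The exact form of \(d_{dot}\) is defined in the main text and is not reproduced here; the sketch works with the raw dot-product and cosine similarities instead, and the helper names and example vectors are illustrative assumptions. The dot product of an instance with itself is its squared norm, so it changes from instance to instance but is computed from that instance alone, whereas the cosine self-similarity is always 1.

```python
import numpy as np


def dot_similarity(x, y):
    """Raw dot-product similarity."""
    return float(np.dot(x, y))


def cosine_similarity(x, y):
    """Dot product normalised by the two vector lengths."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Two hypothetical instances with different lengths.
a = np.array([1.0, 2.0, 0.0])
b = np.array([3.0, 0.0, 4.0])

# Self-similarity under the dot product is the squared norm of the instance.
self_dot = [dot_similarity(v, v) for v in (a, b)]

# Self-similarity under cosine is 1 for every non-zero instance.
self_cos = [cosine_similarity(v, v) for v in (a, b)]
```

Neither value looks at how the other instances in the dataset are distributed, which is the sense in which the self-similarity of the dot product varies without being data-dependent.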
Appendix B: Effect of ensemble size in tree-based data-dependent measures
To investigate the effect of ensemble size (t) in tree-based data-dependent measures (\(d_{USF}\) and \(m_{IF}\)), we evaluated their task-specific performance while varying the number of trees. The CBIR performances of \(d_{USF}\) and \(m_{IF}\) with up to \(t=1000\) trees in the Corel and Hba datasets are shown in Fig. 8.
As expected, the MAP@25 of both measures increased with t in both datasets. However, even with \(t=1000\), they did not produce retrieval results competitive with those of the one-dimensional data-dependent measure \(m_0\) in either dataset, although the number of dimensions (M) is much less than 1000 in both: Corel (\(M=67\)) and Hba (\(M=187\)). This result shows that tree-based methods require a large ensemble size (\(t\gg M\)) to produce a good result, but using a large t makes them expensive to run. For example, the average total runtime (building trees, pre-processing and retrieval) of one run in the Corel dataset with \(t=1000\) was 809 seconds for \(d_{USF}\) and 2216 seconds for \(m_{IF}\), whereas \(m_0\) took only 281 seconds.
Appendix C: Effectiveness of equal-frequency and equal-width discretisation approaches to speed up one-dimensional data-dependent measures
We evaluated the performances of one-dimensional data-dependent measures with equal-width discretisation (EWD) and equal-frequency discretisation (EFD) in the CBIR task using the Corel and Hba datasets. We used the same number of intervals, \(\eta =\lfloor \log _2 N\rfloor +1\), with both approaches; therefore, the only difference between them was how the interval boundaries were placed. The average MAP@k of \(d_{rank}\), \(d_{lin}\), \(m_1\) and \(m_0\) over 10 runs in the Corel and Hba datasets with EFD and EWD is provided in Fig. 9.
The CBIR results in Fig. 9 show that EFD produced results that were better than, or at least competitive with, those of EWD. In no case did it produce worse retrieval results than EWD.
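The difference between the two discretisation approaches compared above can be sketched for a single dimension as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names, the use of NumPy quantiles for EFD, and the synthetic skewed sample are all choices made for this example. Both approaches use \(\eta =\lfloor \log _2 N\rfloor +1\) intervals, as in the text.

```python
import numpy as np


def num_intervals(n):
    """Number of intervals: eta = floor(log2 N) + 1."""
    return int(np.floor(np.log2(n))) + 1


def equal_width_bins(x, eta):
    """EWD: split [min, max] into eta intervals of equal length."""
    edges = np.linspace(x.min(), x.max(), eta + 1)
    # Use interior edges only, so interval ids run from 0 to eta - 1.
    return np.digitize(x, edges[1:-1])


def equal_frequency_bins(x, eta):
    """EFD: place edges at quantiles so each interval holds about N/eta points."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, eta + 1))
    return np.digitize(x, edges[1:-1])


# Hypothetical skewed one-dimensional sample (log-normal), N = 1000.
rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)

eta = num_intervals(len(x))  # floor(log2 1000) + 1 = 10
ewd_counts = np.bincount(equal_width_bins(x, eta), minlength=eta)
efd_counts = np.bincount(equal_frequency_bins(x, eta), minlength=eta)
```

On skewed data such as this sample, EWD places most points in the first few intervals, while EFD keeps the interval counts nearly equal, which is one plausible reason why a count-based measure would retain more resolution in dense regions under EFD.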
