Abstract
In this work we prove a semimetric property for distances used to measure the dissimilarity between two finite sets, such as the Sørensen-Dice and Tversky indexes. The Jaccard-Tanimoto index is one of the most common distances for this task. Because that distance is a metric, several algorithms can be applied to retrieve information from the data when it is used. Although the Sørensen-Dice index is known to be more robust than the Jaccard-Tanimoto index when some information is missing from datasets, its distance is not a metric, as it does not satisfy the triangle inequality. Recently, several machine learning algorithms that use non-metric distances have been proposed. Instead of the triangle inequality, these algorithms require the distance to satisfy an approximate triangle inequality with some small value of \(\rho \). This motivates us to find the value of \(\rho \) for the Sørensen-Dice index. In this paper, we prove that this value is 1.5. We also find the value of \(\rho \) for some of the Tversky indexes.
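The claim about \(\rho \) can be checked numerically on small examples. The following is a minimal sketch, not the authors' method: it assumes the Sørensen-Dice distance is \(1 - 2|A \cap B| / (|A| + |B|)\), the Jaccard-Tanimoto distance is \(1 - |A \cap B| / |A \cup B|\), and the approximate triangle inequality has the form \(d(A, C) \le \rho \, (d(A, B) + d(B, C))\). The function names and the restriction to non-empty subsets of a small universe are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations, product


def dice_distance(a, b):
    """Sorensen-Dice distance between two non-empty finite sets (assumed form)."""
    return 1 - Fraction(2 * len(a & b), len(a) + len(b))


def jaccard_distance(a, b):
    """Jaccard-Tanimoto distance between two non-empty finite sets (assumed form)."""
    return 1 - Fraction(len(a & b), len(a | b))


def nonempty_subsets(universe):
    """All non-empty subsets of a small universe, as frozensets."""
    items = sorted(universe)
    return [
        frozenset(chosen)
        for size in range(1, len(items) + 1)
        for chosen in combinations(items, size)
    ]


def worst_ratio(distance, universe):
    """Largest d(A, C) / (d(A, B) + d(B, C)) over all triples of subsets.

    This is the smallest rho that works on this universe only; it is a
    brute-force check, not a proof for arbitrary finite sets.
    """
    subsets = nonempty_subsets(universe)
    worst = Fraction(0)
    for a, b, c in product(subsets, repeat=3):
        detour = distance(a, b) + distance(b, c)
        if detour == 0:
            # Happens only when A = B = C, where d(A, C) is 0 as well.
            continue
        worst = max(worst, distance(a, c) / detour)
    return worst


if __name__ == "__main__":
    universe = range(4)
    print("Dice:   ", worst_ratio(dice_distance, universe))
    print("Jaccard:", worst_ratio(jaccard_distance, universe))
```

Under these assumptions the search reports 3/2 for the Sørensen-Dice distance and 1 for the Jaccard-Tanimoto distance. Both worst cases are reached by the triple A = {0}, B = {0, 1}, C = {1}, where the two Dice detour legs are each 1/3 while the direct distance is 1.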
Acknowledgement
The authors would like to thank Mr. Naoto Osaka and Prof. Hiroshi Imai for several useful comments during the course of this research.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Gragera, A., Suppakitpaisarn, V. (2016). Semimetric Properties of Sørensen-Dice and Tversky Indexes. In: Kaykobad, M., Petreschi, R. (eds) WALCOM: Algorithms and Computation. WALCOM 2016. Lecture Notes in Computer Science, vol 9627. Springer, Cham. https://doi.org/10.1007/978-3-319-30139-6_27
DOI: https://doi.org/10.1007/978-3-319-30139-6_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30138-9
Online ISBN: 978-3-319-30139-6
eBook Packages: Computer Science, Computer Science (R0)