
Semimetric Properties of Sørensen-Dice and Tversky Indexes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9627)

Abstract

In this work we prove a semimetric property for distances used to measure the dissimilarity between two finite sets, such as the Sørensen-Dice and Tversky indexes. The Jaccard-Tanimoto index is one of the most common distances for this task. Because it is a metric, several algorithms can be applied to retrieve information from data when it is used. Although the Sørensen-Dice index is known to be more robust than the Jaccard-Tanimoto index when some information is missing from the dataset, it is not a metric, as it does not satisfy the triangle inequality. Recently, several machine learning algorithms have been proposed that use non-metric distances. Instead of the triangle inequality, these algorithms require the distance to satisfy the approximate triangle inequality with some small value of \(\rho\). This motivates us to find the value of \(\rho\) for the Sørensen-Dice index. In this paper, we prove that this value is 1.5. We also find the value for some of the Tversky indexes.
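The claims in the abstract can be checked on small sets with a short sketch. It is not taken from the paper. It assumes the usual set-based definitions: the Sørensen-Dice distance is 1 − 2|A ∩ B| / (|A| + |B|), the Jaccard-Tanimoto distance is 1 − |A ∩ B| / |A ∪ B|, and the approximate triangle inequality reads d(A, C) ≤ ρ (d(A, B) + d(B, C)). The function names and the convention that two empty sets are at distance 0 are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations


def dice_distance(a, b):
    """Sørensen-Dice distance 1 - 2|A ∩ B| / (|A| + |B|), as an exact fraction."""
    if not a and not b:
        return Fraction(0)  # assumption: two empty sets are at distance 0
    return 1 - Fraction(2 * len(a & b), len(a) + len(b))


def jaccard_distance(a, b):
    """Jaccard-Tanimoto distance 1 - |A ∩ B| / |A ∪ B|, as an exact fraction."""
    if not a and not b:
        return Fraction(0)  # same assumption as above
    return 1 - Fraction(len(a & b), len(a | b))


def worst_ratio(distance, universe):
    """Largest d(A, C) / (d(A, B) + d(B, C)) over non-empty subsets of universe.

    A value above 1 means the triangle inequality fails; the value itself is
    the smallest rho that works for these particular subsets.
    """
    subsets = [frozenset(c)
               for r in range(1, len(universe) + 1)
               for c in combinations(universe, r)]
    worst = Fraction(0)
    for a in subsets:
        for b in subsets:
            for c in subsets:
                detour = distance(a, b) + distance(b, c)
                if detour > 0:  # detour == 0 only when A = B = C
                    worst = max(worst, distance(a, c) / detour)
    return worst


# A triple that violates the plain triangle inequality for Sørensen-Dice.
A, B, C = frozenset("a"), frozenset("ab"), frozenset("b")
print(dice_distance(A, B), dice_distance(B, C), dice_distance(A, C))

# Exhaustive check over all non-empty subsets of a four-element universe.
print(worst_ratio(dice_distance, "abcd"))
print(worst_ratio(jaccard_distance, "abcd"))
```

For the triple A = {a}, B = {a, b}, C = {b}, the Sørensen-Dice distances are 1/3, 1/3 and 1, so the direct distance exceeds the detour 2/3 but equals exactly 1.5 times it. The exhaustive search is only a sanity check on a tiny universe, not a substitute for the proof.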



Acknowledgement

The authors would like to thank Mr. Naoto Osaka and Prof. Hiroshi Imai for several useful comments during the course of this research.

Author information

Corresponding author

Correspondence to Vorapong Suppakitpaisarn.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Gragera, A., Suppakitpaisarn, V. (2016). Semimetric Properties of Sørensen-Dice and Tversky Indexes. In: Kaykobad, M., Petreschi, R. (eds) WALCOM: Algorithms and Computation. WALCOM 2016. Lecture Notes in Computer Science, vol 9627. Springer, Cham. https://doi.org/10.1007/978-3-319-30139-6_27

  • DOI: https://doi.org/10.1007/978-3-319-30139-6_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30138-9

  • Online ISBN: 978-3-319-30139-6

  • eBook Packages: Computer Science, Computer Science (R0)
