On Computing the Jaro Similarity Between Two Strings

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2023)

Abstract

Jaro similarity is widely used in computing the similarity (or distance) between two strings of characters. For example, record linkage is an application of great interest in many domains for which Jaro similarity is popularly employed. Existing algorithms for computing the Jaro similarity between two given strings take quadratic time in the worst case. In this paper, we present an algorithm for Jaro similarity computation that takes only linear time. We also present experimental results that reveal that our algorithm outperforms existing algorithms.
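For orientation, the standard textbook definition of Jaro similarity can be sketched as below. This is a minimal illustration of the classical approach, whose nested matching loop takes quadratic time in the worst case; it is not the linear-time algorithm of the paper, which this preview does not describe. The function name `jaro_similarity` is chosen here for illustration only.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Classical Jaro similarity (illustrative baseline, not the paper's algorithm)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two characters may match only if their positions differ by at most `window`.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    # Greedy matching: each character of s1 takes the first unmatched equal
    # character of s2 inside its window. This nested scan is the quadratic part.
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count the
    # positions where they disagree; transpositions are half that count.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
```

For example, `jaro_similarity("MARTHA", "MARHTA")` finds six matches and one transposition, giving (1 + 1 + 5/6) / 3 = 17/18, roughly 0.944.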

Acknowledgements

This work was partially supported by the United States Census Bureau under Award Number CB21RMD0160003. The content is solely the responsibility of the authors and does not necessarily represent the official views of the US Census Bureau.

Author information

Corresponding author

Correspondence to Joyanta Basak.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Basak, J. et al. (2023). On Computing the Jaro Similarity Between Two Strings. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_3

  • DOI: https://doi.org/10.1007/978-981-99-7074-2_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7073-5

  • Online ISBN: 978-981-99-7074-2

  • eBook Packages: Computer Science, Computer Science (R0)
