Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus

Scientometrics

Abstract

A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. Such matching has been studied many times in the literature, but, because of the massive volume of indexed publications, only on the basis of journal information. Matching at the paper level allows missing or erroneous items in one source to be completed from the other, and allows the overlap to be measured more reliably. In this context, we measure the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. Our focus is on detecting exact matches; that is, no false positives are tolerated. To this end, we follow a two-step matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only the cited publications are retained. The overlapping WoS records are also broken down by Institute for Scientific Information subject categories (SCs). Of those, three large SCs with relatively low overlap ratios are chosen and examined in detail. Finally, matching 14.2 million against 19.6 million publications from the publication years 2004–2013 takes just about an hour in a high-performance computing environment.
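The first step of the procedure, generating pair suggestions with locality sensitive hashing, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a random-hyperplane scheme in the spirit of Charikar (2002) applied to character trigram counts of titles, and the function names, the 64-bit signature length and the 8-band bucketing are all hypothetical choices made for illustration only.

```python
import hashlib
import math
import random
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count character n-grams of a lower-cased, whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def _weight(feature, bit, seed=0):
    """Deterministic pseudo-random N(0, 1) weight for a (feature, bit) pair.

    Hashing the pair avoids storing one random vector per hyperplane
    (an illustrative shortcut, assumed here rather than taken from the paper).
    """
    digest = hashlib.md5(f"{seed}|{bit}|{feature}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big")).gauss(0.0, 1.0)


def signature(text, bits=64, n=3, seed=0):
    """Bit b is set when the n-gram vector has a non-negative projection on hyperplane b."""
    grams = char_ngrams(text, n)
    sig = 0
    for b in range(bits):
        dot = sum(count * _weight(g, b, seed) for g, count in grams.items())
        if dot >= 0:
            sig |= 1 << b
    return sig


def estimated_cosine(sig_a, sig_b, bits=64):
    """Estimate cosine similarity from the Hamming distance between two signatures."""
    hamming = bin(sig_a ^ sig_b).count("1")
    return math.cos(math.pi * hamming / bits)


def candidate_pairs(wos, scopus, bits=64, bands=8):
    """Suggest (wos_id, scopus_id) pairs whose signatures agree on at least one band."""
    rows = bits // bands
    mask = (1 << rows) - 1
    buckets = defaultdict(list)
    for sid, title in scopus.items():
        sig = signature(title, bits)
        for band in range(bands):
            buckets[(band, (sig >> (band * rows)) & mask)].append(sid)
    pairs = set()
    for wid, title in wos.items():
        sig = signature(title, bits)
        for band in range(bands):
            for sid in buckets.get((band, (sig >> (band * rows)) & mask), []):
                pairs.add((wid, sid))
    return pairs


if __name__ == "__main__":
    wos = {"w1": "Use of locality sensitive hashing to match databases"}
    scopus = {
        "s1": "use of locality  sensitive hashing to match databases",
        "s2": "Soil chemistry of alpine meadows",
    }
    for pair in sorted(candidate_pairs(wos, scopus)):
        print(pair)
```

In this sketch, records whose normalised titles are identical always receive the same signature and are therefore always suggested; near-duplicates are suggested with a probability that grows with their cosine similarity. The suggestions would still have to pass the second, heuristic verification step described above before being accepted as matches.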

Fig. 1

[Note that this figure is based on two figures by Benjamin Van Durme and Ashwin Lall, taken from their presentation of Van Durme and Lall (2010) at the 48th Annual Meeting of the Association for Computational Linguistics, and is used here with their kind permission]

Fig. 2

References

  • Abdulhayoglu, M. A., & Thijs, B. (2017). Use of locality sensitive hashing (LSH) algorithm to match Web of Science and SCOPUS. In Proceedings of the fifth workshop on bibliometric-enhanced information retrieval (BIR) co-located with the 39th European conference on information retrieval (ECIR), Aberdeen, UK (pp. 30–40).

  • Abdulhayoglu, M. A., Thijs, B., & Jeuris, W. (2016). Using character n-grams to match a list of publications to references in bibliographic databases. Scientometrics, 109(3), 1525–1546.

  • Bosman, J., Mourik, I. V., Rasch, M., Sieverts, E., & Verhoeff, H. (2006). Scopus reviewed and compared: The coverage and functionality of the citation database Scopus, including comparisons with Web of Science and Google Scholar. Utrecht University Library.

  • Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

  • Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on theory of computing (pp. 380–388). ACM.

  • Egghe, L., & Goovaerts, M. (2007). A note on measuring overlap. Journal of Information Science, 33(2), 189–195.

  • Gavel, Y., & Iselid, L. (2008). Web of Science and Scopus: A journal title overlap study. Online Information Review, 32(1), 8–21.

  • Gluck, M. (1990). A review of journal coverage overlap with an extension to the definition of overlap. Journal of the American Society for Information Science, 41(1), 43–60.

  • Hood, W. W., & Wilson, C. S. (2003). Overlap in bibliographic databases. Journal of the American Society for Information Science and Technology, 54(12), 1091–1103.

  • Indyk, P. (2000). High-dimensional computational geometry. Doctoral Dissertation, Stanford University.

  • Indyk, P., & Motwani, R. (1998). Approximate nearest neighbours: Towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on theory of computing (pp. 604–613). ACM.

  • Kondrak, G. (2005). N-gram similarity and distance. In International symposium on string processing and information retrieval (pp. 115–126). Springer, Berlin.

  • Kurzak, J., Alvaro, W., & Dongarra, J. (2009). Optimizing matrix multiplication for a short-vector SIMD architecture–CELL processor. Parallel Computing, 35(3), 138–150.

  • Meho, L. I., & Rogers, Y. (2008). Citation counting, citation ranking, and h-index of human-computer interaction researchers: A comparison of Scopus and Web of Science. Journal of the American Society for Information Science and Technology, 59(11), 1711–1726.

  • Pao, M. L. (1993). Term and citation retrieval: A field study. Information Processing and Management, 29(1), 95–112.

  • Ravichandran, D., Pantel, P., & Hovy, E. (2005). Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics (pp. 622–629). Association for Computational Linguistics.

  • Van Durme, B., & Lall, A. (2010). Online generation of locality sensitive hash signatures. In Proceedings of the ACL 2010 conference short papers (pp. 231–235). Association for Computational Linguistics.

Author information

Corresponding author

Correspondence to Mehmet Ali Abdulhayoglu.

Appendix

figure a
figure b
figure c

Cite this article

Abdulhayoglu, M.A., Thijs, B. Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus. Scientometrics 116, 1229–1245 (2018). https://doi.org/10.1007/s11192-017-2569-6

  • DOI: https://doi.org/10.1007/s11192-017-2569-6
