Similarity measures for document mapping: A comparative study on the level of an individual scientist

Scientometrics

Abstract

This paper investigates the utility of the Inclusion Index, the Jaccard Index and the Cosine Index for calculating similarities between documents, as used for mapping science and technology. It is shown that, provided that the same content is searched across various documents, the Inclusion Index generally delivers more exact results, in particular when the degree of similarity is computed from citation data. In addition, several methodologies are compared: co-word analysis, Subject-Action-Object (SAO) structures, bibliographic coupling, co-citation analysis, and self-citation links. We find that the first two (co-word analysis and SAO structures) tend to describe semantic similarities, which differ from the knowledge flows expressed by the citation-based methodologies.
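The abstract names the three indices without giving their formulas. As a minimal sketch, assuming the common set-based (binary) definitions and not necessarily the exact formulations used in the paper, the measures can be computed for two documents represented as sets of features (for example, cited references or index terms). The function names and the sample data below are hypothetical.

```python
from math import sqrt


def jaccard_index(a, b):
    """Shared features divided by all distinct features of both documents."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_index(a, b):
    """Shared features divided by the geometric mean of the two set sizes
    (Salton's cosine for binary feature vectors)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def inclusion_index(a, b):
    """Shared features divided by the size of the smaller set
    (assumed definition; the abstract does not state the formula)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical example: a short document whose two references are both
# contained in the reference list of a much longer document.
refs_short = {"r1", "r2"}
refs_long = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}

print(jaccard_index(refs_short, refs_long))    # 2 / 8
print(cosine_index(refs_short, refs_long))     # 2 / sqrt(2 * 8)
print(inclusion_index(refs_short, refs_long))  # 2 / 2
```

Under these assumed definitions, the Inclusion Index reports full overlap whenever the smaller set is contained in the larger one, while the Jaccard and Cosine indices are pulled down by the difference in set size. This may be one reason the measures behave differently on citation data, where reference lists vary widely in length.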

Author information

Corresponding author

Correspondence to Christian Sternitzke.

About this article

Cite this article

Sternitzke, C., Bergmann, I. Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics 78, 113–130 (2009). https://doi.org/10.1007/s11192-007-1961-z

  • DOI: https://doi.org/10.1007/s11192-007-1961-z
