Statistical tests for ‘related records’ search results

Scientometrics

Abstract

Related records searching, now a common option within bibliographic databases, is applied to an individual result record as a secondary way of refining the retrieval set obtained from the primary subject search operation. In one approach, an individual result record is linked to other article records on the basis of the number of cited references they share, the theory being that two articles that cite many of the same sources are likely to be highly similar in subject content. Results of the secondary search are usually displayed in order of each item’s raw number of shared references. In the present paper we suggest an improved way of ranking the results, employing statistical significance tests. We suggest two approaches, one involving a statistical test previously unknown in bibliometric circles, the binomial index of dispersion, and the other employing the more familiar centralized cosine measure; these turn out to produce nearly identical results. An example demonstrating the application of these measures, and contrasting them with the use of raw totals, is provided. In the example the resulting rankings are found to be only modestly (positively) correlated, suggesting that much information is lost to the user when raw totals alone are made the basis for ordering results.
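The contrast between ranking by raw shared-reference totals and ranking by a centered similarity measure can be illustrated with a small sketch. The paper's own formulas are not reproduced on this page, so the code below rests on an assumption: "centralized cosine" is taken to mean the cosine of mean-centered binary citation vectors, which is equivalent to the Pearson (phi) correlation. The reference sets, the names `seed` and `candidates`, and the parameter `UNIVERSE_SIZE` are hypothetical toy values, not data from the article.

```python
from math import sqrt

# Hypothetical toy data: each article is represented by the set of
# reference IDs it cites, drawn from a universe of 100 possible references.
UNIVERSE_SIZE = 100
seed = set(range(0, 4))                # the record the user started from
candidates = {
    "long": set(range(1, 41)),         # 40 references, 3 shared with seed
    "short": set(range(2, 5)),         # 3 references, 2 shared with seed
    "other": {0, 50, 51, 52, 53},      # 5 references, 1 shared with seed
}

def shared_count(refs_a, refs_b):
    """Raw number of references the two articles have in common."""
    return len(refs_a & refs_b)

def centered_cosine(refs_a, refs_b, n):
    """Cosine of mean-centered binary citation vectors of length n.

    Assumed reading of 'centralized cosine'; for 0/1 vectors this
    equals the Pearson (phi) correlation.
    """
    a, b, c = len(refs_a), len(refs_b), len(refs_a & refs_b)
    denom = sqrt(a * (n - a) * b * (n - b))
    if denom == 0:
        return 0.0  # an article citing nothing or everything has no variance
    return (n * c - a * b) / denom

rank_by_count = sorted(
    candidates, key=lambda k: shared_count(seed, candidates[k]), reverse=True
)
rank_by_cosine = sorted(
    candidates,
    key=lambda k: centered_cosine(seed, candidates[k], UNIVERSE_SIZE),
    reverse=True,
)

print("by raw count:      ", rank_by_count)
print("by centered cosine:", rank_by_cosine)
```

In this toy case the article with the long reference list leads the raw-count ranking but falls to last place under the centered measure, because three matches out of forty references are less surprising than two matches out of three.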

Notes

  1. The literature on bibliometrics in general is enormous; for quick accountings of the subject see Donohue (1974), Wolfram (2003), De Bellis (2009), and Haustein (2012). The same is true even for citation analysis alone; consult Borgman (1990), Moed (2005), and De Bellis (2009) for reviews. Importantly, however, much of the attention given to citation analysis has gone to (1) the scientometric context of the sociology of science (e.g., identifying ways of establishing schools of scientific endeavor, roles of key figures, subject trends, etc.) and (2) the evaluation of various kinds of database inconsistencies or omissions (especially as related to impact factors). Much less attention has been given to the investigation of user-focused needs, though this is beginning to change (see, for example, Zhao and Strotmann 2014).

  2. In Web of Science, the command icon “View Related Records” is positioned on the right-hand side of each of the records obtained through the primary search; the pop-up descriptor linked to the command reads “View other records that share references with this one”. There is very little literature on the citation analysis form of related records searching beyond notices in product reviews (e.g., Wiley 1998), probably because the retrieval algorithms involved feature simple match counting.

  3. Regarding biological measures of association, two well-known reviews are Cheetham and Hazel (1969) and Hayek (1994). As far as we can tell such measures have not been used in the past to contribute to a database user-oriented citation analysis mission.

  4. Strictly speaking the order of two articles in each pair (i, j) is not important (i.e., the index is symmetrical), and we do not need to compare an article i with itself. Hence there are only n(n−1)/2 relevant indices.

  5. A primary objective of the present work is to introduce database providers to the possibility of a new kind of tool, but ultimately it will be up to them to adapt the ideas to their own circumstances. The statistical approach itself may of course be applied to any setting, including humanities subjects, that meets the basic conditions of order within the data.

  6. Also, given Eq. (6) and because CSC can take negative values, BID is not a monotonic function of CSC. Hence, the order (or ranking) between CSC and BID is not preserved over the full set of CSC values. In our data set, where negative values of CSC always remain very close to zero, BID and CSC generate the same ranking for relatively similar composers, that is, when CSC takes a positive value not too close to zero. Note that the ranking based on BID is equal to the ranking of the set of absolute values of CSC. The overall nature of this synonymy has led us to consider some related statistical and philosophical issues bearing on the measure and meaning of similarity and relatedness in data sets of the present type. We hope to explore this further in a future publication.
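Two points from Notes 4 and 6 can be checked with a short sketch: that n items yield n(n−1)/2 unordered pairs, and that a ranking by absolute CSC values agrees with a ranking by signed CSC values only when the negative values stay close to zero. Eq. (6) is not shown on this page, so the code does not compute BID itself; it uses only the statement in Note 6 that the BID ranking equals the ranking of |CSC|. All item labels and CSC values are hypothetical.

```python
from itertools import combinations

# Note 4: unordered pairs (i, j) with i != j among n items.
items = ["a", "b", "c", "d", "e"]          # hypothetical item labels
pairs = list(combinations(items, 2))
n = len(items)
print(len(pairs), n * (n - 1) // 2)        # both counts are 10 for n = 5

def rank_signed(scores):
    """Rank labels from highest to lowest signed CSC value."""
    return sorted(scores, key=scores.get, reverse=True)

def rank_absolute(scores):
    """Rank labels by |CSC|, which Note 6 says matches the BID ranking."""
    return sorted(scores, key=lambda k: abs(scores[k]), reverse=True)

# Hypothetical CSC values with one strongly negative entry:
# the two rankings diverge.
wide = {"p": 0.62, "q": 0.15, "r": -0.02, "s": -0.40}
print(rank_signed(wide))                   # ['p', 'q', 'r', 's']
print(rank_absolute(wide))                 # ['p', 's', 'q', 'r']

# Hypothetical CSC values whose negatives stay near zero:
# the two rankings coincide.
narrow = {"p": 0.62, "q": 0.15, "r": -0.02}
print(rank_signed(narrow) == rank_absolute(narrow))   # True
```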

References

  • Bassanezi, R. B., Filho, A. B., Amorim, L., Gimenes-Fernandes, N., Gottwald, T. R., & Bové, J. M. (2003). Spatial and temporal analyses of citrus sudden death as a tool to generate hypotheses concerning its etiology. Phytopathology, 93(4), 502–512.

  • Borgman, C. L. (1990). Scholarly communication and bibliometrics. Newbury Park, CA: Sage Publications.

  • Bradford, S. C. (1934). Sources of information on specific subjects. Engineering: An Illustrated Weekly Journal (London), 137, 85–86.

  • Brandyberry, A., Rai, A., & White, G. P. (1999). Intermediate performance impacts of advanced manufacturing technology systems: An empirical investigation. Decision Sciences, 30(4), 993–1020.

  • Cheetham, A. H., & Hazel, J. E. (1969). Binary (presence–absence) similarity coefficients. Journal of Paleontology, 43(5), 1130–1136.

  • De Bellis, N. (2009). Bibliometrics and citation analysis: From the Science Citation Index to cybermetrics. Lanham, MD: Scarecrow Press.

  • Donohue, J. C. (1974). Understanding scientific literatures: A bibliometric approach. Cambridge, MA: MIT Press.

  • Duncan, O. D., & Duncan, B. (1955). A methodological analysis of segregation indexes. American Sociological Review, 20(2), 210–217.

  • Giller, G. L. (2012). The statistical properties of random bitstreams and the sampling distribution of cosine similarity. Giller Investments Research Notes (20121024/1). http://dx.doi.org/10.2139/ssrn.2167044. Accessed 17 January 2015

  • Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional, and institutional level. Scientometrics, 37(2), 195–221.

  • Haustein, S. (2012). Multidimensional journal evaluation: Analyzing scientific periodicals beyond the impact factor. Berlin: De Gruyter/Saur.

  • Hayek, L.-A. C. (1994). Analysis of amphibian biodiversity data. In W. R. Heyer, M. A. Donnelly, R. W. McDiarmid, L.-A. C. Hayek, & M. S. Foster (Eds.), Measuring and monitoring biological diversity: Standard methods for amphibians (pp. 207–269). Washington, DC: Smithsonian Books.

  • Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37(142), 547–579.

  • Lotka, A. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16(12), 317–324.

  • Moed, H. F. (2005). Citation analysis in research evaluation. Dordrecht: Springer.

  • Potthoff, R. F., & Whittinghill, M. (1966). Testing for homogeneity. I. The binomial and multinomial distributions. Biometrika, 53, 167–182.

  • Rogosa, D., Floden, R., & Willett, J. B. (1984). Assessing the stability of teacher behavior. Journal of Educational Psychology, 76(6), 1000–1027.

  • Sen, S. K., & Gan, S. K. (1983). A mathematical extension of the idea of bibliographic coupling and its applications. Annals of Library and Information Studies, 30(2), 78–82.

  • Smith, C. H. (2000). The classical music navigator. http://people.wku.edu/charles.smith/music/. Accessed 17 November 2014

  • Smith, C. H., & Georges, P. (2014). Composer similarities through ‘The Classical Music Navigator’: Similarity inference from composer influences. Empirical Studies of the Arts, 32(2), 205–229.

  • Spósito, M. B., Amorim, L., Ribeiro, P. J. Jr., Bassanezi, R. B., & Krainski, E. T. (2007). Spatial pattern of trees affected by black spot in citrus groves in Brazil. Plant Disease, 91(1), 36–40.

  • Waller, L. A., & Gotway, C. A. (2004). Applied spatial statistics for public health data. Hoboken, NJ: Wiley.

  • Wiley, D. L. (1998). Cited references on the Web: A review of ISI’s ‘Web of Science’. Searcher, 6(1), 32–39, 57.

  • Wolfram, D. (2003). Applied informetrics for information retrieval research. Westport, CT: Libraries Unlimited.

  • Zhao, D., & Strotmann, A. (2014). In-text author citation analysis: Feasibility, benefits, and limitations. Journal of the Association for Information Science and Technology, 65(11), 2348–2358.

Author information

Corresponding author

Correspondence to Charles H. Smith.

About this article

Cite this article

Smith, C.H., Georges, P. & Nguyen, N. Statistical tests for ‘related records’ search results. Scientometrics 105, 1665–1677 (2015). https://doi.org/10.1007/s11192-015-1610-x

  • DOI: https://doi.org/10.1007/s11192-015-1610-x
