Abstract
The ongoing discussion in the bibliometric community about the best similarity measures has led to diverse insights. Although these insights are sometimes contradictory, one conclusion is very consistent: hybrid measures outperform their individual components applied on their own. While this provides an initial answer to the question of which similarity measure is best, it also raises issues that have been resolved in part for conventional similarity measures but remain open for hybrid ones. In this study we therefore investigate, in the context of hybrid similarities, the impact of the weighting factors, the appropriate level of edge cutting, the performance of first-order in contrast to second-order similarities, and the interaction of these three parameters. Building upon a dataset of over 8000 articles from the manufacturing engineering field and using different parameter settings, we calculated over 100 similarity matrices. For each matrix we determined several cluster solutions at different resolution levels, ranging from 100 to 1000 clusters, and evaluated them quantitatively using a textual coherence value based on the Jensen-Shannon divergence. We found that second-order hybrid similarity measures, calculated with a weighting factor of 0.6 for the citation-based similarity and reduced to only the strongest values, yield the best clustering results. Furthermore, we found the assessed parameters to be highly interdependent: for example, hybrid first-order similarity outperforms second-order similarity when no edge cutting is applied. Our results can therefore serve the bibliometric community as a guideline for the appropriate application of hybrid measures.
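The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the hybrid measure is taken to be a linear combination of a citation-based and a text-based similarity matrix, edge cutting is taken to mean keeping the `keep` strongest values per document, second-order similarity is taken to be the cosine between rows of the first-order matrix, and the function names (`hybrid_similarity`, `cut_edges`, `second_order`, `jensen_shannon`) are hypothetical. Only the weighting factor of 0.6 for the citation-based part comes from the text.

```python
import numpy as np

def cosine_rows(m):
    """Cosine similarity between all pairs of rows of m."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

def hybrid_similarity(sim_citation, sim_text, weight=0.6):
    """Assumed linear combination; `weight` applies to the citation-based part."""
    return weight * np.asarray(sim_citation) + (1.0 - weight) * np.asarray(sim_text)

def cut_edges(sim, keep=1):
    """Assumed edge cutting: keep the `keep` strongest off-diagonal values
    per row, set the rest to zero, then symmetrise the result."""
    s = np.array(sim, dtype=float)
    np.fill_diagonal(s, 0.0)
    top = np.argsort(s, axis=1)[:, -keep:]   # column indices of strongest values
    rows = np.arange(s.shape[0])[:, None]
    cut = np.zeros_like(s)
    cut[rows, top] = s[rows, top]
    return np.maximum(cut, cut.T)

def second_order(sim):
    """Assumed second-order similarity: cosine between the similarity
    profiles (rows) of the first-order matrix, with self-similarity set to 1."""
    s = np.array(sim, dtype=float)
    np.fill_diagonal(s, 1.0)
    return cosine_rows(s)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two term distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy data (made up): four documents, two similarity views.
sim_citation = np.array([[1.0, 0.9, 0.1, 0.2],
                         [0.9, 1.0, 0.3, 0.1],
                         [0.1, 0.3, 1.0, 0.8],
                         [0.2, 0.1, 0.8, 1.0]])
sim_text = sim_citation.copy()

first = hybrid_similarity(sim_citation, sim_text, weight=0.6)
cut = cut_edges(first, keep=1)
second = second_order(cut)
```

In this sketch a clustering step would take `second` as input, and `jensen_shannon` would compare the term distribution of a cluster with that of the documents in it. How the study aggregates such divergences into a coherence value is not stated in the abstract and is left open here.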
Cite this article
Meyer-Brötz, F., Schiebel, E. & Brecht, L. Experimental evaluation of parameter settings in calculation of hybrid similarities: effects of first- and second-order similarity, edge cutting, and weighting factors. Scientometrics 111, 1307–1325 (2017). https://doi.org/10.1007/s11192-017-2366-2
DOI: https://doi.org/10.1007/s11192-017-2366-2