A decade of research in statistics: a topic model approach

Scientometrics

Abstract

Topic models are a well-known clustering approach for textual data, with promising applications in bibliometrics for discovering scientific topics and trends in a corpus of scientific publications. However, topic models per se provide poorly descriptive metadata for the discovered clusters of publications, and they are not related to the other important metadata usually available with publications, such as author affiliation, publication venue, and publication year. In this paper, we propose a methodological approach to topic modeling and to the post-processing of topic model results, with the aim of describing a field of research in depth over time. In particular, we work on a selection of publications from the international statistical literature, we propose an approach that allows us to identify sophisticated topic descriptors, and we analyze the links between topics and their temporal evolution.

Notes

  1. In our approach, the relevance \(\rho (p,T)\) of a topic T for a paper p is the log-likelihood returned for each paper by the VEM algorithm implementation as provided in the R topicmodels package (Grün and Hornik 2011).

  2. We note that the number of papers analyzed is lower than the total number of papers in the corpus, because we excluded from the topic analysis those papers with incomplete metadata, as well as those containing editorial material rather than a proper scientific contribution.

  3. T960, with 124 papers, was not considered here because it groups Discussion papers. Similarly, T906, with 20 papers, was not considered here because it groups all the Erratum papers.

  4. In the subsequent formula and in all the other formulae in the rest of the paper, the \(\log \) symbol refers to the base-10 logarithm.

  5. We recall that a topic with an index of h includes h papers, each of which has been cited in other papers at least h times.

  6. Corrado Gini’s concentration index; the value 0 indicates equality or uniform distribution, the value 1 indicates maximum concentration.
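
The two indicators recalled in notes 5 and 6 can be illustrated with a minimal sketch. It assumes that a topic is represented simply as a list of per-paper citation counts, and that the Gini index is taken in its normalized form (rescaled by n/(n-1) so that it reaches 1 at maximum concentration, consistent with note 6). The function names and the sample counts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def gini_concentration(values: Sequence[float]) -> float:
    """Normalized Gini concentration ratio.

    Returns 0 for a uniform distribution and 1 when a single unit
    holds the whole total. The n/(n-1) rescaling is an assumption
    made here so that the maximum is exactly 1.
    """
    n = len(values)
    total = sum(values)
    if n < 2 or total == 0:
        return 0.0  # concentration is undefined here; report 0 by convention
    ordered = sorted(values)
    # Rank-weighted sum over the values sorted in ascending order.
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    g = 2 * weighted / (n * total) - (n + 1) / n
    return g * n / (n - 1)


# Made-up citation counts for the papers of one hypothetical topic.
topic_citations = [10, 8, 5, 4, 3]
print(h_index(topic_citations))
print(gini_concentration(topic_citations))
```

With these made-up counts, four papers have at least four citations each while the fifth has only three, so the topic index is 4.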

References

  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

  • Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

  • Ferrara, A., & Salini, S. (2012). Ten challenges in modeling bibliographic data for bibliometric analysis. Scientometrics, 93, 765–787.

  • Genest, C. (1997). Statistics on statistics: Measuring research productivity by journal publications between 1985 and 1995. The Canadian Journal of Statistics, 25(4), 427–433.

  • Genest, C. (1999). Probability and statistics: A tale of two worlds? The Canadian Journal of Statistics, 27(2), 421–444.

  • Genest, C. (2002). Worldwide research output in probability and statistics: An update. The Canadian Journal of Statistics, 30(2), 329–342.

  • Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30.

  • Gupta, H. M., Campanha, J. R., & Pesce, R. A. G. (2005). Power-law distributions for the citation index of scientific publications and scientists. Brazilian Journal of Physics, 35(4A), 981–986.

  • Hall, D., Jurafsky, D., & Manning, C. (2008). Studying the history of ideas using topic models. In Proceedings of the conference on empirical methods in natural language processing (pp. 363–371). Honolulu, Hawaii: Association for Computational Linguistics.

  • Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572.

  • Mimno, D., & Blei, D. (2011). Bayesian checking for topic models. In Proceedings of the conference on empirical methods in natural language processing (pp. 227–237). Association for Computational Linguistics.

  • Newman, M. E. J. (2006). Power laws, Pareto distributions and Zipf’s law. arXiv:cond-mat/0412004v3.

  • Ryan, T. P., & Woodall, W. H. (2005). The most-cited statistical papers. Journal of Applied Statistics, 32(5), 461–474.

  • Schell, M. J. (2010). Identifying key statistical papers from 1985 to 2002 using citation data for applied biostatisticians. The American Statistician, 64(4), 310–317.

  • Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Handbook of latent semantic analysis, chapter 21.

  • Stigler, S. (1994). Citation patterns in the journals of statistics and probability. Statistical Science, 9(1), 94–108.

Author information

Corresponding author

Correspondence to Silvia Salini.

About this article

Cite this article

De Battisti, F., Ferrara, A. & Salini, S. A decade of research in statistics: a topic model approach. Scientometrics 103, 413–433 (2015). https://doi.org/10.1007/s11192-015-1554-1

  • DOI: https://doi.org/10.1007/s11192-015-1554-1
