Abstract
In this article we study and characterize hyperprolific authors: the most productive researchers in a given repository over a specific period of time. In particular, we investigate and characterize the subset of hyperprolific authors who exhibit a sudden growth in the number of published articles and coauthors, and who concentrate their publications in a few specific journals, which can be seen as anomalous behavior. Using data collected from the DBLP repository and covering the last 10 years, we propose a set of discriminative dimensions (features) aimed at characterizing the behavior of hyperprolific authors, ultimately helping to identify anomalous ones. Moreover, using a strategy based on rank aggregation to identify the most prominent anomalous authors, we show that the dimensions that best characterize such anomalous behavior may vary significantly among authors, but that a clear subset of authors who exhibit it can nevertheless be identified. Our results show that the top-ranked (most anomalous) authors behave quite differently from the middle-ranked ones. Indeed, each of the five most anomalous authors published more than 48 journal articles in 2021 while collaborating with more than 1,000 coauthors over their careers. Notably, one of these authors published more than 140 articles in a single journal.
Notes
http://uis.unesco.org/apps/visualisations/research-and-development-spending, accessed on April 25, 2022.
https://ncses.nsf.gov/pubs/nsb20206/publication-output-by-region-country-or-economy, accessed on April 25, 2022.
http://dblp.org, accessed on May 14, 2022.
https://scholar.google.com, accessed on May 14, 2022.
https://arxiv.org, accessed on May 14, 2022.
https://pubmed.ncbi.nlm.nih.gov, accessed on May 14, 2022.
https://www.aminer.org, accessed on May 15, 2022.
In this article we may use the words “researcher” and “author” to refer to the same person, depending on that person’s role in the issue under discussion.
We computed those numbers based on the dataset we created for our experiments; see Subsection 3.1.
The Empirical Cumulative Distribution Function (ECDF) value for a given point p on the horizontal axis is the fraction of observations of the variable with values less than or equal to p.
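This definition can be computed directly; a minimal sketch in Python, where the publication counts are purely illustrative:

```python
def ecdf(values, p):
    """Fraction of observations less than or equal to p."""
    return sum(v <= p for v in values) / len(values)

# Illustrative data: journal articles per author in some year.
counts = [2, 5, 5, 8, 12, 20, 48]
print(ecdf(counts, 8))  # 4 of the 7 values are <= 8, i.e. 4/7
```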
https://scholar.google.com/intl/en/scholar/citations.html, accessed on May 16, 2022.
https://www.microsoft.com/en-us/research/project/microsoft-academic-graph, accessed on June 25, 2022.
https://jcr.clarivate.com, accessed on June 19, 2022.
https://incites.help.clarivate.com/Content/Indicators-Handbook/ih-journal-impact-factor.htm, accessed on June 25, 2022.
A mega journal is a type of journal in which publishers charge authors, rather than readers, for article publication.
https://predatory-publishing.com/how-many-predatory-journals-are-there, accessed on June 19, 2022.
https://www.interacademies.org, accessed on May 9, 2022.
Five temporal scenarios \(\times\) four temporal metrics \(\times\) two summarizations \(+\) the entropy \(+\) the publication intensity.
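The count follows from the note’s own breakdown; a one-line arithmetic check:

```python
# Feature count, using the numbers stated in the note.
scenarios = 5       # temporal scenarios
metrics = 4         # temporal metrics
summarizations = 2
extras = 2          # entropy + publication intensity
total = scenarios * metrics * summarizations + extras
print(total)  # 42
```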
This is the ECDF value obtained by subtracting 0.31 (31%) from one; it corresponds to the portion of the reference set’s curve to the right of the vertical dotted line.
Recall that we use the term hyperproductive for the set of the most productive researchers of the considered time period.
For the sake of clarity, we limit the graph plot to the 10 top-ranked researchers only. The same reasoning applies to the analysis of the features discussed next.
We consider ties as sharing the same rank.
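Under this convention, tied values share the same (smallest) rank and the next distinct value skips ahead accordingly. A stdlib-only sketch with invented anomaly scores:

```python
def min_ranks(scores):
    """Descending ranks where tied scores share the same (minimum) rank."""
    first_position = {}
    for i, s in enumerate(sorted(scores, reverse=True), start=1):
        first_position.setdefault(s, i)  # keep the first (smallest) position
    return [first_position[s] for s in scores]

# Two authors tied at 0.7 both get rank 2; the next author gets rank 4.
print(min_ranks([0.9, 0.7, 0.7, 0.4]))  # [1, 2, 2, 4]
```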
https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm, accessed on January 1, 2022.
All resources needed to reproduce our experiments, including our source code and dataset, are available at https://github.com/edreqm/raise-of-hiperprolific.
We take the number of working days in 2021 as a reference and use the same number for all 10 years: from the 365 days of the year we exclude the 104 weekend days. To be conservative, we do not exclude holidays.
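The resulting reference figure follows directly from the numbers in the note:

```python
# Working days in the reference year: 365 days minus the 104 weekend
# days; holidays are deliberately kept (conservative choice).
working_days_per_year = 365 - 104
years = 10
print(working_days_per_year)          # 261
print(working_days_per_year * years)  # 2610 working days over the decade
```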
To compute the topics, we first concatenate all of an author’s publication titles to create one document per author. Then, we eliminate all adverbs, non-English words, and stop words. Finally, we apply the CluWords algorithm to discover the 12 topics and the top 10 words describing each of them. We chose 12 topics to match the 12 Computer Science subfields defined in https://en.wikipedia.org/wiki/Outline_of_computer_science (accessed on December 11, 2022).
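A minimal sketch of the preprocessing step only (the titles and the tiny stop-word list are illustrative, and the CluWords topic-modeling step itself is not reproduced here):

```python
# Build one "document" per author by concatenating publication titles,
# then drop stop words. The paper additionally removes adverbs and
# non-English words, which this sketch omits.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def author_document(titles):
    tokens = " ".join(titles).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

titles = ["A Survey of Topic Models", "Topic Modeling for Short Texts"]
print(author_document(titles))
# ['survey', 'topic', 'models', 'topic', 'modeling', 'short', 'texts']
```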
https://pubmed.ncbi.nlm.nih.gov, accessed on June 27, 2022.
References
Antkare, I. (2020). Ike Antkare, his publications, and those of his disciples. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 177–200). MIT Press.
Berghel, H. (2022). A Collapsing Academy, Part III: Scientometrics and Metric Mania. Computer, 55(3), 117–123. https://doi.org/10.1109/MC.2022.3142542
Biagioli, M. (2016). Watch out for cheats in citation game. Nature News, 535(7611), 201. https://doi.org/10.1038/535201a
Biagioli, M., & Lippman, A. (Eds.). (2020). Gaming the metrics: Misconduct and manipulation in academic research. MIT Press.
Biagioli, M., & Lippman, A. (2020). Introduction: Metrics and the new ecologies of academic misconduct. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 1–23). MIT Press.
Björk, B. C. (2015). Have the “mega-journals’’ reached the limits to growth? PeerJ, 3, e981. https://doi.org/10.7717/peerj.981
Björk, B. C. (2018). Evolution of the scholarly mega-journal, 2006–2017. PeerJ, 6, e4357. https://doi.org/10.7717/peerj.4357
Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. https://doi.org/10.1002/asi.23329
Butler, D. (2008). Free journal-ranking tool enters citation market. Nature, 451(7174), 6. https://doi.org/10.1038/451006a
Chorus, C., & Waltman, L. (2016). A large-scale analysis of impact factor biased journal self-citations. PLoS One, 11(8), e0161021.
Dwork, C., Kumar, R., & Naor, M., et al. (2001). Rank Aggregation Methods for the Web. In: Proceedings of the Tenth International Conference on the World Wide Web, WWW 10, Hong Kong, China, May 1-5, 2001, pp 613–622, https://doi.org/10.1145/371920.372165
Elmore, S. A., & Weston, E. H. (2020). Predatory journals: What they are and how to avoid them. Toxicologic Pathology, 48(4), 607–610.
Fanelli, D. (2020). Pressures to publish: What effects do we see? In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 111–122). MIT Press.
Fire, M., & Guestrin, C. (2019). Over-optimization of academic publishing metrics: observing Goodhart’s Law in action. GigaScience, 8(6), 1–20. https://doi.org/10.1093/gigascience/giz053
Garfield, E. (1999). Journal impact factor: A brief review. Canadian Medical Association Journal, 161(8), 979–980.
Grudniewicz, A., Moher, D., Cobey, K. D., et al. (2019). Predatory journals: No definition, no defence. Nature, 576(7786), 210–212.
Guaspare, C., & Didier, E. (2020). The Voinnet affair: Testing the norms of scientific image management. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 157–167). MIT Press.
Helmer, S., Blumenthal, D. B., & Paschen, K. (2020). What is meaningful research and how should we measure it? Scientometrics, 125(1), 153–169.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572.
IAP. (2022). Combatting predatory academic journals and conferences (full report in English). The InterAcademy Partnership (IAP). Accessed on May 20, 2022.
Ioannidis, J. P., Klavans, R., & Boyack, K. W. (2018). The scientists who publish a paper every five days. Nature, 561, 167–169. https://doi.org/10.1038/d41586-018-06185-8
Kojaku, S., Livan, G., & Masuda, N. (2021). Detecting anomalous citation groups in journal networks. Scientific Reports, 11(1), 1–11.
Ley, M. (2009). DBLP—Some Lessons Learned. Proceedings of the VLDB Endowment, 2(2), 1493–1500. https://doi.org/10.14778/1687553.1687577
Li, W., Aste, T., Caccioli, F., et al. (2019). Early coauthorship with top scientists predicts success in academic careers. Nature communications, 10(1), 1–9.
Lima, H., Silva, T. H. P., Moro, M. M., et al. (2015). Assessing the profile of top Brazilian computer science researchers. Scientometrics, 103(3), 879–896. https://doi.org/10.1007/s11192-015-1569-7
Oravec, J. A. (2019). The “Dark Side’’ of Academics? Emerging issues in the gaming and manipulation of metrics in higher education. The Review of Higher Education, 42(3), 859–877.
Pan, R. K., Petersen, A. M., Pammolli, F., et al. (2018). The memory of science: Inflation, myopia, and the knowledge network. Journal of Informetrics, 12(3), 656–678.
Perez, O., Bar-Ilan, J., Cohen, R., et al. (2019). The network of law reviews: Citation cartels, scientific communities, and journal rankings. The Modern Law Review, 82(2), 240–268.
Petersen, A. M. (2015). Quantifying the impact of weak, strong, and super ties in scientific careers. Proceedings of the National Academy of Sciences, 112(34), E4671–E4680.
Pinto, Â. P., Mejdalani, G., Mounce, R., et al. (2021). Are publications on zoological taxonomy under attack? Royal Society Open Science, 8(2), 201617.
Sinha, A., Shen, Z., & Song, Y., et al. (2015). An Overview of Microsoft Academic Service (MAS) and Applications. In: Proceedings of the 24th International Conference on the World Wide Web, pp 243–246, https://doi.org/10.1145/2740908.2742839
Sismondo, S. (2020). Ghost-managing and gaming pharmaceutical knowledge. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 123–133). MIT Press.
Spearman, C. (2010). The proof and measurement of association between two things. International Journal of Epidemiology, 39(5), 1137–1150. https://doi.org/10.1093/ije/dyq191
Tang, J., Zhang, J., & Yao, L., et al. (2008). Arnetminer: Extraction and Mining of Academic Social Networks. In: Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, pp 990–998, https://doi.org/10.1145/1401890.1402008
Viegas, F., Canuto, S., & Gomes, C., et al. (2019). CluWords: exploiting semantic word clustering representation for enhanced topic modeling. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp 753–761
Viegas, F., Cunha, W., & Gomes, C., et al. (2020). CluHTM: Semantic hierarchical topic modeling based on CluWords. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, pp 8138–8150, https://doi.org/10.18653/v1/2020.acl-main.724
Viegas, F., Júnior, A. P. D. S., Cecilio, P., et al. (2022). Semantic academic profiler (SAP): A framework for researcher assessment based on semantic topic modeling. Scientometrics, 127(8), 5005–5026. https://doi.org/10.1007/s11192-022-04449-9
Von Bergen, C. W., & Bressler, M. S. (2017). Academe’s unspoken ethical dilemma: Author inflation in higher education. Research in Higher Education Journal, 32.
Wang, K., Shen, Z., Huang, C., et al. (2019). A review of Microsoft Academic Services for science of science studies. Frontiers in Big Data, 2. https://doi.org/10.3389/fdata.2019.00045
Wasserman, L. (2005). All of statistics: A concise course in statistical inference (1st ed.). Springer.
Acknowledgements
This work is partially supported by the authors individual research grants from CAPES, CNPq and FAPEMIG, and by the projects MASWeb and INCT-Cyber.
Ethics declarations
Conflict of interest
The authors have no financial or non-financial interests to disclose.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Moreira, E., Meira, W., Gonçalves, M.A. et al. The rise of hyperprolific authors in computer science: characterization and implications. Scientometrics 128, 2945–2974 (2023). https://doi.org/10.1007/s11192-023-04676-8