Abstract
In this article we study and characterize hyperprolific authors: the most productive researchers in a given repository over a specific period of time. In particular, we investigate and characterize the subset of hyperprolific authors who exhibit a sudden growth in the number of published articles and coauthors, and who concentrate their publications in a few specific journals, which can be seen as anomalous behavior. Using data collected from the DBLP repository and covering the last 10 years, we propose a set of discriminative dimensions (features) aimed at characterizing the behavior of hyperprolific authors, ultimately helping to identify anomalous ones. Moreover, using a strategy based on rank aggregation to identify the most prominent anomalous authors, we show that the dimensions that best characterize such anomalous behavior may vary significantly among authors, but that a clear subset of authors who exhibit it can nevertheless be identified. Our results show that the top-ranked (most anomalous) authors behave quite differently from the middle-ranked ones. Indeed, each of the five most anomalous authors published more than 48 journal articles in 2021 while collaborating with more than 1,000 coauthors over their careers. Notably, one of these authors published more than 140 articles in a single journal.
Notes
http://uis.unesco.org/apps/visualisations/research-and-development-spending, accessed on April 25, 2022.
https://ncses.nsf.gov/pubs/nsb20206/publication-output-by-region-country-or-economy, accessed on April 25, 2022.
http://dblp.org, accessed on May 14, 2022.
https://scholar.google.com, accessed on May 14, 2022.
https://arxiv.org, accessed on May 14, 2022.
https://pubmed.ncbi.nlm.nih.gov, accessed on May 14, 2022.
https://www.aminer.org, accessed on May 15, 2022.
In this article we may use the words “researcher” and “author” to refer to the same person, depending on that person’s role in the issue under discussion.
We computed those numbers based on the dataset we created for our experiments; see Subsection 3.1.
The Empirical Cumulative Distribution Function (ECDF) value for a given point p on the horizontal axis is the fraction of observations of the variable with values less than or equal to p.
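This definition can be computed directly; a minimal sketch in Python, where the publication counts are purely illustrative:

```python
def ecdf(values, p):
    """Fraction of observations less than or equal to p."""
    return sum(v <= p for v in values) / len(values)

# Illustrative data: journal articles per author in some year.
counts = [2, 5, 5, 8, 12, 20, 48]
print(ecdf(counts, 8))  # 4 of the 7 values are <= 8, i.e. 4/7
```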
https://scholar.google.com/intl/en/scholar/citations.html, accessed on May 16, 2022.
https://www.microsoft.com/en-us/research/project/microsoft-academic-graph, accessed on June 25, 2022.
https://jcr.clarivate.com, accessed on June 19, 2022.
https://incites.help.clarivate.com/Content/Indicators-Handbook/ih-journal-impact-factor.htm, accessed on June 25, 2022.
A mega journal is a type of journal in which publishers charge authors, rather than readers, for article publication.
https://predatory-publishing.com/how-many-predatory-journals-are-there, accessed on June 19, 2022.
https://www.interacademies.org, accessed on May 9, 2022.
Five temporal scenarios \(\times\) four temporal metrics \(\times\) two summarizations \(+\) the entropy \(+\) the publication intensity.
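The count follows from the note’s own breakdown; a one-line arithmetic check:

```python
# Feature count, using the numbers stated in the note.
scenarios = 5       # temporal scenarios
metrics = 4         # temporal metrics
summarizations = 2
extras = 2          # entropy + publication intensity
total = scenarios * metrics * summarizations + extras
print(total)  # 42
```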
This is the ECDF value obtained by subtracting 0.31 (31%) from one; it corresponds to the portion of the reference set’s curve to the right of the vertical dotted line.
Recall that we use the term hyperproductive for the set of the most productive researchers of the considered time period.
For the sake of clarity, we limit the graph plot to the 10 top-ranked researchers only. The same reasoning applies to the analysis of the features discussed next.
We consider ties as sharing the same rank.
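Under this convention, tied values share the same (smallest) rank and the next distinct value skips ahead accordingly. A stdlib-only sketch with invented anomaly scores:

```python
def min_ranks(scores):
    """Descending ranks where tied scores share the same (minimum) rank."""
    first_position = {}
    for i, s in enumerate(sorted(scores, reverse=True), start=1):
        first_position.setdefault(s, i)  # keep the first (smallest) position
    return [first_position[s] for s in scores]

# Two authors tied at 0.7 both get rank 2; the next author gets rank 4.
print(min_ranks([0.9, 0.7, 0.7, 0.4]))  # [1, 2, 2, 4]
```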
https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm, accessed on January 1, 2022.
All resources needed to reproduce our experiments, including our source code and dataset, are available at https://github.com/edreqm/raise-of-hiperprolific.
We take the number of working days in 2021 as a reference and use the same number for all 10 years: from the 365 days of the year we exclude the 104 weekend days. To be conservative, we do not exclude holidays.
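The resulting reference figure follows directly from the numbers in the note:

```python
# Working days in the reference year: 365 days minus the 104 weekend
# days; holidays are deliberately kept (conservative choice).
working_days_per_year = 365 - 104
years = 10
print(working_days_per_year)          # 261
print(working_days_per_year * years)  # 2610 working days over the decade
```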
To compute the topics, we first concatenate all of an author’s publication titles to create one document per author. Then, we eliminate all adverbs, non-English words, and stop words. Finally, we apply the CluWords algorithm to discover the 12 topics and the top 10 words describing each of them. We chose 12 topics to match the 12 Computer Science subfields defined in https://en.wikipedia.org/wiki/Outline_of_computer_science (accessed on December 11, 2022).
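A minimal sketch of the preprocessing step only (the titles and the tiny stop-word list are illustrative, and the CluWords topic-modeling step itself is not reproduced here):

```python
# Build one "document" per author by concatenating publication titles,
# then drop stop words. The paper additionally removes adverbs and
# non-English words, which this sketch omits.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def author_document(titles):
    tokens = " ".join(titles).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

titles = ["A Survey of Topic Models", "Topic Modeling for Short Texts"]
print(author_document(titles))
# ['survey', 'topic', 'models', 'topic', 'modeling', 'short', 'texts']
```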
https://pubmed.ncbi.nlm.nih.gov, accessed on June 27, 2022.
References
Antkare, I. (2020). Ike Antkare, his publications, and those of his disciples. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 177–200). MIT Press.
Berghel, H. (2022). A Collapsing Academy, Part III: Scientometrics and Metric Mania. Computer, 55(3), 117–123. https://doi.org/10.1109/MC.2022.3142542
Biagioli, M. (2016). Watch out for cheats in citation game. Nature News, 535(7611), 201. https://doi.org/10.1038/535201a
Biagioli, M., & Lippman, A. (Eds.). (2020). Gaming the metrics: Misconduct and manipulation in academic research. MIT Press.
Biagioli, M., & Lippman, A. (2020). Introduction: Metrics and the new ecologies of academic misconduct. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 1–23). MIT Press.
Björk, B. C. (2015). Have the “mega-journals’’ reached the limits to growth? PeerJ, 3, e981. https://doi.org/10.7717/peerj.981
Björk, B. C. (2018). Evolution of the scholarly mega-journal, 2006–2017. PeerJ, 6, e4357. https://doi.org/10.7717/peerj.4357
Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222. https://doi.org/10.1002/asi.23329
Butler, D. (2008). Free journal-ranking tool enters citation market. Nature, 451(7174), 6. https://doi.org/10.1038/451006a
Chorus, C., & Waltman, L. (2016). A large-scale analysis of impact factor biased journal self-citations. PLoS One, 11(8), e0161021.
Dwork, C., Kumar, R., & Naor, M., et al. (2001). Rank Aggregation Methods for the Web. In: Proceedings of the Tenth International Conference on the World Wide Web, WWW 10, Hong Kong, China, May 1-5, 2001, pp 613–622, https://doi.org/10.1145/371920.372165
Elmore, S. A., & Weston, E. H. (2020). Predatory journals: What they are and how to avoid them. Toxicologic Pathology, 48(4), 607–610.
Fanelli, D. (2020). Pressures to publish: What effects do we see? In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 111–122). MIT Press.
Fire, M., & Guestrin, C. (2019). Over-optimization of academic publishing metrics: observing Goodhart’s Law in action. GigaScience, 8(6), 1–20. https://doi.org/10.1093/gigascience/giz053
Garfield, E. (1999). Journal impact factor: A brief review. Canadian Medical Association Journal, 161(8), 979–980.
Grudniewicz, A., Moher, D., Cobey, K. D., et al. (2019). Predatory journals: No definition, no defence. Nature, 576(7786), 210–212.
Guaspare, C., & Didier, E. (2020). The Voinnet affair: Testing the norms of scientific image management. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 157–167). MIT Press.
Helmer, S., Blumenthal, D. B., & Paschen, K. (2020). What is meaningful research and how should we measure it? Scientometrics, 125(1), 153–169.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572.
IAP. (2022). Combatting predatory academic journals and conferences (full report in English). The InterAcademy Partnership (IAP). Accessed on May 20, 2022.
Ioannidis, J. P., Klavans, R., & Boyack, K. W. (2018). The scientists who publish a paper every five days. Nature, 561, 167–169. https://doi.org/10.1038/d41586-018-06185-8
Kojaku, S., Livan, G., & Masuda, N. (2021). Detecting anomalous citation groups in journal networks. Scientific Reports, 11(1), 1–11.
Ley, M. (2009). DBLP—Some Lessons Learned. Proceedings of the VLDB Endowment, 2(2), 1493–1500. https://doi.org/10.14778/1687553.1687577
Li, W., Aste, T., Caccioli, F., et al. (2019). Early coauthorship with top scientists predicts success in academic careers. Nature communications, 10(1), 1–9.
Lima, H., Silva, T. H. P., Moro, M. M., et al. (2015). Assessing the profile of top Brazilian computer science researchers. Scientometrics, 103(3), 879–896. https://doi.org/10.1007/s11192-015-1569-7
Oravec, J. A. (2019). The “Dark Side’’ of Academics? Emerging issues in the gaming and manipulation of metrics in higher education. The Review of Higher Education, 42(3), 859–877.
Pan, R. K., Petersen, A. M., Pammolli, F., et al. (2018). The memory of science: Inflation, myopia, and the knowledge network. Journal of Informetrics, 12(3), 656–678.
Perez, O., Bar-Ilan, J., Cohen, R., et al. (2019). The network of law reviews: Citation cartels, scientific communities, and journal rankings. The Modern Law Review, 82(2), 240–268.
Petersen, A. M. (2015). Quantifying the impact of weak, strong, and super ties in scientific careers. Proceedings of the National Academy of Sciences, 112(34), E4671–E4680.
Pinto, Â. P., Mejdalani, G., Mounce, R., et al. (2021). Are publications on zoological taxonomy under attack? Royal Society Open Science, 8(2), 201617.
Sinha, A., Shen, Z., & Song, Y., et al. (2015). An Overview of Microsoft Academic Service (MAS) and Applications. In: Proceedings of the 24th International Conference on the World Wide Web, pp 243–246, https://doi.org/10.1145/2740908.2742839
Sismondo, S. (2020). Ghost-managing and gaming pharmaceutical knowledge. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics: Misconduct and manipulation in academic research (pp. 123–133). MIT Press.
Spearman, C. (2010). The proof and measurement of association between two things. International Journal of Epidemiology, 39(5), 1137–1150. https://doi.org/10.1093/ije/dyq191
Tang, J., Zhang, J., & Yao, L., et al. (2008). Arnetminer: Extraction and Mining of Academic Social Networks. In: Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, pp 990–998, https://doi.org/10.1145/1401890.1402008
Viegas, F., Canuto, S., & Gomes, C., et al. (2019). CluWords: exploiting semantic word clustering representation for enhanced topic modeling. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp 753–761
Viegas, F., Cunha, W., & Gomes, C., et al. (2020). CluHTM: Semantic hierarchical topic modeling based on CluWords. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, pp 8138–8150, https://doi.org/10.18653/v1/2020.acl-main.724
Viegas, F., Júnior, A. P. D. S., Cecilio, P., et al. (2022). Semantic academic profiler (SAP): A framework for researcher assessment based on semantic topic modeling. Scientometrics, 127(8), 5005–5026. https://doi.org/10.1007/s11192-022-04449-9
Von Bergen, C. W., & Bressler, M. S. (2017). Academe’s unspoken ethical dilemma: Author inflation in higher education. Research in Higher Education Journal, 32.
Wang, K., Shen, Z., Huang, C., et al. (2019). A review of Microsoft Academic Services for science of science studies. Frontiers in Big Data, 2. https://doi.org/10.3389/fdata.2019.00045
Wasserman, L. (2005). All of statistics: A concise course in statistical inference (1st ed.). Springer.
Acknowledgements
This work is partially supported by the authors individual research grants from CAPES, CNPq and FAPEMIG, and by the projects MASWeb and INCT-Cyber.
Ethics declarations
Conflict of interest
The authors have no financial or non-financial interests to disclose.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Moreira, E., Meira, W., Gonçalves, M.A. et al. The rise of hyperprolific authors in computer science: characterization and implications. Scientometrics 128, 2945–2974 (2023). https://doi.org/10.1007/s11192-023-04676-8