Abstract
In bioinformatics, a large number of classical software packages have become necessary research tools. To measure the influence of scientific software as an important kind of intellectual product, a few strategies have been proposed that identify software names in the full texts of papers in order to collect usage data for packages in bioinformatics research. However, the performance of these strategies is limited because the data in full texts are highly imbalanced. This study proposes EnsembleSVMs-CRF, a two-step refinement strategy based on ensemble learning that gradually increases the proportion of sentences containing software mentions in order to improve the performance of named entity recognition. An experiment on a bioinformatics corpus shows that EnsembleSVMs-CRF outperforms both the rule-based bootstrapping method and a direct CRF in terms of local F1 (78.81%) and global F1-A (73.49%). Applying the strategy to articles published between 2013 and 2017 in 27 bioinformatics journals extracted 8,239 unique packages. The 50 most popular packages thus identified are mostly professional software, which generally requires interdisciplinary knowledge rather than programming skill. We also found that researchers in bioinformatics tend to use free scientific software, and that the use of general software is increasing relative to professional software.
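
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword voters below are hypothetical stand-ins for trained SVMs, the capitalised-token tagger is a hypothetical stand-in for the CRF, and every function name, cue word, and vote threshold is an assumption made for illustration. The sketch only shows how a voting filter can raise the share of mention-bearing sentences before a tagger runs.

```python
from typing import Callable, List

Voter = Callable[[str], bool]

def tokenize(sentence: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation."""
    return [t.strip(".,;()") for t in sentence.split()]

def make_keyword_voter(cues: List[str]) -> Voter:
    """Hypothetical stand-in for one trained SVM: votes True if a cue word occurs."""
    cue_set = {c.lower() for c in cues}
    def vote(sentence: str) -> bool:
        return any(tok.lower() in cue_set for tok in tokenize(sentence))
    return vote

def ensemble_filter(sentences: List[str], voters: List[Voter], min_votes: int) -> List[str]:
    """Step 1: keep only sentences flagged by at least `min_votes` voters."""
    return [s for s in sentences if sum(v(s) for v in voters) >= min_votes]

def tag_candidates(sentence: str) -> List[str]:
    """Step 2 (stand-in for the CRF): return non-initial tokens containing an uppercase letter."""
    tokens = tokenize(sentence)
    return [t for i, t in enumerate(tokens) if i > 0 and any(ch.isupper() for ch in t)]

# Toy corpus: two sentences mention software, two do not.
sentences = [
    "Sequences were aligned using the BLAST program.",
    "Samples were collected in spring.",
    "Statistical analysis was performed with SPSS software.",
    "The results are shown below.",
]

# Assumed cue lists, one per voter.
voters = [
    make_keyword_voter(["software", "program", "package"]),
    make_keyword_voter(["using", "with", "version"]),
    make_keyword_voter(["analysis", "tool", "script"]),
]

kept = ensemble_filter(sentences, voters, min_votes=2)
mentions = [name for s in kept for name in tag_candidates(s)]
print(kept)
print(mentions)
```

In this toy run the filter discards the two sentences without software mentions, so the tagger sees a far less imbalanced input than the raw corpus.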
Acknowledgements
This study is supported by the National Social Science Fund of China under Grant No. 18BTQ077. The authors would like to thank two anonymous reviewers for their great suggestions.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Appendices
Appendix A
The top 10 boundary words.
Id | Boundary words |
---|---|
1 | Use |
2 | Software |
3 | Package |
4 | Program |
5 | Tool |
6 | Analysis |
7 | Search |
8 | Version |
9 | Script |
10 | Result |
Appendix B
Scores of the top 50 packages, ranked by document frequency (DF).
Software | DF | CitedS | CitedP |
---|---|---|---|
BLAST | 3137 | 60123 | 19 |
Primer | 2932 | 52625 | 18 |
SAM | 1517 | 37469 | 25 |
Cooper | 1470 | 33849 | 23 |
SPSS | 1268 | 15654 | 12 |
QTL | 1219 | 25041 | 21 |
Excel | 1195 | 21659 | 18 |
MEGA | 1024 | 17225 | 17 |
BWA | 837 | 24933 | 30 |
Clustal | 817 | 12479 | 15 |
IPA | 772 | 15396 | 20 |
GCTA | 732 | 12552 | 17 |
DAVID | 697 | 14043 | 20 |
SAS | 669 | 11536 | 17 |
BAM | 638 | 18503 | 29 |
PAM | 635 | 16710 | 26 |
Blast2GO | 632 | 12068 | 19 |
GraphPad Prism | 630 | 11374 | 18 |
Heatmap | 621 | 14299 | 23 |
ImageJ | 595 | 11366 | 19 |
Python | 590 | 14930 | 25 |
Cufflinks | 582 | 16966 | 29 |
Genome Browser | 572 | 16188 | 28 |
RMA | 570 | 12294 | 22 |
BLASTN | 549 | 12392 | 23 |
Trinity | 548 | 10796 | 20 |
DESeq | 527 | 11053 | 21 |
BLASTP | 482 | 9849 | 20 |
edgeR | 479 | 9624 | 20 |
Picard | 474 | 11944 | 25 |
RepeatMasker | 458 | 13067 | 29 |
Cytoscape | 453 | 7962 | 18 |
dChip | 453 | 13740 | 30 |
Java | 441 | 10471 | 24 |
ClustalW | 419 | 6426 | 15 |
GATK | 417 | 11180 | 27 |
Primer3 | 415 | 6964 | 17 |
R-project | 414 | 10871 | 26 |
MCL | 413 | 11046 | 27 |
Scaffold | 405 | 11590 | 29 |
FastQC | 404 | 7343 | 18 |
MUSCLE | 398 | 8986 | 23 |
HapMap | 385 | 9661 | 25 |
MEME | 371 | 7645 | 21 |
SAP | 364 | 6730 | 18 |
LightCycler | 362 | 6161 | 17 |
Bowtie2 | 338 | 6672 | 20 |
InterProScan | 336 | 8008 | 24 |
IGV | 313 | 7075 | 23 |
Velvet | 312 | 9571 | 31 |
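
The columns CitedS and CitedP are not defined in this section. The figures in the table are consistent with CitedP being CitedS divided by DF, rounded to the nearest integer. That reading is an inference from the numbers, not a definition given here, and the helper name `cited_per_paper` is hypothetical. A short check on a few rows copied from the table:

```python
# (software, DF, CitedS, CitedP) rows copied from the table above
rows = [
    ("BLAST", 3137, 60123, 19),
    ("SPSS", 1268, 15654, 12),
    ("BWA", 837, 24933, 30),
    ("Cytoscape", 453, 7962, 18),
    ("Velvet", 312, 9571, 31),
]

def cited_per_paper(df: int, cited_s: int) -> int:
    """Assumed relationship: CitedS divided by DF, rounded to the nearest integer."""
    return round(cited_s / df)

# Collect any row where the assumed relationship does not reproduce CitedP.
mismatches = [name for name, df, cited_s, cited_p in rows
              if cited_per_paper(df, cited_s) != cited_p]
print(mismatches)  # [] for the rows listed here
```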
Cite this article
Jiang, L., Kang, X., Huang, S. et al. A refinement strategy for identification of scientific software from bioinformatics publications. Scientometrics 127, 3293–3316 (2022). https://doi.org/10.1007/s11192-022-04381-y
DOI: https://doi.org/10.1007/s11192-022-04381-y