Skip to main content
Log in

A refinement strategy for identification of scientific software from bioinformatics publications

  • Published:
Scientometrics Aims and scope Submit manuscript

Abstract

In the field of bioinformatics, a large number of classical software becomes a necessary research tool. To measure the influence of scientific software as one kind of important intellectual products, a few strategies have been proposed to identify the software names from full texts of papers to collect the usage data of packages in bioinformatics research. However, the performance of these strategies is limited because of the highly imbalance of data in the full texts. This study proposes EnsembleSVMs-CRF, a two-step refinement strategy based on ensemble learning that gradually increases the sentences that contain software mentions to improve the performance of named entity recognition. The experiment on the bioinformatics corpus shows that the performance of EnsembleSVMs-CRF, in terms of the local F1 (78.81%) and the global F1-A (73.49%), is superior to the rule-based bootstrapping method and direct CRF. Application of this strategy to the articles published between 2013 and 2017 in 27 bioinformatics journals extracted 8,239 unique packages. The most popular 50 packages thus identified demonstrate that most of them are professional software which generally requires inter-discipline knowledge, rather than programming skill. Meanwhile, we found that researchers in bioinformatics tend to use free scientific software, and the application of general software is increasing compared with professional software.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  • Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. International Journal of Machine Learning and Computing, 4(3), 300–306.

    Article  Google Scholar 

  • Aryani, A., Poblet, M., Unsworth, K., Wang, J., Evans, B., Devaraju, A., Brigitte, H., Claus-Peter, K., Benjamin, Z., & Samuele, K. (2018). A research graph dataset for connecting research data repositories using RD-Switchboard. Scientific Data, 5, 180099.

    Article  Google Scholar 

  • Bertin, M., Atanassova, I., Lariviere, V., & Gingras, Y. (2013). The distribution of references in scientific papers: An analysis of the IMRaD structure. In Proceedings of the international conference on scientometrics and informetrics (pp. 591–603), Vienna, Austria.

  • Borgman, C. L., Wallis, J., & Mayernik, M. (2012). Who’s got the data? Interdependencies in science and technology collaborations. Computer Supported Cooperative Work, 21(6), 485–523.

    Article  Google Scholar 

  • Boudjellal, N., Zhang, H., Khan, A., Ahmad, A., & Dai, L. (2021). Abioner: A bert-based model for arabic biomedical named-entity recognition. Complexity, 3, 1–6.

    Article  Google Scholar 

  • Bressan, B. (2013). The SciencePAD treasure hunt of persistent identifiers. CERN Bulletin.

  • Chassanoff, A., & Altman, M. (2019). Curation as “Interoperability with the Future”: Preserving scholarly research software in academic libraries. Journal of the Association for Information Science and Technology, 71(3), 325–337.

    Article  Google Scholar 

  • Chen, L., & Davidson, S. B. (2020). Automating software citation using gitcite. In 2020 IEEE 36th international conference on data engineering (ICDE) (pp.1754–1757). Texas, USA.

  • Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P. M., Zietz, M., Hoffman, M. M., Xie, W., Rosen, G. L., Lengerich, B. J., Israeli, J., Lanchantin, J., Woloszynek, S., Carpenter, A. E., Shrikumar, A., Xu, J., … Greene, C. S. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface, 15(141), 20170387.

    Article  Google Scholar 

  • Chiticariu, L., Li, Y., & Reiss, F. (2013). Rule-based information extraction is dead! long live rule-based information extraction systems! In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 827–832). Washington, USA.

  • Cho, M., Ha, J., Park, C., & Park, S. (2020). Combinatorial feature embedding based on cnn and lstm for biomedical named entity recognition. Journal of Biomedical Informatics, 103, 103381.

    Article  Google Scholar 

  • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(76), 2493–2537.

    MATH  Google Scholar 

  • Cosmo, R. D. (2020). Announcing biblatex-software: Software citation made easy. Software Engineering Notes, 45(4), 22–23.

    Article  Google Scholar 

  • Devi, G. R., Kumar, M. A., & Soman, K. P. (2019). Extraction of named entities from social media text in tamil language using N-gram embedding for disaster management. Nature-Inspired Computation in Data Mining and Machine Learning, 855, 207–223.

    Google Scholar 

  • Dong, C., Zhang, J., Zong, C., Hattori, M., & Di, H. (2016). Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In Natural language understanding and intelligent applications (pp. 239–250). Springer.

  • Farquad, M. A. H., & Bose, I. (2012). Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53(1), 226–233.

    Article  Google Scholar 

  • Ghosh, S., Matsuoka, Y., Asai, Y., Hsin, K. Y., & Kitano, H. (2011). Software for systems biology: From tools to integrated platforms. Nature Reviews Genetics, 12(12), 821–832.

    Article  Google Scholar 

  • Goble, C. (2014). Better software, better research. IEEE Internet Computing, 18(5), 4–8.

    Article  Google Scholar 

  • Goyala, A., Guptab, V., & Kumarc, M. (2018). Recent named entity recognition and classification techniques: A systematic review. Computer Science Review, 29, 21–43.

    Article  Google Scholar 

  • Gridach, M. (2017). Character-level neural network for biomedical named entity recognition. Journal of Biomedical Informatics, 70, 85–91.

    Article  Google Scholar 

  • Hakala, K., Pyysalo, S. (2019). Biomedical Named Entity Recognition with Multilingual BERT. Association for Computational Linguistics, In Proceedings of the 5th workshop on BioNLP open shared tasks (pp. 56–61). Hong Kong, China.

  • Hemati, W., & Mehler, A. (2019). Lstmvoter: Chemical named entity recognition using a conglomerate of sequence labeling tools. Journal of Cheminformatics, 11, 3.

    Article  Google Scholar 

  • Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2), 107–116.

    Article  MathSciNet  MATH  Google Scholar 

  • Howison, J., & Herbsleb, J. D. (2011). Scientific software production: incentives and collaboration. In Proceedings of the 2011 ACM conference on computer supported cooperative work (pp. 513–522). Hangzhou, China.

  • Howison, J., & Bullard, J. (2016). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science & Technology, 67(9), 2137–2155.

    Article  Google Scholar 

  • Howison, J., Deelman, E., Mc Lennan, M. J., et al. (2015). Understanding the scientific software ecosystem and its impact: Current and future measures. Research Evaluation, 24(4), 454–470.

    Article  Google Scholar 

  • Hsu, B. M. (2020). Comparison of supervised classification models on textual data. Mathematics, 8(5), 851.

    Article  Google Scholar 

  • Jackson, M. (2012). How to cite and describe software. Retrieved December 7, 2021, from https://www.software.ac.uk/how-cite-software.

  • Katz, D. S., Bouquin, D., Hong, N., Hausman, J., Jones, C., & Chivvis, D., et al. (2019a). Software citation implementation challenges. arXiv, 1905.08674.

  • Katz, D. S., McInnes, L. C., Bernholdt, D. E., Mayes, A. C., Hong, N. P. C., Duckles, J., Gesing, S., Heroux, M. A., Hettrick, S., Jimenez, R. C., Pierce, M., Weaver, B., & Wilkins-Diehr, N. (2019b). Community organizations: Changing the culture in which research software is developed and sustained. Computing in Science & Engineering, 21(2), 8–24.

    Article  Google Scholar 

  • Katz, D. S., Hong, N., Clark, T., Muench, A., & Yeston, J. (2020). The importance of software citation. F1000 Research, 9, 1257.

  • Kristina, T., Dan, K., Christopher, M., & Yoram, S. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL (pp. 252–259). Edmonton, Canada.

  • Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 260–270). California, USA.

  • Le, T. A., Arkhipov, M. Y., & Burtsev, M. S. (2017). Application of a hybrid Bi-LSTM-CRF model to the task of Russian named entity recognition. In Conference on artificial intelligence and natural language (pp. 91–103). Petersburg, Russia.

  • Leroy, D., Sallou, J., Bourcier, J., & Combemale, B. (2021). When scientific software meets software engineering. Computer, 54(12), 60–71.

    Article  Google Scholar 

  • Li, J., Sun, A., & Joty, S. R. (2018). SegBot: A generic neural text segmentation model with pointer network. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (pp. 4166–4172). Stockholm, Sweden.

  • Li, K., Chen, P. Y., & Yan, E. (2019). Challenges of measuring software impact through citations: An examination of the lme4 R package. Journal of Informetrics, 13(1), 449–461.

    Article  Google Scholar 

  • Li, K., Yan, E., & Feng, Y. (2017). How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics, 11(4), 989–1002.

    Article  Google Scholar 

  • Lin, F., & Xie, D. (2020). Research on named entity recognition of traditional Chinese medicine electronic medical records. In Proceedings of ninth international conference on health information science (pp.61–67). Amsterdam and Leiden, Netherlands.

  • Liu, P., Choo, K. K. R., Wang, L., & Huang, F. (2017). SVM or deep learning? A comparative study on remote sensing image classification. Soft Computing, 21(23), 7053–7065.

    Article  Google Scholar 

  • Luo, L., Yang, Z., Yang, P., Zhang, Y., Wang, L., Lin, H., & Wang, J. (2018). An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics, 34(8), 1381–1388.

    Article  Google Scholar 

  • Löffler, F., Brandt, S. R., Allen, G., & Schnetter, E. (2014). Cactus: Issues for sustainable simulation software. Journal of Open Research Software, 2(1), e12.

    Article  Google Scholar 

  • Marcot, B. G., & Hanea, A. M. (2021). What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis. Computational Statistics, 36(3), 2009–2031.

    Article  MathSciNet  MATH  Google Scholar 

  • Marx, V. (2013). Biology: The big challenges of big data. Nature, 498(7453), 255–260.

    Article  Google Scholar 

  • Mikolov, T., Karafiat, M., Burget, L., Cernock, J., & Khudanpur, S. (2010). Recurrent neural network-based language model. In Proceedings of eleventh annual conference of the international speech communication association (pp.1045–1048). Chiba, Japan.

  • Na, S. H., Kim, H., Min, J., & Kim, K. (2019). Improving LSTM CRFs using character-based compositions for Korean named entity recognition. Computer Speech & Language, 54, 106–121.

    Article  Google Scholar 

  • Nandar, T. L., Soe, T. L., & Soe, K. M. (2020). A comparative study of named entity recognition on myanmar language. In Proceedings of 23rd conference of the oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (pp. 60–64). Yangon, Myanmar.

  • Nguyen, T., Nguyen, D., & Rao, P. (2003). Adaptive name entity recognition under highly unbalanced data. arXiv preprint, 10296.

  • Ordua-Malea, E., & Costas, R. (2021). Link-based approach to study scientific software usage: The case of VOSviewer. Scientometrics, 126, 8153–8186.

    Article  Google Scholar 

  • Pan, X. L., Yan, E., Wang, Q. Q., & Hua, W. N. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871.

    Article  Google Scholar 

  • Pan, X., Yan, E., & Hua, W. (2016). Disciplinary differences of software use and impact in scientific literature. Scientometrics, 109(3), 1–18.

    Article  Google Scholar 

  • Park, H., & Wolfram, D. (2019). Research software citation in the data citation index: Current practices and implications for research software sharing and reuse. Journal of Informetrics, 13(2), 574–582.

    Article  Google Scholar 

  • Piwowar, H. (2013). Altmetrics: Value all research products. Nature, 493(7431), 159–159.

    Article  Google Scholar 

  • Rais, M., Lachkar, A., Lachkar A, & Ouatik, S. E. A. (2014). A comparative study of biomedical named entity recognition methods based machine learning approach. In Proceedings of 3rd IEEE international colloquium on information science and technology (pp. 329–334). Tetouan, Morocco.

  • Rau, L. F. (1991). Extracting company names from text. In Proceedings of the seventh IEEE conference on artificial intelligence application (pp. 29–32). FL, USA.

  • Shaalan, K., & Oudah, M. (2014). A hybrid approach to Arabic named entity recognition. Journal of Information Science, 40(1), 67–87.

    Article  Google Scholar 

  • Smith, A. M., Katz, D. S., & Niemeyer, K. E. (2016). Software citation principles. PeerJ, 2, e86.

    Google Scholar 

  • Soito, L. & Hwang, L. J, (2016). Citations for Software: Providing Identification Access and Recognition for Research Software. International Journal of Digital Curation, 11(2), 48–63.

  • Sollaci, L. B., & Pereira, M. G. (2004). The introduction, methods, results, and discussion (IMRAD) structure: A 50-year survey. Journal of the Medical Library Association, 92(3), 364–371.

    Google Scholar 

  • Sundheim, B. M. (1995). Overview of results of the MUC-6 evaluation. In Proceedings of the 6th conference on message understanding (pp. 13–31). Maryland, USA.

  • Thelwall, M., & Kousha, K. (2016). Academic software downloads from Google code. Information Research, 21(1), n1.

    Google Scholar 

  • Ukov-Gregori, A., Bachrach, Y., & Coope, S. (2018). Named Entity Recognition with Parallel Recurrent Neural Networks. In Proceedings of the 56th annual meeting of the association for computational linguistics (pp. 69–74). Melbourne, Australia.

  • Wang, H. B., Gao, H. K., Shen, Q., & Xian, Y. (2019). Thai language names, place names and organization names entity recognition. Journal of System Simulation, 31(5), 1010–1018.

    Google Scholar 

  • Wang, S. J., Mathew, A., Chen, Y., Xi, L. F., Ma, L., & Lee, J. (2009). Empirical analysis of support vector machine ensemble classifiers. Expert Systems with Applications, 36(3), 6466–6476.

    Article  Google Scholar 

  • Wu, J. (2011). Improving the writing of research papers: IMRAD and beyond. Landscape Ecology, 26(10), 1345–1349.

    Article  Google Scholar 

  • Yang, B., Rousseau, R., Wang, X., & Huang, S. (2018). How important is scientific software in bioinformatics research? A comparative study between international and Chinese research communities. Journal of the Association for Information Science and Technology, 69(9), 1122–1133.

    Article  Google Scholar 

  • Zeng, D., Sun, C., Lin, L., & Liu, B. (2017). LSTM-CRF for drug-named entity recognition. Entropy, 19(6), 283.

    Article  Google Scholar 

  • Zhang, Y. C., Liu, J. Y., Liu, J., Sheng, J., & Lv, J. W. (2018). EEG recognition of motor imagery based on SVM ensemble. In Proceedings of the 5th international conference on systems and informatics (pp. 866–870). Nanjing, China.

  • Zhou, J. T., Zhang, H., Jin, D., Peng, X., Xiao, Y., & Cao, Z. (2019). Roseq: Robust sequence labeling. IEEE Transactions on Neural Networks and Learning Systems, 31(7), 2304–2314.

    MathSciNet  Google Scholar 

  • Zhu, F., & Shen, B. (2012). Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing. PLoS ONE, 7(6), e39230.

    Article  Google Scholar 

  • Zitnik, M., Nguyen, F., Wang, B., Leskovec, J., Goldenberg, A., & Hoffman, M. M. (2019). Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion, 50, 71–91.

    Article  Google Scholar 

  • Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., & Telenti, A. (2019). A primer on deep learning in genomics. Nature Genetics, 51(1), 12–18.

    Article  Google Scholar 

Download references

Acknowledgements

This study is supported by the National Social Science Fund of China under Grant No. 18BTQ077. The authors would like to thank two anonymous reviewers for their great suggestions.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bo Yang.

Ethics declarations

Conflict of interest

The author has no conflicts of interest to declare that are relevant to the content of this article.

Appendices

Appendix A

The top 10 boundary words.

Id

Boundary words

1

Use

2

Software

3

Package

4

Program

5

Tool

6

Analysis

7

Search

8

Version

9

Script

10

Result

Appendix B

Scores of the top 50 packages (ranked by document frequencies).

Software

DF

CitedS

CitedP

BLAST

3137

60123

19

Primer

2932

52625

18

SAM

1517

37469

25

Cooper

1470

33849

23

SPSS

1268

15654

12

QTL

1219

25041

21

Excel

1195

21659

18

MEGA

1024

17225

17

BWA

837

24933

30

Clustal

817

12479

15

IPA

772

15396

20

GCTA

732

12552

17

DAVID

697

14043

20

SAS

669

11536

17

BAM

638

18503

29

PAM

635

16710

26

Blast2GO

632

12068

19

GraphPad Prism

630

11374

18

Heatmap

621

14299

23

ImageJ

595

11366

19

Python

590

14930

25

Cufflinks

582

16966

29

Genome Browser

572

16188

28

RMA

570

12294

22

BLASTN

549

12392

23

Trinity

548

10796

20

DESeq

527

11053

21

BLASTP

482

9849

20

edgeR

479

9624

20

Picard

474

11944

25

RepeatMasker

458

13067

29

Cytoscape

453

7962

18

dChip

453

13740

30

Java

441

10471

24

ClustalW

419

6426

15

GATK

417

11180

27

Primer3

415

6964

17

R-project

414

10871

26

MCL

413

11046

27

Scaffold

405

11590

29

FastQC

404

7343

18

MUSCLE

398

8986

23

HapMap

385

9661

25

MEME

371

7645

21

SAP

364

6730

18

LightCycler

362

6161

17

Bowtie2

338

6672

20

InterProScan

336

8008

24

IGV

313

7075

23

Velvet

312

9571

31

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jiang, L., Kang, X., Huang, S. et al. A refinement strategy for identification of scientific software from bioinformatics publications. Scientometrics 127, 3293–3316 (2022). https://doi.org/10.1007/s11192-022-04381-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11192-022-04381-y

Keywords

Navigation