A refinement strategy for identification of scientific software from bioinformatics publications

Jiang, Lu; Kang, Xinyu; Huang, Shan; Yang, Bo

doi:10.1007/s11192-022-04381-y

Published: 03 May 2022

Scientometrics, Volume 127, pages 3293–3316 (2022)

Lu Jiang^1,2, Xinyu Kang^3, Shan Huang^4 & Bo Yang^1,5 (ORCID: orcid.org/0000-0003-1903-6292)

Abstract

In bioinformatics, many classical software packages have become essential research tools. To measure the influence of scientific software as an important kind of intellectual product, several strategies have been proposed that identify software names in the full texts of papers and thereby collect usage data on packages in bioinformatics research. However, the performance of these strategies is limited by the severe data imbalance in full texts. This study proposes EnsembleSVMs-CRF, a two-step refinement strategy based on ensemble learning that gradually increases the share of sentences that contain software mentions in order to improve the performance of named entity recognition. An experiment on a bioinformatics corpus shows that EnsembleSVMs-CRF outperforms both the rule-based bootstrapping method and direct CRF in terms of local F1 (78.81%) and global F1-A (73.49%). Applying this strategy to articles published between 2013 and 2017 in 27 bioinformatics journals extracted 8,239 unique packages. The 50 most popular packages identified are mostly professional software that generally requires interdisciplinary knowledge rather than programming skill. We also found that researchers in bioinformatics tend to use free scientific software, and that the use of general software is increasing relative to professional software.
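The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three voter functions are toy, hypothetical stand-ins for trained SVM classifiers, `toy_tagger` is a stand-in for the CRF, and the example sentences are invented. The sketch only shows the control flow: an ensemble votes on each sentence, sentences without a majority are dropped, and the tagger then runs on a smaller set in which software mentions are less rare.

```python
from typing import Callable, List

Voter = Callable[[str], bool]

# Toy stand-ins for trained SVM sentence classifiers (hypothetical rules).
def has_cue_word(s: str) -> bool:
    return any(w in s.lower() for w in ("software", "package", "version"))

def has_capitalised_token(s: str) -> bool:
    # True when any token after the first contains an upper-case letter.
    return any(any(c.isupper() for c in tok) for tok in s.split()[1:])

def has_usage_verb(s: str) -> bool:
    return any(w in s.lower() for w in ("using", "used"))

def majority_vote(voters: List[Voter], s: str) -> bool:
    # Keep a sentence only if more than half of the voters accept it.
    votes = sum(v(s) for v in voters)
    return votes * 2 > len(voters)

def refine(sentences: List[str], voters: List[Voter]) -> List[str]:
    # Step 1: filter out sentences unlikely to contain a software mention.
    return [s for s in sentences if majority_vote(voters, s)]

def toy_tagger(s: str) -> List[str]:
    # Step 2 stand-in for the CRF: return all-upper-case tokens of length >= 3.
    tokens = [tok.strip(".,()") for tok in s.split()]
    return [t for t in tokens if t.isupper() and len(t) >= 3]

sentences = [
    "Sequences were aligned using BLAST version 2.2.",
    "The samples were collected in spring.",
    "Gene expression differed between groups.",
    "Statistical tests were run in the SPSS package.",
    "Patients used a daily diary.",
]
voters = [has_cue_word, has_capitalised_token, has_usage_verb]

kept = refine(sentences, voters)
mentions = [name for s in kept for name in toy_tagger(s)]
print(f"{len(kept)} of {len(sentences)} sentences kept; mentions: {mentions}")
```

In this toy run, the last sentence receives a single vote and is dropped, which is the point of voting: one weak signal is not enough to pass a sentence on to the more expensive tagging step.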

Acknowledgements

This study is supported by the National Social Science Fund of China under Grant No. 18BTQ077. The authors would like to thank two anonymous reviewers for their great suggestions.

Author information

Authors and Affiliations

College of Information Management, Nanjing Agricultural University, Nanjing, China
Lu Jiang & Bo Yang
Scientometrics & Evaluation Research Center (SERC), Chengdu Library and Information Center of Chinese Academy of Sciences, Chengdu, China
Lu Jiang
College of Management Science, Chengdu University of Technology, Chengdu, China
Xinyu Kang
School of Information Management, Sun Yat-Sen University, Guangzhou, China
Shan Huang
Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing, China
Bo Yang


Corresponding author

Correspondence to Bo Yang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Appendices

Appendix A

The top 10 boundary words.

Id	Boundary words
1	Use
2	Software
3	Package
4	Program
5	Tool
6	Analysis
7	Search
8	Version
9	Script
10	Result
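This excerpt does not say how the boundary words are applied. As a hedged illustration only, the sketch below assumes they are words that tend to occur near software mentions and can therefore flag candidate sentences. The function name `boundary_hits` and the exact-token matching rule are assumptions; inflected forms such as "used" or "results" are deliberately not matched here.

```python
import re
from typing import List

# The ten boundary words from Appendix A, lower-cased.
BOUNDARY_WORDS = [
    "use", "software", "package", "program", "tool",
    "analysis", "search", "version", "script", "result",
]

def boundary_hits(sentence: str) -> List[str]:
    # Split into lower-case alphabetic tokens and report which
    # boundary words occur as whole tokens, in Appendix A order.
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return [w for w in BOUNDARY_WORDS if w in tokens]

print(boundary_hits("We use the BLAST software package, version 2.2."))
print(boundary_hits("Samples were frozen overnight."))
```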

Appendix B

Scores of the top 50 packages (ranked by document frequencies).

Software	DF	CitedS	CitedP
BLAST	3137	60123	19
Primer	2932	52625	18
SAM	1517	37469	25
Cooper	1470	33849	23
SPSS	1268	15654	12
QTL	1219	25041	21
Excel	1195	21659	18
MEGA	1024	17225	17
BWA	837	24933	30
Clustal	817	12479	15
IPA	772	15396	20
GCTA	732	12552	17
DAVID	697	14043	20
SAS	669	11536	17
BAM	638	18503	29
PAM	635	16710	26
Blast2GO	632	12068	19
GraphPad Prism	630	11374	18
Heatmap	621	14299	23
ImageJ	595	11366	19
Python	590	14930	25
Cufflinks	582	16966	29
Genome Browser	572	16188	28
RMA	570	12294	22
BLASTN	549	12392	23
Trinity	548	10796	20
DESeq	527	11053	21
BLASTP	482	9849	20
edgeR	479	9624	20
Picard	474	11944	25
RepeatMasker	458	13067	29
Cytoscape	453	7962	18
dChip	453	13740	30
Java	441	10471	24
ClustalW	419	6426	15
GATK	417	11180	27
Primer3	415	6964	17
R-project	414	10871	26
MCL	413	11046	27
Scaffold	405	11590	29
FastQC	404	7343	18
MUSCLE	398	8986	23
HapMap	385	9661	25
MEME	371	7645	21
SAP	364	6730	18
LightCycler	362	6161	17
Bowtie2	338	6672	20
InterProScan	336	8008	24
IGV	313	7075	23
Velvet	312	9571	31
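
The column definitions are not given in this excerpt. One reading that fits the numbers, offered as an assumption rather than as the authors' definition, is that CitedS is a total citation score and CitedP is that score divided by DF, rounded to the nearest integer. The sketch below checks this reading against a few rows copied from the table above.

```python
# Rows copied from Appendix B: (software, DF, CitedS, CitedP).
rows = [
    ("BLAST", 3137, 60123, 19),
    ("Primer", 2932, 52625, 18),
    ("SAM", 1517, 37469, 25),
    ("Cooper", 1470, 33849, 23),
    ("SPSS", 1268, 15654, 12),
    ("BWA", 837, 24933, 30),
    ("Velvet", 312, 9571, 31),
]

def cited_per_document(df: int, cited_s: int) -> int:
    # Assumed relationship: CitedP = round(CitedS / DF).
    return round(cited_s / df)

# Names of sampled rows where the assumed relationship does not hold.
mismatches = [
    name for name, df, cited_s, cited_p in rows
    if cited_per_document(df, cited_s) != cited_p
]
print("rows that do not fit the assumed relationship:", mismatches)
```

For the sampled rows the list of mismatches is empty, so the reading is consistent with the table, although that does not confirm it is how the authors computed the column.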

About this article

Cite this article

Jiang, L., Kang, X., Huang, S. et al. A refinement strategy for identification of scientific software from bioinformatics publications. Scientometrics 127, 3293–3316 (2022). https://doi.org/10.1007/s11192-022-04381-y

Received: 19 August 2021
Accepted: 11 April 2022
Published: 03 May 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s11192-022-04381-y
