ABSTRACT
Developers build bioinformatics software to automate crucial analysis and research in biological science. However, challenges encountered while developing bioinformatics software can impede advances in biological science research. Through a human-centric systematic analysis, we can identify challenges related to bioinformatics software development and envision future research directions. From our qualitative analysis of 221 Stack Overflow questions, we identify six categories of challenges: file operations, searching genetic entities, defect resolution, configuration management, sequence alignment, and translation of genetic information. To mitigate the identified challenges, we envision three research directions that require synergies between bioinformatics and automated software engineering: (i) automated configuration recommendation using optimization algorithms, (ii) automated and comprehensive defect categorization, and (iii) intelligent task assistance with active and reinforcement learning.