ABSTRACT
Mathematical expressions are needed not only for numerical calculations but also for discussion and documentation, where they help clarify and disambiguate ideas, concepts, and definitions. Software engineering artifacts such as source code, documentation, and bug reports contain mathematical expressions. Our experiments with a commit message generation tool suggest that mathematical expressions present in the input data affect the tool's accuracy. To support future research in this direction, we have constructed and shared a dataset of bug reports in which mathematical expressions are annotated both automatically and manually. We have also shared a tool to identify mathematical expressions. Our dataset contains 2,040,120 bug reports from 10 different projects. We used our tool to annotate all of these bug reports and, in addition, manually annotated 1,000 bug reports from each project. We annotated the dataset with the objective of developing toolkits that identify mathematical expressions in software engineering artifacts. Finally, we make a case for future work that accounts for mathematical expressions when improving software engineering tasks, especially those that use bug reports.
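To make the identification task concrete, the following is a minimal sketch of how mathematical expressions might be flagged in bug report text. It is not the tool described above: the regular expression, the names `MATH_EXPR` and `find_math_expressions`, and the sample sentence are all illustrative assumptions, and the heuristic (operands joined by at least one binary operator) will miss many notations and produce false positives on hyphenated words, dates, and file paths.

```python
import re

# Hypothetical heuristic, not the authors' detector:
# an operand is an identifier or a (possibly decimal) number.
_OPERAND = r"(?:[A-Za-z_]\w*|\d+(?:\.\d+)?)"
# Two-character comparison operators are listed first so they win over "<", ">", "=".
_OPERATOR = r"(?:==|!=|<=|>=|[-+*/^=<>])"
# An expression is an operand followed by one or more "operator operand" pairs.
MATH_EXPR = re.compile(rf"{_OPERAND}(?:\s*{_OPERATOR}\s*{_OPERAND})+")


def find_math_expressions(text: str) -> list[str]:
    """Return the substrings of `text` that match the operand/operator heuristic."""
    return [m.group(0) for m in MATH_EXPR.finditer(text)]


if __name__ == "__main__":
    report = "Result is wrong when x = a + b * 2 and n >= 10."
    print(find_math_expressions(report))
```

On a sentence such as the sample above, the heuristic picks out the two operator-joined spans and ignores the surrounding prose. A manually annotated subset, like the one described in the abstract, is what would allow the precision and recall of such a heuristic to be measured.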
Index Terms
- Mathematical Expressions in Software Engineering Artifacts