Short paper
DOI: 10.1145/3570991.3571033

Mathematical Expressions in Software Engineering Artifacts

Published: 04 January 2023

Abstract

Mathematical expressions are required not only for numerical calculations but also for discussions and documentation. They play a significant role in clarifying and disambiguating ideas, concepts, and definitions. Software engineering artifacts such as source code, documentation, and bug reports contain mathematical expressions. Our experiments with a commit message generation tool suggest that mathematical expressions present in the input data affect the tool's accuracy. To help future research in this direction, we have constructed and shared a dataset of bug reports with both automatically and manually annotated mathematical expressions. We have also shared a tool to identify mathematical expressions. Our dataset contains 2,040,120 bug reports from 10 different projects. We have used our tool to annotate all of these bug reports. From each project, we have also manually annotated 1,000 bug reports. We annotated the dataset with the objective of developing toolkits to identify mathematical expressions in software engineering artifacts. Finally, we make a case for future work that accounts for mathematical expressions when improving software engineering tasks, especially those that use bug reports.

Cited By

  • (2024) TEIMMA: The First Content Reuse Annotator for Text, Images, and Math. Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries, 271–273. https://doi.org/10.1109/JCDL57899.2023.00056. Online publication date: 26-Jun-2024
  • (2023) Requirement Change Prediction Model for Small Software Systems. Computers 12:8 (164). https://doi.org/10.3390/computers12080164. Online publication date: 14-Aug-2023

    Published In

    CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
    January 2023
    357 pages
    ISBN:9781450397971
    DOI:10.1145/3570991

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Bug Reports
    2. Commit Message Generation
    3. Datasets
    4. Defects
    5. Mathematical Expressions
    6. Software Engineering

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    CODS-COMAD 2023

    Acceptance Rates

    Overall Acceptance Rate 197 of 680 submissions, 29%
