DOI: 10.1145/3570991.3571033
short-paper

Mathematical Expressions in Software Engineering Artifacts

Published: 04 January 2023

ABSTRACT

Mathematical expressions are required not only for numerical calculations but also for discussion and documentation. They play a significant role in clarifying and disambiguating ideas, concepts, and definitions. Software engineering artifacts such as source code, documentation, and bug reports contain mathematical expressions. Our experiments with a commit message generation tool suggest that mathematical expressions present in the input data affect the tool’s accuracy. To help future research in this direction, we have constructed and shared a dataset of bug reports in which mathematical expressions are annotated both automatically and manually. We have also shared a tool to identify mathematical expressions. Our dataset contains 2,040,120 bug reports from 10 different projects. We have used our tool to annotate all of these bug reports. From each project, we have also manually annotated 1,000 bug reports. We annotated the dataset with the objective of developing toolkits to identify mathematical expressions in software engineering artifacts. Finally, we make a case for future work that handles mathematical expressions when improving software engineering tasks, especially those that use bug reports.


    • Published in

      CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
      January 2023
      357 pages
      ISBN:9781450397971
      DOI:10.1145/3570991

      Copyright © 2023 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • short-paper
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 197 of 680 submissions, 29%