DOI: 10.1145/3661167.3661234
research-article

How Much Logs Does My Source Code File Need? Learning to Predict the Density of Logs

Published: 18 June 2024

Abstract

Software logging is the practice of recording events that occur within a software system; the resulting logs support a range of analysis activities. However, striking the right balance between logging and system overhead is challenging. Prior work has proposed various machine learning-based solutions to suggest where to insert logging statements. Yet before answering the question “where to log?”, practitioners first need to determine whether a file needs logging in the first place. To that end, this paper presents an empirical study that characterizes log density (i.e., the ratio of log lines to the total lines of code) in seven open-source software projects. We then propose a deep learning-based approach to predict log density from syntactic and semantic features of the source code. We find that the percentage of files with at least one log line ranges from 5% to 33% across the studied projects. Additionally, the median log density in files with at least one log line ranges from 0.95% to 1.85% across the seven projects and can go up to 18%. These findings support the hypothesis that not all source code files require logging. Our log density models achieve an average accuracy of 84%. Our cross-project log density predictions also show promising performance, with an average accuracy of 72%, which represents over 86% (ratio of cross/within) of the corresponding within-project predictions using syntactic features. Our results show that we can accurately predict whether a file needs logging and that such predictions may generalize across projects.
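To make the log density definition concrete, the following is a minimal Python sketch of the ratio of log lines to total lines of code for a single file. It is an illustration only, not the authors' tooling: the regular expression used to recognize a log line, the choice to count only non-blank lines, and the names `LOG_CALL` and `log_density` are all assumptions introduced here.

```python
import re

# Assumption: a "log line" is any line that calls a logger-like object
# (log, logger, logging) at a common log level. The paper's own detection
# rule is not given in the abstract.
LOG_CALL = re.compile(
    r"\b(?:log|logger|logging)\."
    r"(?:trace|debug|info|warn|warning|error|fatal|critical)\s*\(",
    re.IGNORECASE,
)


def log_density(source: str) -> float:
    """Return log lines divided by non-blank lines (0.0 for an empty file)."""
    # Assumption: blank lines are excluded from the line count.
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    log_lines = sum(1 for ln in lines if LOG_CALL.search(ln))
    return log_lines / len(lines)


example = """\
import logging
logger = logging.getLogger(__name__)

def load(path):
    logger.info("loading %s", path)
    with open(path) as fh:
        return fh.read()
"""

# One log line out of six non-blank lines.
print(f"log density: {log_density(example):.2%}")
```

A file with no matching lines yields a density of 0, which corresponds to the "no logging" case that the abstract reports for the majority of files in the studied projects.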


Published In

EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
June 2024
728 pages
ISBN: 9798400717017
DOI: 10.1145/3661167
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Deep Learning
  2. Log Density
  3. Software Logging

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EASE 2024

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%
