Abstract
Error propagation analysis is a consolidated practice to gain insights into error modes and effects that pertain to the activation of faults in software systems. A variety of approaches, such as architecture-based analysis, source code instrumentation and variable tracing, have been proposed to address software error propagation analysis. Although valuable, existing approaches require a substantial degree of knowledge of system internals, visibility and code manipulation, which makes them poorly suited to real-life production environments. This paper proposes an empirical analysis of error propagation. We specifically address the challenges in using fault data and error events in the logs, which are a convenient byproduct of the system’s execution. The approach relies on the construction of error reporting graphs. We apply the approach to 2,042 failure data points from two real-world critical systems from the Air Traffic Control domain, supplied by a top industry provider. The approach helps develop a deep understanding of error modes and propagation paths, which practitioners can leverage to make informed decisions on the placement of error detection mechanisms.
Notes
In this study, we follow the notion that a software fault is a development fault that originates during the coding phase. Faults can be activated by the computation process or by environmental conditions, and cause errors. An error is the part of the total state of the system that may lead to its subsequent service failure. A failure occurs when the delivered service deviates from correct service (Avizienis et al. 2004).
The evaluation version of these systems, testing applications and workloads are provided by the industry partner within the MINIMINDS Project (n. B21C12000710005).
Consistent with software engineering terminology, we use component to mean a software unit encompassing a cohesive subset of the functionality provided by a given system; a subcomponent is a subset of functionality within the component (Lau and Wang 2007).
An assertion checks invariant properties holding in correct executions; an alert is generated if an invariant is violated at runtime (Rosenblum 1995).
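The assertion and alert notion in the note above can be illustrated with a minimal sketch. It is not taken from the studied systems: the `Alert` and `AssertionMonitor` names, the "tracker" component and the altitude invariants are all hypothetical, chosen only to show an invariant check that raises an alert when violated at runtime.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    """Record emitted when an invariant is violated (hypothetical structure)."""
    component: str
    invariant: str


class AssertionMonitor:
    """Collects alerts for invariants that do not hold at runtime."""

    def __init__(self) -> None:
        self.alerts: List[Alert] = []

    def check(self, component: str, invariant: str, holds: bool) -> bool:
        # An alert is generated only when the invariant is violated.
        if not holds:
            self.alerts.append(Alert(component, invariant))
        return holds


monitor = AssertionMonitor()
altitude_ft = -50  # made-up value, used only to trigger a violation

# First invariant is violated, second one holds.
monitor.check("tracker", "altitude_ft >= 0", altitude_ft >= 0)
monitor.check("tracker", "altitude_ft < 60000", altitude_ft < 60000)
```

In this sketch the monitor records the violation instead of aborting execution, so alerts can be collected and inspected afterwards; that is a design choice of the example, not a statement about how the analyzed systems behave.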
References
Abdelmoez W, Nassar DM, Shereshevsky M, Gradetsky N, Gunnalan R, Ammar HH, Yu B, Mili A (2004) Error propagation in software architectures. In: 10th international symposium on software metrics, 2004. Proceedings., pages 384–393. https://doi.org/10.1109/METRIC.2004.1357923
Avizienis A, Laprie J-C, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1(1):11–33. ISSN 1545-5971. https://doi.org/10.1109/TDSC.2004.2
Arora A, Kulkarni SS (1998) Detectors and correctors: a theory of fault-tolerance components. In: Proceedings. 18th international conference on distributed computing systems (Cat. No.98CB36183), pp 436–443. https://doi.org/10.1109/ICDCS.1998.679772
Bondy JA, Murty USR, et al. (1976) Graph theory with applications, vol 290. Citeseer
Calhoun J, Snir M, Olson LN, Gropp WD (2017) Towards a more complete understanding of SDC propagation. In: Proceedings of the 26th international symposium on high-performance parallel and distributed computing, HPDC ’17. ISBN 978-1-4503-4699-3. ACM, New York, pp 131–142. https://doi.org/10.1145/3078597.3078617
Cinque M, Cotroneo D, Pecchia A (2013) Event logs for the analysis of software failures: a rule-based approach. IEEE Trans Softw Eng 39(6):806–821. ISSN 0098-5589. https://doi.org/10.1109/TSE.2012.67
Cinque M, Cotroneo D, Della Corte R, Pecchia A (2016) Characterizing direct monitoring techniques in software systems. IEEE Transactions on Reliability 65 (4):1665–1681. ISSN 0018-9529. https://doi.org/10.1109/TR.2016.2570564
Chan A, Winter S, Saissi H, Pattabiraman K, Suri N (2017) IPA: Error propagation analysis of multi-threaded programs using likely invariants. In: IEEE international conference on software testing, verification and validation (ICST), pp 184–195. https://doi.org/10.1109/ICST.2017.24
Chillarege R, Bhandari IS, Chaar JK, Halliday MJ, Moebus DS, Ray BK, Wong M-Y (1992) Orthogonal defect classification-a concept for in-process measurements. IEEE Trans Softw Eng 18:943–956. ISSN 0098-5589. http://doi.ieeecomputersociety.org/10.1109/32.177364
Chuah E, Jhumka A, Browne JC, Barth B, Narasimhamurthy S (2015) Insights into the diagnosis of system failures from cluster message logs. In: 11th European dependable computing conference (EDCC), pp 225–232. https://doi.org/10.1109/EDCC.2015.19
Cortellessa V, Grassi V (2007) Component-based software engineering: 10th International Symposium, CBSE 2007, Medford, MA, USA, July 9-11, 2007. Proceedings, chapter A Modeling Approach to Analyze the Impact of Error Propagation on Reliability of Component-Based Systems, pages 140–156. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-540-73551-9. https://doi.org/10.1007/978-3-540-73551-9_10
Duraes JA, Madeira HS (2006) Emulation of software faults: a field data study and a practical approach. IEEE Trans Softw Eng 32(11):849–867. ISSN 0098-5589. https://doi.org/10.1109/TSE.2006.113
Filieri A, Ghezzi C, Grassi V, Mirandola R (2010) Reliability analysis of component-based systems with multiple failure modes. In: Grunske L, Reussner R, Plasil F (eds) Component-Based Software Engineering. Springer, Berlin, pp 1–20
Hiller M, Jhumka A, Suri N (2002a) Propane: An environment for examining the propagation of errors in software. In: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA, pp 81–85, New York, NY, USA. ACM. ISBN 1-58113-562-9. https://doi.org/10.1145/566172.566184
Hiller M, Jhumka A, Suri N (2002b) On the placement of software mechanisms for detection of data errors. In: IEEE international conference on dependable systems and networks, pp 135–144. https://doi.org/10.1109/DSN.2002.1028894
Hiller M, Jhumka A, Suri N (2004) Epic: profiling the propagation and effect of data errors in software. IEEE Transactions on Computers 53(5):512–530. ISSN 0018-9340. https://doi.org/10.1109/TC.2004.1275294
Hsueh MC, Tsai TK, Iyer R (1997) Fault injection techniques and tools. IEEE Computer 30(4):75–82
Jhumka A, Hiller M, Suri N (2001) Assessing inter-modular error propagation in distributed software. In: 20th IEEE symposium on reliable distributed systems, 2001. Proceedings, pp 152–161. https://doi.org/10.1109/RELDIS.2001.969769
Jhumka A, Leeke M (2011) The early identification of detector locations in dependable software. In: IEEE 22nd international symposium on software reliability engineering, pp 40–49. https://doi.org/10.1109/ISSRE.2011.34
Johansson A, Suri N (2005) Error propagation profiling of operating systems. In: International conference on dependable systems and networks, 2005. DSN 2005. Proceedings. pp 86–95. https://doi.org/10.1109/DSN.2005.45
Kabinna S, Bezemer C-P, Shang W, Syer MD, Hassan AE (2018) Examining the stability of logging statements. Empirical Software Engineering 23(1):290–333. ISSN 1573-7616. https://doi.org/10.1007/s10664-017-9518-0
Kalyanakrishnam M, Kalbarczyk Z, Iyer R (1999) Failure data analysis of a LAN of windows NT based computers. In: Proceedings of the international symposium on reliable distributed systems (SRDS). IEEE Computer Society, pp 178–187
Khoshgoftaar TM, Allen EB, Tang WH, Michael CC, Voas JM (1999) Identifying modules which do not propagate errors. In: IEEE symposium on application-specific systems and software engineering and technology (ASSET), pp 185–193. https://doi.org/10.1109/ASSET.1999.756768
Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’04. ISBN 0-7695-2102-9. IEEE Computer Society, Washington, pp 75–. http://dl.acm.org/citation.cfm?id=977395.977673
Lau KK, Wang Z (2007) Software component models. IEEE Trans Softw Eng 33(10):709–724. ISSN 0098-5589. https://doi.org/10.1109/TSE.2007.70726
Leeke M, Jhumka A (2010) Towards understanding the importance of variables in dependable software. In: 2010 European Dependable Computing Conference (EDCC), pp 85–94. https://doi.org/10.1109/EDCC.2010.20
Li H, Chen T-H(Peter), Shang W, Hassan AE (2018) Studying software logging using topic models. Empirical Software Engineering 23(5):2655–2694. ISSN 1573-7616. https://doi.org/10.1007/s10664-018-9595-8
Lyu MR, et al. (1996) Handbook of software reliability engineering, vol 222. IEEE Computer Society Press, CA
Michael CC, Jones RC (1997) On the uniformity of error propagation in software. In: Proceedings of the 12th annual conference on computer assurance, 1997. COMPASS ’97. Are we making progress towards computer assurance?, pp 68–76. https://doi.org/10.1109/CMPASS.1997.613237
Makanju A, Zincir-Heywood AN, Milios EE (2012) A lightweight algorithm for message type extraction in system application logs. IEEE Trans Knowledge Data Eng 24(11):1921–1936. ISSN 1041-4347. https://doi.org/10.1109/TKDE.2011.138
Pattabiraman K, Saggese GP, Chen D, Kalbarczyk Z, Iyer R (2011) Automated derivation of application-specific error detectors using dynamic analysis. IEEE Transactions on Dependable and Secure Computing 8(5):640–655. ISSN 1545-5971. https://doi.org/10.1109/TDSC.2010.19
Popic P, Desovski D, Abdelmoez W, Cukic B (2005) Error propagation in the reliability analysis of component based systems. In: 16th IEEE international symposium on software reliability engineering, 2005. ISSRE 2005, pp 10–62. https://doi.org/10.1109/ISSRE.2005.18
Rosenblum DS (1995) A practical approach to programming with assertions. IEEE Trans Softw Eng, p 21
Russo B, Succi G, Pedrycz W (2015) Mining system logs to learn error predictors: a case study of a telemetry system. Empirical Software Engineering 20(4):879–927. ISSN 1573-7616. https://doi.org/10.1007/s10664-014-9303-2
Tian J, Rudraraju S, Li Z (2004) Evaluating web software reliability based on workload and failure data extracted from server logs. IEEE Trans Soft Eng 30 (11):754–769. ISSN 0098-5589. https://doi.org/10.1109/TSE.2004.87
Tucek J, Lu S, Huang C, Xanthos S, Zhou Y (2007) Triage: Diagnosing production run failures at the user’s site. In: Proceedings of Twenty-first ACM SIGOPS symposium on operating systems principles, SOSP ’07, pp 131–144, New York, NY, USA. ACM. ISBN 978-1-59593-591-5. https://doi.org/10.1145/1294261.1294275
Voas J (1997) Error propagation analysis for cots systems. Computing Control Engineering Journal 8(6):269–272. ISSN 0956-3385. https://doi.org/10.1049/cce:19970607
Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell. ISBN 0-7923-8682-5
Yuan D, Mai H, Xiong W, Tan L, Zhou Y, Pasupathy S (2010) Sherlog: Error diagnosis by connecting clues from run-time logs. SIGARCH Comput Archit News 38(1):143–154. ISSN 0163-5964. https://doi.org/10.1145/1735970.1736038
Zheng Z, Lyu MR (2010) Collaborative reliability prediction of service-oriented systems. In: 2010 ACM/IEEE 32nd international conference on software engineering, vol 1, pp 35–44. https://doi.org/10.1145/1806799.1806809
Acknowledgements
This work has been partially supported by the Italian Ministry of Education, University and Research under the MINIMINDS PON Project (n. B21C12000710005) and by Programme STAR, funded by UniNA and Compagnia di San Paolo under project “Towards Cognitive Security Information and Event Management” (COSIEM).
Additional information
Communicated by: Natalia Juristo
About this article
Cite this article
Cinque, M., Della Corte, R. & Pecchia, A. An empirical analysis of error propagation in critical software systems. Empir Software Eng 25, 2450–2484 (2020). https://doi.org/10.1007/s10664-020-09801-2
DOI: https://doi.org/10.1007/s10664-020-09801-2