An empirical analysis of error propagation in critical software systems

Empirical Software Engineering

Abstract

Error propagation analysis is a consolidated practice for gaining insight into the error modes and effects that follow the activation of faults in software systems. A variety of approaches, such as architecture-based analysis, source code instrumentation and variable tracing, have been proposed to address software error propagation analysis. Although valuable, existing approaches require a substantial degree of knowledge of system internals, visibility and code manipulation, which makes them ill-suited to real-life production environments. This paper proposes an empirical analysis of error propagation. We specifically address the challenges in using fault data and error events in the logs, which are a convenient byproduct of the system’s execution. The approach puts forth the construction of error reporting graphs. We apply the approach to 2,042 failure data points from two real-world critical systems in the Air Traffic Control domain from a top industry provider. The approach helps develop a deep understanding of error modes and propagation paths, which practitioners can leverage to make informed decisions on the placement of error detection mechanisms.
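To make the idea of a graph built from logged error events more concrete, the following is a minimal sketch and not the authors' construction. It assumes a hypothetical log schema (a timestamp and a reporting component per error event) and a hypothetical edge rule: within one failure run, an edge from component a to component b is counted whenever b reports an error immediately after a in time order. The names `ErrorEvent`, `build_error_reporting_graph` and `merge_runs` are illustrative only.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ErrorEvent(NamedTuple):
    """One error entry parsed from a log (hypothetical schema)."""
    timestamp: float  # seconds since the start of the run
    component: str    # component that reported the error


def build_error_reporting_graph(events: Iterable[ErrorEvent]) -> Counter:
    """Count directed edges between consecutively reporting components.

    Assumption: an edge (a, b) is recorded whenever component b reports
    an error right after component a in time order within one run.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    edges: Counter = Counter()
    for prev, curr in zip(ordered, ordered[1:]):
        if prev.component != curr.component:
            edges[(prev.component, curr.component)] += 1
    return edges


def merge_runs(runs: Iterable[Iterable[ErrorEvent]]) -> Counter:
    """Sum the edge counts obtained from several failure runs."""
    total: Counter = Counter()
    for run in runs:
        total.update(build_error_reporting_graph(run))
    return total


# Two made-up failure runs with made-up component names.
runs = [
    [ErrorEvent(0.1, "A"), ErrorEvent(0.4, "B"), ErrorEvent(0.9, "C")],
    [ErrorEvent(0.3, "B"), ErrorEvent(0.2, "A")],
]
graph = merge_runs(runs)
```

Under these assumptions, the edge counts in `graph` indicate which reporting sequences recur across failures, which is the kind of information one might consult when deciding where to place error detection mechanisms.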

Notes

  1. In this study, we follow the notion that a software fault is a development fault originated during the coding phase. Faults can be activated by the computation process or environmental conditions and cause errors. An error is the part of the total state of the system that may lead to its subsequent service failure. A failure occurs when the delivered service deviates from correct service (Avizienis et al. 2004).

  2. The evaluation version of these systems, testing applications and workloads are provided by the industry partner within the MINIMINDS Project (n. B21C12000710005).

  3. Consistently with the software engineering terminology, we mean by component a software unit encompassing a cohesive subset of functionality provided by a given system; a subcomponent is a subset of functionality within the component (Lau and Wang 2007).

  4. https://www.dds-foundation.org/what-is-dds-3/

  5. An assertion checks invariant properties holding in correct executions; an alert is generated if an invariant is violated at runtime (Rosenblum 1995).

  6. https://bz.apache.org/bugzilla/show_bug.cgi?id=54711#attach_30061
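The notion of assertion in note 5 can be illustrated with a short sketch. It is only an example under stated assumptions: the logger name, the helper `check_invariant` and the altitude invariant are hypothetical and are not taken from the systems studied in the article.

```python
import logging

logger = logging.getLogger("invariant_monitor")  # hypothetical logger name


def check_invariant(condition: bool, description: str) -> bool:
    """Log an alert when an invariant does not hold; return the condition."""
    if not condition:
        logger.error("ALERT invariant violated: %s", description)
    return condition


def update_altitude(current_ft: float, delta_ft: float) -> float:
    """Apply a change in altitude and check a made-up invariant."""
    new_altitude = current_ft + delta_ft
    # Hypothetical invariant: altitude is never negative in a correct run.
    check_invariant(new_altitude >= 0.0, "altitude must be non-negative")
    return new_altitude
```

The check does not stop execution; it only records the violation, so the alert becomes one more event available in the log.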

References

  • Abdelmoez W, Nassar DM, Shereshevsky M, Gradetsky N, Gunnalan R, Ammar HH, Yu B, Mili A (2004) Error propagation in software architectures. In: 10th international symposium on software metrics, 2004. Proceedings, pp 384–393. https://doi.org/10.1109/METRIC.2004.1357923

  • Avizienis A, Laprie J-C, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1(1):11–33. ISSN 1545-5971. https://doi.org/10.1109/TDSC.2004.2

  • Arora A, Kulkarni SS (1998) Detectors and correctors: a theory of fault-tolerance components. In: Proceedings. 18th international conference on distributed computing systems (Cat. No.98CB36183), pp 436–443. https://doi.org/10.1109/ICDCS.1998.679772

  • Bondy JA, Murty USR, et al. (1976) Graph theory with applications, vol 290. Citeseer

  • Calhoun J, Snir M, Olson LN, Gropp WD (2017) Towards a more complete understanding of SDC propagation. In: Proceedings of the 26th international symposium on high-performance parallel and distributed computing, HPDC ’17. ISBN 978-1-4503-4699-3. ACM, New York, pp 131–142. https://doi.org/10.1145/3078597.3078617

  • Cinque M, Cotroneo D, Pecchia A (2013) Event logs for the analysis of software failures: a rule-based approach. IEEE Trans Softw Eng 39(6):806–821. ISSN 0098-5589. https://doi.org/10.1109/TSE.2012.67

  • Cinque M, Cotroneo D, Della Corte R, Pecchia A (2016) Characterizing direct monitoring techniques in software systems. IEEE Transactions on Reliability 65(4):1665–1681. ISSN 0018-9529. https://doi.org/10.1109/TR.2016.2570564

  • Chan A, Winter S, Saissi H, Pattabiraman K, Suri N (2017) IPA: Error propagation analysis of multi-threaded programs using likely invariants. In: IEEE international conference on software testing, verification and validation (ICST), pp 184–195. https://doi.org/10.1109/ICST.2017.24

  • Chillarege R, Bhandari IS, Chaar JK, Halliday MJ, Moebus DS, Ray BK, Wong M-Y (1992) Orthogonal defect classification-a concept for in-process measurements. IEEE Trans Softw Eng 18:943–956. ISSN 0098-5589. http://doi.ieeecomputersociety.org/10.1109/32.177364

  • Chuah E, Jhumka A, Browne JC, Barth B, Narasimhamurthy S (2015) Insights into the diagnosis of system failures from cluster message logs. In: 11th European dependable computing conference (EDCC), pp 225–232. https://doi.org/10.1109/EDCC.2015.19

  • Cortellessa V, Grassi V (2007) A modeling approach to analyze the impact of error propagation on reliability of component-based systems. In: Component-based software engineering: 10th international symposium, CBSE 2007, Medford, MA, USA, July 9-11, 2007. Proceedings. Springer, Berlin, Heidelberg, pp 140–156. ISBN 978-3-540-73551-9. https://doi.org/10.1007/978-3-540-73551-9_10

  • Duraes JA, Madeira HS (2006) Emulation of software faults: a field data study and a practical approach. IEEE Trans Softw Eng 32(11):849–867. ISSN 0098-5589. https://doi.org/10.1109/TSE.2006.113

  • Filieri A, Ghezzi C, Grassi V, Mirandola R (2010) Reliability analysis of component-based systems with multiple failure modes. In: Grunske L, Reussner R, Plasil F (eds) Component-Based Software Engineering. Springer, Berlin, pp 1–20

  • Hiller M, Jhumka A, Suri N (2002a) Propane: An environment for examining the propagation of errors in software. In: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA, pp 81–85, New York, NY, USA. ACM. ISBN 1-58113-562-9. https://doi.org/10.1145/566172.566184

  • Hiller M, Jhumka A, Suri N (2002b) On the placement of software mechanisms for detection of data errors. In: IEEE international conference on dependable systems and networks, pp 135–144. https://doi.org/10.1109/DSN.2002.1028894

  • Hiller M, Jhumka A, Suri N (2004) Epic: profiling the propagation and effect of data errors in software. IEEE Transactions on Computers 53(5):512–530. ISSN 0018-9340. https://doi.org/10.1109/TC.2004.1275294

  • Hsueh MC, Tsai TK, Iyer R (1997) Fault injection techniques and tools. IEEE Computer 30(4):75–82

  • Jhumka A, Hiller M, Suri N (2001) Assessing inter-modular error propagation in distributed software. In: 20th IEEE symposium on reliable distributed systems, 2001. Proceedings, pp 152–161. https://doi.org/10.1109/RELDIS.2001.969769

  • Jhumka A, Leeke M (2011) The early identification of detector locations in dependable software. In: IEEE 22nd international symposium on software reliability engineering, pp 40–49. https://doi.org/10.1109/ISSRE.2011.34

  • Johansson A, Suri N (2005) Error propagation profiling of operating systems. In: International conference on dependable systems and networks, 2005. DSN 2005. Proceedings. pp 86–95. https://doi.org/10.1109/DSN.2005.45

  • Kabinna S, Bezemer C-P, Shang W, Syer MD, Hassan AE (2018) Examining the stability of logging statements. Empirical Software Engineering 23(1):290–333. ISSN 1573-7616. https://doi.org/10.1007/s10664-017-9518-0

  • Kalyanakrishnam M, Kalbarczyk Z, Iyer R (1999) Failure data analysis of a LAN of windows NT based computers. In: Proceedings of the international symposium on reliable distributed systems (SRDS). IEEE Computer Society, pp 178–187

  • Khoshgoftaar TM, Allen EB, Tang WH, Michael CC, Voas JM (1999) Identifying modules which do not propagate errors. In: IEEE symposium on application-specific systems and software engineering and technology (ASSET), pp 185–193. https://doi.org/10.1109/ASSET.1999.756768

  • Lattner C, Adve V (2004) LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’04. ISBN 0-7695-2102-9. IEEE Computer Society, Washington, pp 75–. http://dl.acm.org/citation.cfm?id=977395.977673

  • Lau KK, Wang Z (2007) Software component models. IEEE Trans Softw Eng 33(10):709–724. ISSN 0098-5589. https://doi.org/10.1109/TSE.2007.70726

  • Leeke M, Jhumka A (2010) Towards understanding the importance of variables in dependable software. In: 2010 European Dependable Computing Conference (EDCC), pp 85–94. https://doi.org/10.1109/EDCC.2010.20

  • Li H, Chen T-H(Peter), Shang W, Hassan AE (2018) Studying software logging using topic models. Empirical Software Engineering 23(5):2655–2694. ISSN 1573-7616. https://doi.org/10.1007/s10664-018-9595-8

  • Lyu MR, et al. (1996) Handbook of software reliability engineering, vol 222. IEEE Computer Society Press, CA

  • Michael CC, Jones RC (1997) On the uniformity of error propagation in software. In: Proceedings of the 12th annual conference on computer assurance, 1997. COMPASS ’97. Are we making progress towards computer assurance?, pp 68–76. https://doi.org/10.1109/CMPASS.1997.613237

  • Makanju A, Zincir-Heywood AN, Milios EE (2012) A lightweight algorithm for message type extraction in system application logs. IEEE Trans Knowledge Data Eng 24(11):1921–1936. ISSN 1041-4347. https://doi.org/10.1109/TKDE.2011.138

  • Pattabiraman K, Saggese GP, Chen D, Kalbarczyk Z, Iyer R (2011) Automated derivation of application-specific error detectors using dynamic analysis. IEEE Transactions on Dependable and Secure Computing 8(5):640–655. ISSN 1545-5971. https://doi.org/10.1109/TDSC.2010.19

  • Popic P, Desovski D, Abdelmoez W, Cukic B (2005) Error propagation in the reliability analysis of component based systems. In: 16th IEEE international symposium on software reliability engineering, 2005. ISSRE 2005, pp 10–62. https://doi.org/10.1109/ISSRE.2005.18

  • Rosenblum DS (1995) A practical approach to programming with assertions. IEEE Trans Softw Eng 21(1):19–31

  • Russo B, Succi G, Pedrycz W (2015) Mining system logs to learn error predictors: a case study of a telemetry system. Empirical Software Engineering 20(4):879–927. ISSN 1573-7616. https://doi.org/10.1007/s10664-014-9303-2

  • Tian J, Rudraraju S, Li Z (2004) Evaluating web software reliability based on workload and failure data extracted from server logs. IEEE Trans Softw Eng 30(11):754–769. ISSN 0098-5589. https://doi.org/10.1109/TSE.2004.87

  • Tucek J, Lu S, Huang C, Xanthos S, Zhou Y (2007) Triage: Diagnosing production run failures at the user’s site. In: Proceedings of Twenty-first ACM SIGOPS symposium on operating systems principles, SOSP ’07, pp 131–144, New York, NY, USA. ACM. ISBN 978-1-59593-591-5. https://doi.org/10.1145/1294261.1294275

  • Voas J (1997) Error propagation analysis for cots systems. Computing Control Engineering Journal 8(6):269–272. ISSN 0956-3385. https://doi.org/10.1049/cce:19970607

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell. ISBN 0-7923-8682-5

  • Yuan D, Mai H, Xiong W, Tan L, Zhou Y, Pasupathy S (2010) Sherlog: Error diagnosis by connecting clues from run-time logs. SIGARCH Comput Archit News 38(1):143–154. ISSN 0163-5964. https://doi.org/10.1145/1735970.1736038

  • Zheng Z, Lyu MR (2010) Collaborative reliability prediction of service-oriented systems. In: 2010 ACM/IEEE 32nd international conference on software engineering, vol 1, pp 35–44. https://doi.org/10.1145/1806799.1806809

Acknowledgements

This work has been partially supported by the Italian Ministry of Education, University and Research under the MINIMINDS PON Project (n. B21C12000710005) and by Programme STAR, funded by UniNA and Compagnia di San Paolo under project “Towards Cognitive Security Information and Event Management” (COSIEM).

Author information

Corresponding author

Correspondence to Raffaele Della Corte.

Additional information

Communicated by: Natalia Juristo

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cinque, M., Della Corte, R. & Pecchia, A. An empirical analysis of error propagation in critical software systems. Empir Software Eng 25, 2450–2484 (2020). https://doi.org/10.1007/s10664-020-09801-2

  • DOI: https://doi.org/10.1007/s10664-020-09801-2
